Data mining is the process of abstracting patterns from data so as to learn something interesting or to make nontrivial predictions on unseen data (Frank, 2005). Two learning paradigms are employed in data mining systems: supervised and unsupervised learning. The former relies on data that carries some metadata, such as class labels, while the latter is concerned with data for which no such labels are available.
Unsupervised learning employs clustering to abstract patterns from data. Clusters are groups of similar data points, and clustering algorithms are well-defined procedures for finding them. They generally identify similar data points, assign each point to a cluster and generate a label for each cluster.
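To make the idea of grouping similar points concrete, the following is a minimal sketch of one well-known clustering algorithm, k-means. The text above does not name k-means; it is used here purely as an illustration, and the function name `k_means`, the sample points and the starting centroids are all assumptions made for the example.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Cluster 2-D points around the given starting centroids (illustrative only)."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of the points assigned to it.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, centroids=[points[0], points[3]])
print(labels)
```

Here the cluster "labels" are only integer indices. Producing human-readable cluster labels is a separate problem.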
Data mining problems fall into two broad kinds. Traditionally, learning schemes were applied to data of a numerical or structured nature. However, in the wake of the Internet there has been a shift toward unstructured or textual data, which has in turn led to the emergence of text mining as a sub-type of data mining. Mining text-based datasets requires additional consideration due to their apparent randomness and embedded semantic meanings.
Text mining requires that the acquired unstructured data undergo pre-processing so as to produce “some kind of semi-structured representation” for use in the discovery phase (Baeza-Yates, 2001). Generally, pre-processing involves transforming text-based datasets (corpora) into matrices of documents, terms and document-term weightings (Weiss, 2005). In recent years, a lot of research has centred on optimising these operations. A popular approach to text mining that employs this type of scheme is latent semantic analysis (“LSA”).
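The pre-processing described above can be sketched roughly as follows. This is a minimal illustration, not the procedure of any cited author: the tokeniser, the tiny stop-word list, the sample corpus and the choice of tf-idf as the document-term weighting are all assumptions made for the example.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "the", "of", "and", "is", "in", "to"}  # illustrative list only

def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def document_term_matrix(corpus):
    """Return (terms, counts), where counts[d][t] is the frequency of terms[t] in document d."""
    tokenized = [Counter(tokenize(doc)) for doc in corpus]
    terms = sorted(set().union(*tokenized))  # the dictionary of terms
    counts = [[doc[t] for t in terms] for doc in tokenized]
    return terms, counts

def tf_idf(counts):
    """Weight each raw count by log(N / document frequency), one common weighting choice."""
    n_docs = len(counts)
    # Document frequency: the number of documents in which each term occurs.
    df = [sum(1 for v in column if v > 0) for column in zip(*counts)]
    return [[tf * math.log(n_docs / df[t]) for t, tf in enumerate(row)] for row in counts]

# Hypothetical three-document corpus.
corpus = [
    "Text mining finds patterns in text.",
    "Clustering groups similar documents.",
    "Text clustering groups documents by topic.",
]
terms, counts = document_term_matrix(corpus)
weights = tf_idf(counts)
print(terms)
print(counts[0])
```

Each row of `counts` is the semi-structured representation of one document. `weights` holds the same matrix after weighting, so that terms occurring in fewer documents count for more.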
LSA is a text mining method for “inducing and representing aspects of the meaning of words and passages reflected in their usage” (Schreiner, 1997). LSA converts a corpus into a term-document matrix in which each cell holds the frequency of a term in a given document. Central to the formation of this matrix is the need to first identify a dictionary of relevant terms from the collection of documents (Hampp, 1999). This is the “bag of words” method, which takes into account neither word order nor word context within documents (Schreiner, 1997). Landauer et al. found LSA to perform just as well as human judges in classification (ibid.).
Recent work in text mining has moved toward schemes that use “more complete representations than just key words ... [as there is a belief] that these new representations will expand the kinds of discovered knowledge” (Baeza-Yates, 2001). Karray et al. proposed a concept-based mining model that captures the semantics of text through analysis of both the sentence and the document (Kamel, 2006). They demonstrated that their method enhanced the automatic clustering of documents (ibid.). Osinski and Weiss developed the Lingo algorithm in order to improve the quality of cluster labels. They employed a description-comes-first approach to clustering in which “good, conceptually varied cluster labels” are found prior to assigning “documents to the labels to form groups” (Weiss, 2005). The labels are deduced through key phrase extraction. They found key phrases to be frequent phrases and performed extraction through a version of SHOC’s phrase extraction algorithm (ibid.).
In this paper, I intend to improve the automated extraction of key phrases from corpora by using identified semantically important words, for the purpose of forming better semi-structured data for clustering.
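The notion of key phrases as frequent phrases, mentioned above, can be illustrated with a simple n-gram count. This sketch is a deliberately simplified stand-in and is neither SHOC's phrase extraction algorithm nor the Lingo implementation. The function name `frequent_phrases`, its parameters `max_len` and `min_count`, and the sample documents are assumptions introduced for illustration.

```python
import re
from collections import Counter

def frequent_phrases(corpus, max_len=3, min_count=2):
    """Return word n-grams (2..max_len words) seen at least min_count times in the corpus."""
    counts = Counter()
    for doc in corpus:
        # Split on sentence punctuation so that phrases never cross a sentence boundary.
        for sentence in re.split(r"[.!?;]", doc.lower()):
            words = re.findall(r"[a-z]+", sentence)
            for n in range(2, max_len + 1):
                for i in range(len(words) - n + 1):
                    counts[" ".join(words[i:i + n])] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}

# Hypothetical two-document corpus.
docs = [
    "Data mining finds patterns. Text mining is a kind of data mining.",
    "Text mining needs clean input.",
]
phrases = frequent_phrases(docs)
print(phrases)
```

Unlike the bag-of-words representation, counting contiguous word sequences preserves local word order, which is what allows a phrase such as "text mining" to be recovered as a single unit.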