Clustering algorithms for web applications

Monday, October 29, 2007

Treatise Summary

The following image provides a summary of the work I undertook for my undergraduate engineering treatise. It attempts to answer and reconcile the following questions:

1. What I did?

2. How I did it?

3. Why I did it?

4. What were the outcomes of clustering?

5. What were my conclusions?

Sunday, October 14, 2007

Clustering Experiments

There are three principle clustering experiments:

1. 750 Documents: this experiment takes the concepts that have 50 or more documents. Only 15 groups satisfy this condition. I take 50 documents from each of these groups. Thus the 750 documents;

2. 9032 Documents: this is the complete corpus, as communicated on the post on 29 September. This corpus has 65 groups; and

3. 2894 Documents: this is the entire corpus less the two most commonly occuring groups - acq and earn. These account for 6138 documents, thus 2894 remaining and 63 groups.

Sunday, October 7, 2007

Custom Lingo Document Clustering Application

The developed custom lingo document clustering application is a hybrid of the Carrot2 and KEA text mining applications. The system uses the information retrieval module of KEA, which allows for the efficient collection of text files from a local directory. The system then largely resembles a Carrot2 lingo-type application. Thus it uses the Porter Stemmer for stemming. It uses an adjusted stop word list, which encompasses all unique stop words used by KEA and Carrot2. This list has 577 unique stop words.

The outputs of this application are document cluster assignments, document assignment scores and cluster descriptions. It is necessary to create a module to collect all relevant information for clustering interpretation and evaluation. The custom output module appends human-defined labels and document descriptive data to the lingo clustering information. This is then written to an output text file in which each line represents a document, and document data is separated by commas. This schema allows for automated importing into excel, as it is consistent with the comma separated value (“CSV”) format. Excel can then be used to analyse this data.

Using the KEA API, Carrot2 API and source code to resource a lot of the peripheral text mining services allows for a stable environment through which to test the effect of changes to the assumed key phrase on the outcomes of clustering.

Saturday, September 29, 2007

Removal of Fuzzy Documents

Fuzzy Documents refers to the documents from the Reuters-21450 dataset which were assigned more than one label. Since I am undertaking non-fuzzy clustering experiments it is necessary to remove these types of documents from the dataset.

What am I left with?

The above table illustrates the nature of this corpus transformation. The emerging corpus has 9032 documents which pertain to 65 unique clusters.

How are these clusters distributed?

In the above chart, only clusters with greater than 50 documents are included. That is, clusters whose documents represent greater than 0.6% of the corpus. These are the 'top 15' documents, and in total they account for 91.4% of all documents. The rest are allocated to other: 774 documents. The 'top two' clusters represent 68% of all documents, and thus this dataset is not normally distributed.

Tuesday, September 18, 2007

Text Mining Application

There are a number of well-known java-based open-source text mining application APIs inherent within the text mining research and communities. Of perhaps greatest importance are KEA (developed by the same engineering group as WEKA) and Carrot2 (discussed previously in blogs dated September 2, July 23 and indirectly through discussions of its clustering algorithm lingo).

KEA versus Carrot2

KEA stands for Keyphrase Extraction Algorithm and, as its name suggests, is used to extract keyphrases and words from text. It is a verbose system which allows for both controlled indexing and free indexing. The forma is concerned with extracting keyphrases from documents with reference to a controlled vocabulary providing for a more concept-based approach to extraction whilst the latter does not consider different term meaning e.g. lap-top will not be considered the same as note-book. These are natural language processing anomolies. KEA does not undertake clustering, but instead employs a supervised learning approach and builds a Baive Bayes learning model through a set of training documents with known keyphrases. For the purposes of training, phrase "term-frequency-inverse-document-frequency" (TF-IDF) weight as well as position within the document are considered. (El-Beltagy, 2006)

Carrot2 has been introduced previously. This scheme is very different to KEA, as it takes-on an unsupervised approach to text mining and phrase extraction. In terms of implementation however, there are many similarities in the initial stages of text mining. Consider the following:

The above diagram illustrates a summary of the information retrieval and document preprocessing implementation of text mining employed by both KEA and Carrot2 (however, with Carrot2 specifically geared for web search results). As eluded to on the post on September 2, Carrot2 explicitly uses Apache Lucene 2.2 (adjusted for web search resutls) to facilitate the above processes. Each broadly undertakes each of -> points depicted in the central box, except for the marked [] which relates to the KEA controlled indexing approach to text mining. The other marking within this box * ear-marks variation between the two text mining schemes and well as signing options within the individual schemes. Each is now considered:

1. Stemming: there are a number of implicit options available to both schemes with regards to stemming-

a. Lovins Stemmer: the most aggressive stemmer, with some 294 nested rules;

b. Porter Stemmer: this more subdued approach has some 37 rules;

c. Partial Porter Stemmer: the Porter Stemmer has 5 multi-part stages which allow the "miner" to be more or be less conservative in their stemming process. The first stage Porter stemmer is a popular methodology, handling basic plurals e.g. horses becomes horse, processes becomes process, men does not become man; and

d. No Stemmer.

2. Stop words: KEA defines some 499 stop-words, Carrot2 some 324 stop-words (adjusted lucene), lucene only some 33 and the Brown Corpus some 425 (Sharma et Raman, 2003).

3. Vocabularies: This is only relevant to KEA. The "miner" is able to dicate a dictionary, thesaurus or list of terms when undertaking controlled indexing. Also, in terms of implementation, these can be in either text form or resourse description format ("rdf"). A number of popular vocabularies are in circulation, including the integrated public servicevocabulary ("ipsv") and the agrovoc vocoabulary.

How do these options affect clustering?

There are essentially horses for courses. That is, a particular data set or text mining experiment may require more aggressive or more conservative preprocessing.

How do these systems relate to my thesis?

I am now working on the specific implementation of clustering: LSA/SVD & description comes first - replacing KEA's NaiveBayes classifier with Carrot2's lingo algorithm. These frameworks are both java based and open-source.

KEA provides a very suitable fit with my experiments, as it is not based on an index of Internet search results but instead a directory of text files. Carrot2's lingo is an unsupervised approach to learning and hence allows for clustering. A potential shortfall of this approach is that lingo uses LSA/SVD and hence forms a term-document matrix- thus unsuitable for scalability considering the size of my proposed Reuters-21450 dataset (see also post relating to pre-preprocssing of this dataset). I have a number of papers which focus on such constraints (i.e. using smaller term dictionaries, more aggressive stemming [as eluded to above], retricting the number of identified phrases - either by increasing minimum phrase length or increasing maximum phrase length, etc). This leads to, adapting Carrot2 for the purposes of larger datasets.

What are the limitations of such an approach?

Initial complexities associated with such an approach:

1. The purpose of Lingo2 is to create more understandable cluster labels, thus a tension exists between label quality (as well as associated cluster quality in description comes first) and scalability;

2. Even with very aggressive stemming, scalability may still be an issue -that is, the size of the dataset too large for such experiments. I could therefore reduce the number of concepts and associated documents.

Alternatively, I could adopt a different dataset. In so much I could undertake similar experiments to those reported on the KEA website. In these experiments, the group use the controlled agrovac vocabulary to extract key phrases from agricultural documents. I could extract phrases using the unsupervised KEA algorithm and compare the keyphrases as extracted using the adapted carrot2 scheme. That is, using the same extraction and preprocessing techniques and allowing for different clustering and resultant keyphrase extraction outcomes; and

3. Measuring cluster label quality and associated cluster quality - I have undertaking further research into this issue beyond my previous research.

Monday, September 3, 2007

Preprocessing Reuters-21450 Dataset

The Reuters-21450 dataset is a text file with a number of tags identifying the individual parts of the each article. However, this quasi-structure is not separated by fixed part lengths - thus the need to apply logic to separate each article into individual text documents.

I was able to facilitate this process through Excel and VBA. I copied the Reuters document into Excel, and was able to process its structure via its unique tags. This allowed me to build a table with each documents unique parts across the rows of the table. I was then able to write the bodies of each off these documents to individual text documents, whose filename is its article ID number. This process is illustrated in the following picture.

The end result of the application of this process is a repository of 10,621 text files - each with its corresponding article ID as its filename.

This provides the basic data for the next part of preprocessing which includes parsing, the removal of stop words and the stemming of terms. Terms in my approach to text mining include both keywords and key phrases.

It is worth noting that the table created in this initial preprocessing phase will form the fundamental cornerstone for the clustering process - that is, it relates each document to its concept(s). It is used for the purpose of comparing clustering outcomes with human-defined reality.

Sunday, September 2, 2007

Proposed Carrot2 Framework

The developed Carrot2 System is a modularised java-based program. It extensively employs Carrot2 java libraries and API references. It uses Carrot2 embedded Apache Lucene 2.2 to automatically handle a lot of the data pre-processing, specifically employing the Porter stemmer for stemming. The developed uses Carrot2 System Weiss developed parser which accounts for “names, e-mail addresses, web pages and such.” (Weiss and Osinski, 2007b)

Using Carrot2 API and source code to resource a lot of the peripheral text mining services allows for a stable environment through which to test the effect of changes to the assumed key phrase on the outcomes of clustering.