What I did
Monday, October 29, 2007
The following image provides a summary of the work I undertook for my undergraduate engineering treatise. It attempts to answer and reconcile the following questions:
1. What did I do?
2. How did I do it?
3. Why did I do it?
4. What were the outcomes of clustering?
5. What were my conclusions?
Sunday, October 14, 2007
There are three principal clustering experiments:
1. 750 documents: this experiment takes the concepts that have 50 or more documents. Only 15 groups satisfy this condition, and I take 50 documents from each of these groups, giving 750 documents;
2. 9032 documents: this is the complete corpus, as described in the post of 29 September. This corpus has 65 groups; and
3. 2894 documents: this is the entire corpus less the two most commonly occurring groups, acq and earn. These account for 6138 documents, leaving 2894 documents and 63 groups.
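The first experiment above amounts to filtering groups by size and sampling a fixed number of documents from each. A minimal sketch of that selection step is below; the method and data structures (`selectLargeGroups`, a map from concept name to document identifiers) are hypothetical illustrations, not the treatise's actual code.

```java
import java.util.*;

public class CorpusSelection {
    // Hypothetical structure: concept (group) name -> identifiers of its documents.
    static List<String> selectLargeGroups(Map<String, List<String>> groups,
                                          int minSize, int sampleSize) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            if (e.getValue().size() >= minSize) {
                // Take the first sampleSize documents from each qualifying group.
                selected.addAll(e.getValue().subList(0, sampleSize));
            }
        }
        return selected;
    }
}
```

With `minSize = 50` and `sampleSize = 50`, the 15 qualifying Reuters groups would yield the 750-document corpus described above.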
Sunday, October 7, 2007
The custom Lingo document clustering application is a hybrid of the Carrot2 and KEA text mining applications. It uses the information retrieval module of KEA, which allows text files to be collected efficiently from a local directory. From there the system largely resembles a Carrot2 Lingo application: it uses the Porter stemmer for stemming, together with an adjusted stop word list that combines all unique stop words used by KEA and Carrot2. This list has 577 unique stop words.
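The combined stop word list can be produced by taking the set union of the two applications' lists, which removes duplicates automatically. The sketch below is illustrative only; the class and method names are assumptions, and the real word lists ship with KEA and Carrot2.

```java
import java.util.*;

public class StopWordMerge {
    // Merge two stop word lists into one deduplicated, alphabetically sorted set.
    static Set<String> merge(Collection<String> keaWords, Collection<String> carrotWords) {
        // TreeSet discards duplicates and keeps the merged list sorted.
        Set<String> merged = new TreeSet<>(keaWords);
        merged.addAll(carrotWords);
        return merged;
    }
}
```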
The outputs of this application are document cluster assignments, document assignment scores and cluster descriptions. A further module is needed to collect all information relevant to interpreting and evaluating the clustering. The custom output module appends human-defined labels and descriptive document data to the Lingo clustering information. This is then written to an output text file in which each line represents a document and document fields are separated by commas. Because this schema is consistent with the comma-separated value ("CSV") format, the file can be imported automatically into Excel, which can then be used to analyse the data.
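The output module described above boils down to serialising one record per document as a comma-separated line. A minimal sketch follows; the record fields and class names (`DocumentResult`, `ClusterOutputWriter`) are hypothetical stand-ins for the treatise's actual output schema.

```java
import java.io.*;
import java.util.*;

public class ClusterOutputWriter {
    // Hypothetical per-document output: identifier, human-defined label,
    // assigned cluster description, and assignment score.
    record DocumentResult(String docId, String humanLabel, String clusterLabel, double score) {}

    // One document per line, fields separated by commas (CSV-compatible).
    static String toCsvLine(DocumentResult r) {
        return String.join(",", r.docId(), r.humanLabel(),
                           r.clusterLabel(), String.valueOf(r.score()));
    }

    static void write(List<DocumentResult> results, Writer out) throws IOException {
        for (DocumentResult r : results) {
            out.write(toCsvLine(r));
            out.write(System.lineSeparator());
        }
    }
}
```

Note that fields containing commas or quotes would need escaping under the full CSV convention; the sketch assumes plain labels.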
Relying on the KEA API, the Carrot2 API and their source code for most of the peripheral text mining services provides a stable environment in which to test the effect of changes to the assumed key phrases on the outcomes of clustering.