Sunday, October 7, 2007
Custom Lingo Document Clustering Application
The custom Lingo document clustering application is a hybrid of the Carrot2 and KEA text mining applications. It uses KEA's information retrieval module, which efficiently collects text files from a local directory. Beyond that point the system largely resembles a Carrot2 Lingo application: it uses the Porter stemmer for stemming, and an adjusted stop word list that combines all unique stop words used by KEA and Carrot2. The combined list contains 577 unique stop words.
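Merging the two stop word lists amounts to a set union. The following is a minimal Python sketch of that idea, not the application's actual code (KEA and Carrot2 are Java libraries). The function names, the sample words and the lowercasing step are assumptions made for illustration only.

```python
def normalise(words):
    """Strip whitespace and lowercase each entry, dropping blank ones.

    Lowercasing is an assumption; the original lists may already be
    case-consistent.
    """
    return {w.strip().lower() for w in words if w.strip()}


def merge_stop_words(*word_lists):
    """Return the set of unique stop words across all the given lists."""
    merged = set()
    for words in word_lists:
        merged |= normalise(words)  # set union removes duplicates
    return merged


# Hypothetical tiny samples standing in for the real KEA and Carrot2 lists.
kea_sample = ["the", "of", "And", "above"]
carrot2_sample = ["the", "and", "a", "  of  "]

merged = merge_stop_words(kea_sample, carrot2_sample)
print(sorted(merged))
print(len(merged))
```

Applied to the full KEA and Carrot2 lists, the size of the merged set would correspond to the count of unique stop words reported above.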
The application outputs document cluster assignments, document assignment scores and cluster descriptions. A dedicated module is needed to collect all the information relevant to interpreting and evaluating the clustering. This custom output module appends human-defined labels and descriptive document data to the Lingo clustering information. The result is written to an output text file in which each line represents one document and the fields are separated by commas. Because this layout follows the comma-separated values (CSV) format, the file can be imported directly into Excel, where the data can then be analysed.
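The output layout described above, one comma-separated line per document, can be sketched with Python's standard csv module. This is an illustrative sketch under stated assumptions rather than the application's own output module: the field names, their order and the sample rows are all hypothetical.

```python
import csv
import io

# Hypothetical column order; the real output module may use different fields.
FIELDS = ["document", "human_label", "cluster", "score", "cluster_description"]


def write_rows(rows, out):
    """Write one CSV line per document to the file-like object `out`."""
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow([row[field] for field in FIELDS])


# Made-up example records, not real clustering results.
rows = [
    {"document": "doc1.txt", "human_label": "finance", "cluster": 0,
     "score": 0.82, "cluster_description": "Stock Market, Prices"},
    {"document": "doc2.txt", "human_label": "sport", "cluster": 1,
     "score": 0.47, "cluster_description": "Football League"},
]

buffer = io.StringIO()  # stands in for the output text file
write_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Using a CSV writer rather than joining strings by hand matters when a cluster description itself contains a comma: the writer quotes such fields, so Excel still reads each document as a single row with the right number of columns.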
Using the KEA API and the Carrot2 API and source code to supply most of the peripheral text mining services provides a stable environment in which to test how changes to the assumed key phrase affect the clustering outcomes.