Clustering algorithms for web applications: September 2007

Saturday, September 29, 2007

Removal of Fuzzy Documents

Fuzzy Documents refers to the documents from the Reuters-21450 dataset which were assigned more than one label. Since I am undertaking non-fuzzy clustering experiments it is necessary to remove these types of documents from the dataset.

What am I left with?

The above table illustrates the nature of this corpus transformation. The emerging corpus has 9032 documents which pertain to 65 unique clusters.

How are these clusters distributed?

In the above chart, only clusters with greater than 50 documents are included. That is, clusters whose documents represent greater than 0.6% of the corpus. These are the 'top 15' documents, and in total they account for 91.4% of all documents. The rest are allocated to other: 774 documents. The 'top two' clusters represent 68% of all documents, and thus this dataset is not normally distributed.

Tuesday, September 18, 2007

Text Mining Application

There are a number of well-known java-based open-source text mining application APIs inherent within the text mining research and communities. Of perhaps greatest importance are KEA (developed by the same engineering group as WEKA) and Carrot2 (discussed previously in blogs dated September 2, July 23 and indirectly through discussions of its clustering algorithm lingo).

KEA versus Carrot2

KEA stands for Keyphrase Extraction Algorithm and, as its name suggests, is used to extract keyphrases and words from text. It is a verbose system which allows for both controlled indexing and free indexing. The forma is concerned with extracting keyphrases from documents with reference to a controlled vocabulary providing for a more concept-based approach to extraction whilst the latter does not consider different term meaning e.g. lap-top will not be considered the same as note-book. These are natural language processing anomolies. KEA does not undertake clustering, but instead employs a supervised learning approach and builds a Baive Bayes learning model through a set of training documents with known keyphrases. For the purposes of training, phrase "term-frequency-inverse-document-frequency" (TF-IDF) weight as well as position within the document are considered. (El-Beltagy, 2006)

Carrot2 has been introduced previously. This scheme is very different to KEA, as it takes-on an unsupervised approach to text mining and phrase extraction. In terms of implementation however, there are many similarities in the initial stages of text mining. Consider the following:

The above diagram illustrates a summary of the information retrieval and document preprocessing implementation of text mining employed by both KEA and Carrot2 (however, with Carrot2 specifically geared for web search results). As eluded to on the post on September 2, Carrot2 explicitly uses Apache Lucene 2.2 (adjusted for web search resutls) to facilitate the above processes. Each broadly undertakes each of -> points depicted in the central box, except for the marked [] which relates to the KEA controlled indexing approach to text mining. The other marking within this box * ear-marks variation between the two text mining schemes and well as signing options within the individual schemes. Each is now considered:

1. Stemming: there are a number of implicit options available to both schemes with regards to stemming-

a. Lovins Stemmer: the most aggressive stemmer, with some 294 nested rules;

b. Porter Stemmer: this more subdued approach has some 37 rules;

c. Partial Porter Stemmer: the Porter Stemmer has 5 multi-part stages which allow the "miner" to be more or be less conservative in their stemming process. The first stage Porter stemmer is a popular methodology, handling basic plurals e.g. horses becomes horse, processes becomes process, men does not become man; and

d. No Stemmer.

2. Stop words: KEA defines some 499 stop-words, Carrot2 some 324 stop-words (adjusted lucene), lucene only some 33 and the Brown Corpus some 425 (Sharma et Raman, 2003).

3. Vocabularies: This is only relevant to KEA. The "miner" is able to dicate a dictionary, thesaurus or list of terms when undertaking controlled indexing. Also, in terms of implementation, these can be in either text form or resourse description format ("rdf"). A number of popular vocabularies are in circulation, including the integrated public servicevocabulary ("ipsv") and the agrovoc vocoabulary.

How do these options affect clustering?

There are essentially horses for courses. That is, a particular data set or text mining experiment may require more aggressive or more conservative preprocessing.

How do these systems relate to my thesis?

I am now working on the specific implementation of clustering: LSA/SVD & description comes first - replacing KEA's NaiveBayes classifier with Carrot2's lingo algorithm. These frameworks are both java based and open-source.

KEA provides a very suitable fit with my experiments, as it is not based on an index of Internet search results but instead a directory of text files. Carrot2's lingo is an unsupervised approach to learning and hence allows for clustering. A potential shortfall of this approach is that lingo uses LSA/SVD and hence forms a term-document matrix- thus unsuitable for scalability considering the size of my proposed Reuters-21450 dataset (see also post relating to pre-preprocssing of this dataset). I have a number of papers which focus on such constraints (i.e. using smaller term dictionaries, more aggressive stemming [as eluded to above], retricting the number of identified phrases - either by increasing minimum phrase length or increasing maximum phrase length, etc). This leads to, adapting Carrot2 for the purposes of larger datasets.

What are the limitations of such an approach?

Initial complexities associated with such an approach:

1. The purpose of Lingo2 is to create more understandable cluster labels, thus a tension exists between label quality (as well as associated cluster quality in description comes first) and scalability;

2. Even with very aggressive stemming, scalability may still be an issue -that is, the size of the dataset too large for such experiments. I could therefore reduce the number of concepts and associated documents.

Alternatively, I could adopt a different dataset. In so much I could undertake similar experiments to those reported on the KEA website. In these experiments, the group use the controlled agrovac vocabulary to extract key phrases from agricultural documents. I could extract phrases using the unsupervised KEA algorithm and compare the keyphrases as extracted using the adapted carrot2 scheme. That is, using the same extraction and preprocessing techniques and allowing for different clustering and resultant keyphrase extraction outcomes; and

3. Measuring cluster label quality and associated cluster quality - I have undertaking further research into this issue beyond my previous research.

Monday, September 3, 2007

Preprocessing Reuters-21450 Dataset

The Reuters-21450 dataset is a text file with a number of tags identifying the individual parts of the each article. However, this quasi-structure is not separated by fixed part lengths - thus the need to apply logic to separate each article into individual text documents.

I was able to facilitate this process through Excel and VBA. I copied the Reuters document into Excel, and was able to process its structure via its unique tags. This allowed me to build a table with each documents unique parts across the rows of the table. I was then able to write the bodies of each off these documents to individual text documents, whose filename is its article ID number. This process is illustrated in the following picture.

The end result of the application of this process is a repository of 10,621 text files - each with its corresponding article ID as its filename.

This provides the basic data for the next part of preprocessing which includes parsing, the removal of stop words and the stemming of terms. Terms in my approach to text mining include both keywords and key phrases.

It is worth noting that the table created in this initial preprocessing phase will form the fundamental cornerstone for the clustering process - that is, it relates each document to its concept(s). It is used for the purpose of comparing clustering outcomes with human-defined reality.

Sunday, September 2, 2007

Proposed Carrot2 Framework

The developed Carrot2 System is a modularised java-based program. It extensively employs Carrot2 java libraries and API references. It uses Carrot2 embedded Apache Lucene 2.2 to automatically handle a lot of the data pre-processing, specifically employing the Porter stemmer for stemming. The developed uses Carrot2 System Weiss developed parser which accounts for “names, e-mail addresses, web pages and such.” (Weiss and Osinski, 2007b)

Using Carrot2 API and source code to resource a lot of the peripheral text mining services allows for a stable environment through which to test the effect of changes to the assumed key phrase on the outcomes of clustering.

Clustering Experiments Dataset

Data acquisition in text mining experiments is concerned with the acquisition of information to be employed within the mining process itself. For the purposes of this paper, Reuters-21450 news corpus was the natural selection for undertaking data mining experiments. It is seen to be the benchmarking standard for automated text categorization, it is “estimated that this corpus has provided data for over 100 published research papers, particularly in the fields of information retrieval, natural language processing and machine learning.” (Rose et al., 2002, Weiss et al., 1999) The corpus relates to 10,788 news stories containing some 24,240 unique terms after stemming and the removal of stop words. In order to bench mark the outcomes of text mining algorithms, each story or document belongs to a category (on average 1.3 categories per document). There are 90 identified categories. The data is split into two sub-datasets according to ModApteSplit. (Lan et al., 2001) Thus, 75 per cent or 7769 documents of the corpus are taken to be the training set whilst the remaining 25 per cent or 3019 documents form the testing set.

Each document within the Reuters-21450 corpus has a title, a content section, a cluster assignment or a number of cluster assignments and an indicator of length. The corpus is partitioned into these documents and their respectable parts through specific text-based tags.

Clustering algorithms for web applications