Monday, October 29, 2007

Treatise Summary

The following image provides a summary of the work I undertook for my undergraduate engineering treatise. It attempts to answer and reconcile the following questions:

1. What did I do?
2. How did I do it?
3. Why did I do it?
4. What were the outcomes of clustering?
5. What were my conclusions?

Sunday, October 14, 2007

Clustering Experiments

There are three principal clustering experiments:
1. 750 Documents: this experiment takes the concepts that have 50 or more documents. Only 15 groups satisfy this condition, and I take 50 documents from each of these groups, giving the 750 documents;
2. 9032 Documents: this is the complete corpus, as described in the post of 29 September. This corpus has 65 groups; and
3. 2894 Documents: this is the entire corpus less the two most commonly occurring groups - acq and earn. These two groups account for 6138 documents, leaving 2894 documents across 63 groups.

Sunday, October 7, 2007

Custom Lingo Document Clustering Application

The developed custom lingo document clustering application is a hybrid of the Carrot2 and KEA text mining applications. The system uses the information retrieval module of KEA, which allows for the efficient collection of text files from a local directory. The system then largely resembles a Carrot2 lingo-type application. Thus it uses the Porter Stemmer for stemming. It uses an adjusted stop word list, which encompasses all unique stop words used by KEA and Carrot2. This list has 577 unique stop words.
The outputs of this application are document cluster assignments, document assignment scores and cluster descriptions. It was necessary to create a module to collect all relevant information for clustering interpretation and evaluation. The custom output module appends human-defined labels and document descriptive data to the lingo clustering information. This is then written to an output text file in which each line represents a document, and document data is separated by commas. This schema allows for automated importing into Excel, as it is consistent with the comma separated value ("CSV") format. Excel can then be used to analyse this data.
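As a rough illustration of this output module (in Python rather than the Java used by the actual system), the following sketch writes one CSV line per document. The field names are hypothetical stand-ins for the actual schema:

```python
import csv
import io

def write_cluster_report(records, fileobj):
    """Write one CSV line per document: lingo cluster information plus
    the appended human-defined label. The field names here are
    illustrative, not the exact schema used in the treatise."""
    fields = ["doc_id", "cluster_id", "cluster_label",
              "assignment_score", "human_label"]
    writer = csv.DictWriter(fileobj, fieldnames=fields)
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)

# Example: two documents, ready to open in Excel as a CSV file.
buf = io.StringIO()
write_cluster_report([
    {"doc_id": "14826", "cluster_id": 3, "cluster_label": "crude oil",
     "assignment_score": 0.82, "human_label": "crude"},
    {"doc_id": "14830", "cluster_id": 7, "cluster_label": "grain exports",
     "assignment_score": 0.61, "human_label": "grain"},
], buf)
```

Because each line is a plain comma-separated record, Excel imports it without any custom parsing.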
Using the KEA API, Carrot2 API and source code to resource a lot of the peripheral text mining services allows for a stable environment through which to test the effect of changes to the assumed key phrase on the outcomes of clustering.

Saturday, September 29, 2007

Removal of Fuzzy Documents

Fuzzy Documents refers to the documents from the Reuters-21450 dataset which were assigned more than one label. Since I am undertaking non-fuzzy clustering experiments it is necessary to remove these types of documents from the dataset.

What am I left with?

The above table illustrates the nature of this corpus transformation. The emerging corpus has 9032 documents which pertain to 65 unique clusters.

How are these clusters distributed?

In the above chart, only clusters with greater than 50 documents are included. That is, clusters whose documents represent greater than 0.6% of the corpus. These are the 'top 15' clusters, and in total they account for 91.4% of all documents. The remaining 774 documents are allocated to 'other'. The 'top two' clusters represent 68% of all documents, and thus this dataset is not normally distributed.

Tuesday, September 18, 2007

Text Mining Application

There are a number of well-known Java-based open-source text mining application APIs well established within the text mining research community. Of perhaps greatest importance are KEA (developed by the same engineering group as WEKA) and Carrot2 (discussed previously in blogs dated September 2, July 23 and indirectly through discussions of its clustering algorithm lingo).
KEA versus Carrot2
KEA stands for Keyphrase Extraction Algorithm and, as its name suggests, is used to extract keyphrases and keywords from text. It is a flexible system which allows for both controlled indexing and free indexing. The former is concerned with extracting keyphrases from documents with reference to a controlled vocabulary, providing a more concept-based approach to extraction, whilst the latter does not consider different term meanings, e.g. lap-top will not be considered the same as note-book. These are natural language processing anomalies. KEA does not undertake clustering, but instead employs a supervised learning approach and builds a Naive Bayes learning model through a set of training documents with known keyphrases. For the purposes of training, a phrase's "term-frequency-inverse-document-frequency" (TF-IDF) weight as well as its position within the document are considered. (El-Beltagy, 2006)
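As a rough sketch of the two training features just mentioned (in Python, not KEA's actual Java code), the following computes a candidate phrase's TF-IDF weight and the relative position of its first occurrence. The naive whitespace tokenisation is an assumption for illustration only:

```python
import math

def keyphrase_features(phrase, document, corpus):
    """Compute two KEA-style features for a candidate phrase:
    its TF-IDF weight within `document`, and the relative position
    of its first occurrence. Tokenisation is naive whitespace
    splitting (illustrative only)."""
    words = document.lower().split()
    phrase_words = phrase.lower().split()
    n = len(phrase_words)
    # term frequency: sliding-window count of the phrase in the document
    positions = [i for i in range(len(words) - n + 1)
                 if words[i:i + n] == phrase_words]
    tf = len(positions) / max(len(words), 1)
    # inverse document frequency over the corpus (crude substring test)
    df = sum(1 for doc in corpus if phrase.lower() in doc.lower())
    idf = math.log(len(corpus) / max(df, 1))
    # first-occurrence position as a fraction of document length
    first_pos = positions[0] / len(words) if positions else 1.0
    return tf * idf, first_pos
```

A phrase that is frequent in one document, rare in the corpus and appears early scores a high TF-IDF and a low position value, which is exactly the profile the Naive Bayes model learns to associate with keyphrases.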

Carrot2 has been introduced previously. This scheme is very different to KEA, as it takes-on an unsupervised approach to text mining and phrase extraction. In terms of implementation however, there are many similarities in the initial stages of text mining. Consider the following:

The above diagram summarises the information retrieval and document preprocessing stages of text mining employed by both KEA and Carrot2 (with Carrot2 specifically geared toward web search results). As alluded to in the post of September 2, Carrot2 explicitly uses Apache Lucene 2.2 (adjusted for web search results) to facilitate the above processes. Each system broadly undertakes each of the points depicted in the central box, except for the marked point which relates only to the KEA controlled indexing approach to text mining. The other marking within this box ear-marks variation between the two text mining schemes, as well as signalling options within the individual schemes. Each is now considered:

1. Stemming: there are a number of options available to both schemes with regard to stemming-

a. Lovins Stemmer: the most aggressive stemmer, with some 294 nested rules;

b. Porter Stemmer: this more subdued approach has some 37 rules;

c. Partial Porter Stemmer: the Porter Stemmer has 5 multi-part stages which allow the "miner" to be more or less conservative in the stemming process. The first-stage Porter stemmer is a popular methodology, handling basic plurals e.g. horses becomes horse, processes becomes process, men does not become man; and

d. No Stemmer.

2. Stop words: KEA defines some 499 stop-words, Carrot2 some 324 stop-words (adjusted Lucene), Lucene itself only some 33 and the Brown Corpus some 425 (Sharma and Raman, 2003).

3. Vocabularies: This is only relevant to KEA. The "miner" is able to dictate a dictionary, thesaurus or list of terms when undertaking controlled indexing. In terms of implementation, these can be in either text form or resource description framework ("RDF") form. A number of popular vocabularies are in circulation, including the integrated public service vocabulary ("IPSV") and the Agrovoc vocabulary.
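The first-stage plural handling described in point 1 above can be sketched in a few lines. This is a Python toy covering only Porter's step 1a (the actual systems use the full Java Porter implementations); the examples are the ones from the post:

```python
def porter_step_1a(word):
    """First-stage Porter stemming: basic plural handling only.
    Rules, applied in order of longest matching suffix:
      sses -> ss, ies -> i, ss -> ss (unchanged), s -> (dropped)"""
    if word.endswith("sses"):
        return word[:-2]       # processes -> process
    if word.endswith("ies"):
        return word[:-2]       # ponies -> poni
    if word.endswith("ss"):
        return word            # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]       # horses -> horse
    return word                # men -> men (irregular plurals untouched)
```

Note that irregular plurals pass through unchanged, which is why men does not become man under any Porter stage.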

How do these options affect clustering?

It is essentially a case of horses for courses. That is, a particular dataset or text mining experiment may require more aggressive or more conservative preprocessing.

How do these systems relate to my thesis?

I am now working on the specific implementation of clustering - LSA/SVD and the description-comes-first approach - replacing KEA's Naive Bayes classifier with Carrot2's lingo algorithm. Both frameworks are Java-based and open-source.

KEA provides a very suitable fit with my experiments, as it is based not on an index of Internet search results but on a directory of text files. Carrot2's lingo is an unsupervised approach to learning and hence allows for clustering. A potential shortfall of this approach is that lingo uses LSA/SVD and hence forms a term-document matrix - thus raising scalability concerns given the size of my proposed Reuters-21450 dataset (see also the post relating to preprocessing of this dataset). I have a number of papers which focus on such constraints (i.e. using smaller term dictionaries, more aggressive stemming [as alluded to above], restricting the number of identified phrases - for example by increasing the minimum phrase length or decreasing the maximum phrase length, etc.). This leads to adapting Carrot2 for the purposes of larger datasets.

What are the limitations of such an approach?

Initial complexities associated with such an approach:

1. The purpose of lingo is to create more understandable cluster labels, thus a tension exists between label quality (as well as associated cluster quality in description-comes-first) and scalability;

2. Even with very aggressive stemming, scalability may still be an issue - that is, the size of the dataset may be too large for such experiments. I could therefore reduce the number of concepts and associated documents.

Alternatively, I could adopt a different dataset and undertake similar experiments to those reported on the KEA website. In those experiments, the group uses the controlled Agrovoc vocabulary to extract key phrases from agricultural documents. I could extract phrases using KEA and compare them with the keyphrases extracted using the adapted Carrot2 scheme - that is, using the same extraction and preprocessing techniques whilst allowing for different clustering and resultant keyphrase extraction outcomes; and

3. Measuring cluster label quality and associated cluster quality - I have undertaken further research into this issue beyond my previous research.

Monday, September 3, 2007

Preprocessing Reuters-21450 Dataset

The Reuters-21450 dataset is a text file with a number of tags identifying the individual parts of each article. However, this quasi-structure is not separated by fixed part lengths - thus the need to apply logic to separate each article into an individual text document.
I was able to facilitate this process through Excel and VBA. I copied the Reuters document into Excel, and was able to process its structure via its unique tags. This allowed me to build a table with each document's unique parts across the rows of the table. I was then able to write the bodies of each of these documents to individual text files, with each file named by its article ID number. This process is illustrated in the following picture.
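The same splitting logic can be sketched in Python (the actual work was done in Excel/VBA). The tag names below follow the standard Reuters markup and are assumptions - adjust them if your copy of the corpus differs:

```python
import os
import re

def split_reuters(raw_text, out_dir):
    """Split a Reuters-style tagged file into one text file per
    article, named by article ID. Tag names (<REUTERS NEWID=...>,
    <BODY>...</BODY>) follow the standard Reuters markup."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for m in re.finditer(r'<REUTERS[^>]*NEWID="(\d+)"[^>]*>(.*?)</REUTERS>',
                         raw_text, re.S):
        article_id, article = m.group(1), m.group(2)
        body = re.search(r"<BODY>(.*?)</BODY>", article, re.S)
        if body is None:
            continue  # some articles carry no body text
        path = os.path.join(out_dir, article_id + ".txt")
        with open(path, "w") as f:
            f.write(body.group(1).strip())
        written.append(article_id)
    return written
```

The end result mirrors the process above: a directory of plain text files, one per article, each named by its article ID.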



The end result of the application of this process is a repository of 10,621 text files - each with its corresponding article ID as its filename.
This provides the basic data for the next part of preprocessing which includes parsing, the removal of stop words and the stemming of terms. Terms in my approach to text mining include both keywords and key phrases.
It is worth noting that the table created in this initial preprocessing phase forms the cornerstone of the clustering process - that is, it relates each document to its concept(s). It is used for the purpose of comparing clustering outcomes with human-defined reality.
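One simple way to compare clustering outcomes with the human-defined labels in that table is cluster purity: for each cluster, count its majority human label and divide the total by the number of documents. This Python sketch is an illustrative evaluation measure, not necessarily the one used in the treatise:

```python
from collections import Counter

def cluster_purity(assignments):
    """Purity of a clustering against human-defined labels.
    `assignments` maps doc_id -> (cluster_id, human_label). For each
    cluster we count its majority human label, then divide the sum of
    majorities by the total number of documents."""
    clusters = {}
    for doc_id, (cluster_id, label) in assignments.items():
        clusters.setdefault(cluster_id, []).append(label)
    majority = sum(Counter(labels).most_common(1)[0][1]
                   for labels in clusters.values())
    return majority / len(assignments)

# Example: cluster 0 is mostly 'acq' with one 'earn' document mixed in.
score = cluster_purity({
    "1": (0, "acq"), "2": (0, "acq"),
    "3": (0, "earn"), "4": (1, "earn"),
})
```

A purity of 1.0 would mean every cluster contains documents of a single human-defined concept.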

Sunday, September 2, 2007

Proposed Carrot2 Framework

The developed Carrot2 system is a modularised Java-based program. It extensively employs the Carrot2 Java libraries and API references. It uses Carrot2's embedded Apache Lucene 2.2 to automatically handle much of the data pre-processing, specifically employing the Porter stemmer for stemming. The developed system uses the Carrot2 parser developed by Weiss, which accounts for "names, e-mail addresses, web pages and such." (Weiss and Osinski, 2007b)

Using Carrot2 API and source code to resource a lot of the peripheral text mining services allows for a stable environment through which to test the effect of changes to the assumed key phrase on the outcomes of clustering.

Clustering Experiments Dataset

Data acquisition in text mining experiments is concerned with the acquisition of information to be employed within the mining process itself. For the purposes of this paper, the Reuters-21450 news corpus was the natural selection for undertaking data mining experiments. It is seen to be the benchmarking standard for automated text categorization; it is "estimated that this corpus has provided data for over 100 published research papers, particularly in the fields of information retrieval, natural language processing and machine learning." (Rose et al., 2002, Weiss et al., 1999) The corpus relates to 10,788 news stories containing some 24,240 unique terms after stemming and the removal of stop words. In order to benchmark the outcomes of text mining algorithms, each story or document belongs to a category (on average 1.3 categories per document). There are 90 identified categories. The data is split into two sub-datasets according to the ModApte split. (Lan et al., 2001) Thus, 7769 documents (roughly 72 per cent) of the corpus are taken to be the training set whilst the remaining 3019 documents form the testing set.

Each document within the Reuters-21450 corpus has a title, a content section, a cluster assignment or a number of cluster assignments and an indicator of length. The corpus is partitioned into these documents and their respective parts through specific text-based tags.

Saturday, August 25, 2007

Paper References

I have received a number of enquiries about the research I have undertaken thus far in my project. In response, I have decided to post the resources I have included to date:
1. AGHAGOLZADEH, M., SOLTANIAN-ZADEH, H. & ARAABI, B. N. (2006) Finding the Number of Clusters in a Dataset Using an Information Theoretic Hierarchical Algorithm. IEEE, 1336-1339.
2. AHONEN, H., HEINONEN, O., KLEMETTINEN, M. & VERKAMO, A. I. (1997) Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Document Collections. Department of Computer Science. Helsinki, University of Helsinki.
3. AMARASIRI, R., CEDDIA, J. & ALAHAKOON, D. (2005) Exploratory Data Mining Led by Text Mining Using a Novel High Dimensional Clustering Algorithm. Fourth International Conference on Machine Learning and Applications (ICMLA'05). IEEE Computer Society.
4. ATKINSON-ABUTRIDY, J., MELLISH, C. & AITKEN, S. (2004) Combining Information Extraction with Genetic Algorithms for Text Mining. IEEE Intelligent Systems, 22-30.
5. AUMANN, Y., FELDMAN, R., YEHUDA, Y. B., LANDAU, D., LIPHSTAT, O. & SCHLER, Y. (1999) Circle Graphs: New Visualization Tools for Text-Mining. J. M. Zytkow and J. Rauch (Eds): PKDD'99, 277-282.
6. CALVO, R., JOSE, J. & ADEVA, G. (2006) Mining Text with Pimiento. IEEE Internet Computing, 27-35.
7. CHANG, H., HSU, C. & DENG, Y. (2004) Unsupervised Document Clustering Based on Keyword Clusters. International Symposium on Communications and Information Technologies 2004 (ISCIT 2004). Sapporo, Japan.
8. CHEN, J., YAN, J., ZHANG, B., YANG, Q. & CHEN, Z. (2006) Diverse Topic Phrase Extraction through Latent Semantic Analysis. Sixth International Conference on Data Mining (ICDM'06). IEEE Computer Society.
9. CODY, W., KREULEN, J., KRISHNA, V. & SPANGLER, W. (2002) The integration of business intelligence and knowledge management. IBM Systems Journal, 41, 697-713.
10. EL-BELTAGY, S. R. (2006) KP-Miner: A Simple System for Effective Keyphrase Extraction. IEEE, 1-5.
11. FAN, W., WALLACE, L. & RICH, S. (2006) Tapping the Power of Text Mining. Communications of the ACM, 49, 77-82.
12. GLEICH, D. & ZHUKOV, L. (2003) SVD Subspace Projections for Term Suggestion Ranking and Clustering. Claremont, California, Harvey Mudd College, Yahoo! Research Labs.
13. GRIMES, S. (2003) Decision Support: The Word on Text Mining. Intelligent Enterprise, 6, 12-13.
14. GONDEK, D. & HOFMANN, T. (2006) Non-redundant data clustering. Knowledge and Information Systems, 1-24.
15. HSU, H. C. C. (2005) Using Topic Keyword Clusters for Automatic Document Clustering. Third International Conference on Information Technology and Applications (ICITA'05). IEEE.
16. IIRITANO, S. & RUFFOLO, M. (2001) Managing the Knowledge Contained in Electronic Documents: a Clustering Method for Text Mining. IEEE, 454-458.
17. JAIN, A. K., MURTY, M. N. & FLYNN, P. J. (1999) Data Clustering: A Review. ACM Computing Surveys, 31, 264-323.
18. JENSSEN, R., HILD, K. E. II, ERDOGMUS, D., PRINCIPE, J. C. & ELTOFT, T. (2003) Clustering using Renyi's Entropy. IEEE, 523-528.
19. LAN, M., SUNG, S., LOW, H. & TAN, C. (2001) A Comparative Study on Term Weighting Schemes for Text Categorization. Department of Computer Science. Singapore, National University of Singapore.
20. LANDAUER, T. K. & DUMAIS, S. T. (1997) A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104, 211-240.
21. LANDAUER, T. K., LAHAM, D., REHDER, B. & SCHREINER, M. E. (1997) How Well can Passage Meaning be Derived without Using Word Order? A Comparison of Latent Semantic Analysis and Humans.
22. MONTES-Y-GOMEZ, M., GELBUKH, A., LOPEZ-LOPEZ, A. & BAEZA-YATES, R. (2001) Text Mining with Conceptual Graphs. IEEE, 898-903.
23. ONG, S. A. J. (2000) A Data Mining Strategy for Inductive Data Clustering: A Synergy Between Self Organising Neural Networks and K-Means Clustering Techniques. IEEE.
24. OSINSKI, S. (2006) Improving Quality of Search Results Clustering with Approximate Matrix Factorisations. M. Lalmas et al. (Eds.): ECIR 2006, 167-178.
25. OSINSKI, S. & WEISS, D. (2005) A Concept-Driven Algorithm for Clustering Search Results. IEEE Intelligent Systems, 48-54.
26. QIAN, Y. & SUEN, C. Y. (2000) Clustering Combination Method. IEEE, 732-735.
27. SHARMA, R. & RAMAN, S. (2003) Phrase-based Text Representation for Managing Web Documents. International Conference on Information Technology: Computers and Communication (ITCC'03). IEEE Computer Society.
28. SHEHATA, S., KARRAY, F. & KAMEL, M. (2006) Enhancing Text Clustering using Concept-based Mining Model. Sixth International Conference on Data Mining. IEEE Computer Society.
29. STEINBACH, M., KARYPIS, G. & KUMAR, V. (2006) A Comparison of Document Clustering Techniques. IEEE, 1-2.
30. TJHI, W. & CHEN, L. (2006) Flexible Fuzzy Co-Clustering with Feature-cluster Weighting. IEEE.
31. TSUJII, J. & ANANIADOU, S. (2005) Thesaurus or Logical Ontology, Which One Do We Need for Text Mining? Language Resources and Evaluation, 39, 77-90.
32. WEISS, S. M., APTE, C., DAMERAU, F. J., JOHNSON, D. E., OLES, F. J., GOETZ, T. & HAMPP, T. (1999) Maximizing Text-Mining Performance. IEEE, 63-69.
33. WIEMER-HASTINGS, P. & ZIPITRIA, I. (2000) Rules for Syntax, Vectors for Semantics. Edinburgh, University of Edinburgh.
34. WITTEN, I. H. & FRANK, E. (2005) Data Mining: Practical Machine Learning Tools and Techniques, San Francisco, Morgan Kaufmann Publishers.
35. WONG, P., COWLEY, W., FOOTE, H., JURRUS, E. & THOMAS, J. (2000) Visualizing Sequential Patterns for Text Mining. IEEE Symposium on Information Visualization 2000 (InfoVis'00).
36. WU, H. & GUNOPULOS, D. (2002) Evaluating the Utility of Statistical Phrases and Latent Semantic Indexing for Text Classification. IEEE, 713-716.
37. YANG, H. & LEE, C. (2003) A Text Mining Approach on Automatic Generation of Web Directories and Hierarchies. IEEE/WIC International Conference on Web Intelligence (WI'03). IEEE.
38. YU, J. (2005) General C-Means Clustering Model. IEEE Computer Society, 1197-1211.
39. ZELIKOVITZ, S. & HIRSH, H. (2000) Using LSI for Text Classification in the Presence of Background Text. Piscataway, New Jersey, Rutgers University.
40. ZHONG, M., CHEN, Z. & LIN, Y. (2004) Using Classification and Key Phrase Extraction for Information Retrieval. IEEE, 3037-3041.

Paper Introduction

The ubiquitous adoption of information systems has resulted in an explosion of data. This phenomenon has been largely supported by a continually evolving technological landscape, fundamentally the Internet. The need for businesses and organisations to understand this data, in hand with the enabling progress in computer processing, has led to the birth of data mining.

Data mining is the process of abstracting patterns from data so as to learn something interesting or to make nontrivial predictions on unseen data (Frank, 2005). There are two learning paradigms employed in data mining systems, namely supervised and unsupervised learning. The former learns from data labelled with known outcomes, while the latter seeks structure in data without any such labels.

Unsupervised learning employs clustering for the purposes of abstracting patterns from data. Clusters are groups of similar data points. Clustering algorithms are well-defined processes that facilitate learning. They generally identify similar data points, label points and generate cluster labels.

There is a dichotomy in data mining problems. More traditionally, data of a numerical or structured nature was employed within learning schemes. However, in the wake of the Internet there has been a shift toward unstructured or textual-type data problems, which has in turn led to the emergence of text mining as a sub-type of data mining. Mining text-based datasets requires additional consideration due to their apparent randomness and embedded semantic meanings.

Text mining requires that the acquired unstructured data undergo pre-processing so as to produce "some kind of semi-structured representation" for use in the discovery phase. (Baeza-Yates, 2001) Generally, pre-processing involves transforming text-based datasets (corpora) into matrices pertaining to documents, terms and document-term weightings. (Weiss, 2005) In recent years, there has been a lot of research centred on the optimisation of these operations. A popular approach to text mining that employs this type of scheme is latent semantic analysis ("LSA").

LSA is a text mining method for "inducing and representing aspects of the meaning of words and passages reflected in their usage". (Schreiner, 1997) LSA converts a corpus into a term-document matrix where matrix cells are the frequencies of the term in a given document. Central to the formation of this matrix is the need to first identify a dictionary of relevant terms from the collection of documents. (Hampp, 1999) This approach is the "bag of words method" which takes into account neither word order nor word context within documents. (Schreiner, 1997) Landauer et al found LSA to perform just as well as human judges in classification. (Ibid) Recent work in text mining has moved toward developing schemes that tend to use "more complete representations than just key words ... [as there is a belief] that these new representations will expand the kinds of discovered knowledge". (Baeza-Yates, 2001) Karray et al proposed a concept-based mining model that captures semantics of text through analysis of both the sentence and document. (Kamel, 2006) They demonstrated that their method enhanced the automatic clustering of documents. (Ibid) Osinski and Weiss developed the lingo algorithm in order to improve the quality of cluster labels. They employed a description-comes-first approach to clustering in which "good, conceptually varied cluster labels" are found prior to assigning "documents to the labels to form groups". (Weiss, 2005) The labels are deduced through key phrase extraction. They took key phrases to be frequent phrases and performed extraction through a version of SHOC's phrase extraction algorithm. (Ibid) In this paper, I intend to improve the automated extraction of key phrases from corpora using identified semantically important words for the purposes of forming better semi-structured data for clustering.
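The bag-of-words term-document matrix and its SVD reduction described above can be sketched with a toy example in Python using numpy (the terms and counts are invented for illustration):

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents,
# cells are raw term frequencies (the "bag of words" representation).
terms = ["oil", "crude", "grain", "wheat"]
A = np.array([
    [2.0, 1.0, 0.0],   # oil
    [1.0, 1.0, 0.0],   # crude
    [0.0, 0.0, 2.0],   # grain
    [0.0, 1.0, 1.0],   # wheat
])

# SVD decomposes A into a term space (U), singular values (S) and a
# document space (Vt); truncating to the k largest components gives
# the latent semantic space in which similar documents lie close.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
```

The rank-k approximation A_k smooths over exact word choices, which is how LSA captures similarity between documents that share few literal terms.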

Chapter two of this paper provides a background of related research, setting context and providing a platform for the remaining chapters. It is organised into three primary and interrelated sections. Chapter three outlines the design of the clustering algorithms for web applications experiments. Chapter four presents the analysis and discussion of experiment results. Chapter five presents the conclusions drawn from the previous chapters.

Monday, August 20, 2007

Background Structure

As alluded to in a previous post, the paper topic of "Clustering Algorithms for Web Applications" warrants the specific consideration of a number of interrelated areas of research. The background section of the paper has been structured into three primary sections:

1. Text Mining
1.1 Text Mining Introduction
1.2 Text Mining Process
1.2.1 Data Pre-Processing
1.2.2 Text Mining
1.2.3 Result Interpretation and Refinement

1.3 Text Mining Algorithms, Evolution and Evaluation
1.4 Text Mining Software Design

2. Clustering
2.1 Clustering Process
2.2 Clustering Theoretical Framework
2.3 Clustering Algorithms
2.4 Cluster Representation
2.5 Result Interpretation and Refinement
2.6 Clustering Development and Evaluation

3. Clustering in Text Mining
3.1 Clustering Algorithms in Text Mining
3.1.1 Clustering Algorithms in Text Mining Introduction
3.1.2 Clustering Algorithms in Text Mining Evolution and Evaluation

3.2 Clustering Algorithms for Web Applications

This section of the thesis is intended to set the context and provide a platform for the remaining sections. Considering the above structure provides a snapshot of where this paper sits in terms of previous contributions and illustrates my approach to the topic.

Saturday, August 18, 2007

Paper Abstract - as at 19 August 2007

The entrenchment of the Internet into modern society has led to a proliferation of unstructured information. The amount of unstructured information has also been compounded by the broad ubiquitous adoption of information systems. Hence there is an endemic and growing need to extract the knowledge hidden within collections of documents, thus the emergence of text mining. Unsupervised learning schemes allow for the abstraction of patterns in data, and are typically facilitated by clustering algorithms. Recent progress in computational processing has provided greater opportunity to develop and employ clustering algorithms in text mining, leading to a myriad of contributions. Modern search engines continue to move towards automated web page retrieval that is more efficient, and provides the searcher with understandable and relevant search results. This paper focuses on clustering algorithms for web applications. A critical review of current algorithms, their evolution and an evaluation is presented. This analysis forms a framework through which to consider the nature of this progress and some of the limitations of the current clustering technologies. A number of text mining experiments are replicated and extended, with a comprehensive discussion to further illustrate these issues.

Tuesday, July 31, 2007

Clustering Algorithms General Background

Clustering is the process of abstracting patterns from data sets. Moreover, it is the labelling of a data point as belonging to a group of similarly identified data points.


Clustering is facilitated by algorithms, which are well-defined instructions for carrying out some task. Clustering algorithms encompass the preprocessing of the data, extracting clusters from the data, generating descriptions (labels) of the identified clusters and any post-processes like assessing cluster quality.


The use of clustering type schemes has become somewhat ubiquitous in many industries. As an example, clustering is used by retailers to abstract customer types and related behaviours. Consider Joe Bloggs Clothes Shop. The managers of this shop are able to employ a range of clustering algorithms to isolate specific types of shoppers - perhaps within some customer loyalty program. That is, a clustering algorithm may identify a group of female shoppers who made greater than three high valued purchases in the previous summer season range. Coming into the next summer, Joe Bloggs is able to send these customers specific catalogues and promotions that will encourage these shoppers to return to the store and repeat their previous behaviour.


An area in which clustering and clustering algorithms alike were arguably forged is in web applications. Web search is perhaps the most obvious of these applications. A lot of modern day web search engines return results to search queries as ordered lists of site links, with small descriptions of the related sites. However this can be a somewhat clumsy and cumbersome approach to search, especially in the case of a general search query (see the previous "rugby" example). An alternative approach is search result clustering, through which results are clustered into groups of similar results.


Clustering and clustering algorithms emerged with the advent of the technologies that facilitate them, namely computers. Thus, as a field of science, it is in its relative infancy. See the evolution progression in the following chart.

In recent years there has been significant developments in clustering algorithms for web applications. There has also been an observable emergence of new applications for these algorithms. Recent advances in clustering have been in part due to broad multi-disciplinary contributions in the wake of wide adoption.

A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge

The subject of this paper, published in 1997, is perhaps at first impression not as directly relevant to my project as the other papers mentioned previously. It has an emphasis on learning and the acquisition of knowledge from the viewpoint of philosophical science; however it candidly employs Latent Semantic Analysis (LSA) for its experimental purposes. Thus, Landauer and Dumais provide a thorough review of LSA and its application in real experiments.

What is LSA? It is a "high-dimensional linear associative model that embodies no human knowledge beyond its general learning mechanism". (p.211) Alternatively, LSA can be described in its “bare mathematical formalism” in the “singular-value-decomposition matrix model”. (p.218)
The singular-value-decomposition (SVD) realization of LSA is a "general method for linear decomposition of a matrix into independent principal components". (p.218) There are a number of methods of SVD application, however within the experiments conducted for this report 'tf-idf' was employed.
Term-frequency inverse-document-frequency (tf-idf) is a formula that reflects the importance of a term within a document relative to the frequency of its occurrence across the corpus. It should be noted that only terms that occurred in "at least two samples" are of interest. (p.218) Thus, for the purpose of SVD, the data is compressed in size.
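A minimal Python sketch of this weighting, including the "at least two samples" filter, might look as follows. The whitespace tokenisation and log-based idf are illustrative assumptions, not the exact formulation used in the paper:

```python
import math

def tfidf_matrix(docs):
    """Build a tf-idf weighted term-document mapping, keeping only
    terms that occur in at least two documents (cf. the 'at least
    two samples' filter). Tokenisation is naive whitespace splitting."""
    tokenised = [doc.lower().split() for doc in docs]
    # document frequency of each term
    df = {}
    for tokens in tokenised:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    # drop terms appearing in fewer than two documents
    vocab = sorted(t for t, n in df.items() if n >= 2)
    n_docs = len(docs)
    weights = {}
    for j, tokens in enumerate(tokenised):
        for term in vocab:
            tf = tokens.count(term) / len(tokens)
            idf = math.log(n_docs / df[term])
            weights[(term, j)] = tf * idf
    return vocab, weights
```

Terms occurring in only one document are discarded before weighting, so the matrix handed to SVD is smaller than the raw vocabulary would suggest.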

LSA is then employed within numerous experiments so as to provide a means for mimicking human learning.

Improving quality of search results clustering with approximate matrix factorisations

In popular Internet search engines, results to search queries are typically returned as linear lists of document summaries - this is the query-list paradigm. A fundamental constraint of this type of scheme is that the results are not summarised by relevant topics, lending to sometimes inefficient Internet search.
Consider the following search "rugby". The query-list paradigm would return a linear list ordering documents by some criteria. This general query however could relate to rugby league, rugby union, the rugby world cup or any number of other topics.
An alternative approach to ordered lists is search result clustering. Search results clustering would see the results returned in groups of similar results - providing the searcher with well-structured search results.
Search results clustering employs a class of algorithms known as post-retrieval document clustering algorithms. These algorithms identify all of the major and outlier topics inherent within search results (generate clusters), label the clusters and automatically group the results.
The author then identifies a number of search result clustering algorithms, however observes that none of these algorithms explicitly address the problem of cluster description quality. He proposes that the Lingo algorithm could lead to better cluster description quality.
The Lingo algorithm is said to be a description-comes-first approach to clustering, as meaningful cluster labels are identified prior to document assignment. Fundamental to this process is this algorithm's use of the matrix factorisation technique of Singular-Value Decomposition (SVD). This technique however is not ideal in all cases.
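The description-comes-first idea can be sketched crudely in Python: labels (vectors) are fixed first, then each document joins every label it is sufficiently similar to. The cosine measure, vectors and threshold here are illustrative assumptions, not Lingo's actual internals:

```python
import numpy as np

def assign_documents(label_vecs, doc_vecs, threshold=0.3):
    """Description-comes-first assignment sketch: cluster labels are
    chosen first, then each document joins every label whose vector
    it matches with cosine similarity >= threshold. Documents may
    therefore belong to several clusters, or to none."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    clusters = {i: [] for i in range(len(label_vecs))}
    for j, d in enumerate(doc_vecs):
        for i, l in enumerate(label_vecs):
            if cosine(l, d) >= threshold:
                clusters[i].append(j)
    return clusters

# Two orthogonal label vectors; each toy document leans toward one label.
labels = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
docs = [np.array([0.9, 0.1]), np.array([0.1, 0.9])]
groups = assign_documents(labels, docs)
```

The key inversion relative to conventional clustering is visible in the loop order: labels exist before any document is assigned, rather than being derived from already-formed groups.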
The premise of this very interesting 2006 paper is to evaluate the performance of different matrix factorisation techniques as parts of a description-comes-first search results clustering algorithm.
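As background to the factorisation comparison, SVD decomposes a term-document matrix A into U, a diagonal of singular values, and Vt; truncating to the k strongest singular values yields k "concept" base vectors. A minimal numpy sketch (the matrix here is invented for illustration):

```python
import numpy as np

# Invented term-document frequency matrix: rows = terms, cols = documents
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k strongest concepts; columns of U_k are the base vectors
k = 2
U_k = U[:, :k]

# Rank-k reconstruction approximates A
A_k = U_k @ np.diag(s[:k]) @ Vt[:k, :]
```

The Frobenius error of the rank-k truncation equals the discarded singular values, which is why SVD is a natural baseline for the approximate factorisations the paper evaluates.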
Matrix Factorisations
In this paper, the author analyses four matrix factorisation methods:
1. Non-negative Matrix Factorisation with Euclidean distance minimisation (NMF-ED);
2. Non-negative Matrix Factorisation with Kullback-Leibler divergence minimisation (NMF-KL);
3. Local Non-negative Matrix Factorisation (LNMF); and
4. Concept Decomposition (CD).
Experiment Setup
The experiments carried out by Osinski were performed using data sourced from the Open Directory Project (ODP). The data comprised 77 data sets, each containing a mixture of documents originating from two to eight manually selected ODP categories. 63 of these data sets, used in the topic separation experiment, were represented by an equal number of documents from each category, whilst the remaining 14 data sets, used in the outlier detection experiment, contained documents from four closely related ODP categories plus one or two categories dealing with a totally different subject.
Experiment Method
The tests were separated into three parts, namely:
1. Topic separation experiment;
2. Outlier detection experiment; and
3. Subjective label quality judgements.
Each of the previously mentioned factorisation techniques was employed. Also, a number of alternative search results clustering algorithms were tested: Lingo, Suffix Tree Clustering (STC) and Tolerance Rough Set Clustering (TRC). The second and third algorithms do not follow the description-comes-first paradigm.
Experiment Result Discussion
Factorisation Method Comparison
The experiments found NMF to significantly outperform the other factorisation methods with respect to topic and snippet coverage, while maintaining similar levels of cluster contamination (grouping purity). Osinski's rationale is that NMF, in contrast to SVD, produces non-negative base vectors, which allow for better matching of the frequent phrases found in the input snippets.
The experiments also found LNMF to perform worse with respect to cluster contamination. Osinski notes that the cluster labels generated by this factorisation method were shorter and more general than those of the other NMF methods.
Noteworthy is that the k-means-based factorisation was unable to detect any of the outliers, labelling only the major subjects. This was reasoned to be because k-means usually locates its centroids in the most dense areas of the input snippet space - away from the outliers.
SVD was the only factorisation technique to discover one of the smallest outliers.
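For concreteness, the best-performing family of methods can be illustrated with Lee and Seung's multiplicative-update rule for NMF (minimising squared Euclidean error). This is a generic numpy sketch on an invented matrix, not the implementation evaluated in the paper:

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9):
    """Factorise a non-negative matrix A ~ W @ H with W, H >= 0,
    using Lee-Seung multiplicative updates for squared error."""
    rng = np.random.default_rng(0)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 0.0]])
W, H = nmf(A, k=2)
# Columns of W (the base vectors) are non-negative, unlike SVD's
```

The non-negativity of W's columns is the property Osinski credits for NMF's better matching of frequent phrases.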
Algorithm Comparison
The experiments revealed that the Lingo algorithm, with its description-comes-first approach to search results clustering, outperformed both STC and TRC in the topic separation and outlier detection tests.

Saturday, July 28, 2007

How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans

This 1997 paper uses a series of experiments to compare how humans and a computer judge the perceived quality and quantity of knowledge conveyed within short student essays. The premise of these experiments is communicated in the paper's title: a 'bag of words', without word order, used by a computer to assess knowledge.
The computer uses a corpus-based statistical model known as Latent Semantic Analysis (LSA) for "inducing and representing aspects of the meaning of words and passages reflected in their usage" (p. 412). LSA relies on singular value decomposition (SVD) applied to a term-document frequency matrix derived from the corpus (a collection of textual documents).
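A minimal sketch of an LSA-style pipeline on an invented term-document matrix: factorise with SVD, truncate to a low-dimensional latent space, and compare documents by cosine similarity of their latent representations (the matrix and dimensions are illustrative only):

```python
import numpy as np

# Invented term-document count matrix (rows = terms, cols = essays)
A = np.array([[3.0, 2.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 1.0, 4.0],
              [0.0, 0.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # latent dimensions to keep
docs = (np.diag(s[:k]) @ Vt[:k, :]).T   # documents in latent space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Documents 0 and 1 share vocabulary; document 2 does not
sim_01 = cosine(docs[0], docs[1])
sim_02 = cosine(docs[0], docs[2])
```

Essay scoring along these lines compares a student essay's latent vector against those of reference texts, which is how similarity can be judged without using word order.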
Landauer, Laham, Rehder and Schreiner undertake two well-defined experiments in which LSA and human experts were put to work evaluating scientific-type short essays. The results of these experiments were that "LSA-based measures were closely related to human judgments as the latter were to each other and LSA measures predicted external measures of the same knowledge as well or better than did the human judgements" (p. 418).

Data Clustering: A Review

Jain, Murty and Flynn present a taxonomy of clustering techniques in their 1999 paper Data Clustering: A Review. The intention of the paper is to "survey the core concepts and techniques in the large subset of cluster analysis with its roots in statistics and decision theory" (p. 266).
In the introduction of the paper, the authors importantly define clustering as a means of unsupervised learning and discern how this science differs from supervised learning. Clustering facilitates the grouping of a given collection of unlabelled patterns into meaningful clusters (of similar data points). This is accomplished by measuring proximity (similarity) between data points through well-defined algorithms. In supervised learning schemes, on the other hand, the data set is already classified into labelled patterns; the problem is instead to label a newly encountered, unlabelled pattern.
The paper then moves on to identify and describe the principal components of a clustering task. These can be separated into three primary steps and two additional, if-needed steps:
1. pattern representation: this step is concerned with the preprocesses of feature selection and feature extraction, where the former is the process of identifying the most effective subset of the original features to use in clustering, while the latter relates to the use of transformation(s) of the input features to produce new, more effective features;
2. definition of a pattern proximity measure appropriate to the data domain e.g. Euclidean distance;
3. clustering or grouping stage: this is the crux of unsupervised learning. Clusters can be either hard (definite partitions between groups) or alternatively fuzzy (where a pattern has varying degree of membership in each of the output clusters);
4. data abstraction: this step relates to the process of extracting simple and compact descriptions of the clustered patterns;
5. assessment of output: this post-process aims to use specific measures of optimality to review the validity of clustering outcomes. Typically, statistical techniques and hypothesis type tests are employed to facilitate this process.
The authors then delve into the motivation for their paper, stating that previous critical analyses of clustering algorithms had failed to provide a framework for assessing the trade-offs of different techniques and for deciding which algorithm(s) to employ in a particular application. The authors observe that clustering algorithms contain implicit assumptions about the nature of the distribution of a data set; that is, the diversity of the clustering algorithms available to practitioners is a consequence of the impossibly large number of possible unsupervised learning problems in existence. Hence, with reference to the previously mentioned framework, the motivation for the paper.
Data Clustering: A Review is loosely organised by preprocesses, similarity measures, a taxonomy of clustering methods and techniques, and clustering applications.
Preprocesses: pattern representation, feature selection and extraction
This short section of the paper presents a number of observations about these preprocesses. With respect to pattern representation, it is noted that good representation can lead to simple and easily understood clustering i.e. polar coordinate versus Cartesian coordinates for a curvilinear cluster.
Similarity Measures
Similarity is the fundamental concept of unsupervised learning. In practice, it is dissimilarity between two patterns that is typically employed within learning schemes. Dissimilarity is measured by a distance measure defined on the feature space.
The most popular distance measure for continuous features is the Euclidean distance (a special case of the Minkowski metric). This technique's popularity centres on its intuitive appeal - it is the familiar distance between objects in two- and three-dimensional space.
This section of the paper reviews the Euclidean distance technique and introduces and similarly analyses a number of other similarity measures. These include distances based on counts for nominal attributes, syntactic approaches and the mutual neighbour distance (MND).
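The relationship between these metrics can be made concrete: Euclidean distance is the p = 2 case of the Minkowski metric, and Manhattan distance the p = 1 case. A small sketch:

```python
def minkowski(x, y, p=2):
    """Minkowski distance between two feature vectors.
    p = 1 gives Manhattan distance, p = 2 Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

d_euclid = minkowski((0, 0), (3, 4))        # Euclidean: 5.0
d_manhat = minkowski((0, 0), (3, 4), p=1)   # Manhattan: 7.0
```

Larger p weights the largest per-feature difference more heavily, which is one reason feature scaling matters before computing distances.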
A Taxonomy of Clustering Methods and Techniques
For the purpose of my project, this in-depth section of the paper is by far the most relevant.
The taxonomy describes the different approaches to data mining with a well-defined hierarchy. However, before traversing this hierarchy and each node's clustering algorithms the authors discuss five issues affecting different approaches regardless of their placement in the taxonomy:
1. Agglomerative ("bottom-up") versus divisive ("top-down"): the former relates to an algorithmic structure and operation that begins with each data point in a distinct cluster, and iteratively merges these clusters until some stopping condition is met. The latter, on the other hand, begins with all data points in a single cluster and iteratively splits clusters until some stopping condition is met;
2. Monothetic versus polythetic: in a polythetic approach all features (axes) are included in calculating distances between patterns, whilst in a monothetic approach features are considered sequentially. Note that a monothetic approach spawns 2^d clusters for d features;
3. Hard versus fuzzy: in a hard approach a data point is allocated a single cluster, whilst in a fuzzy approach a data point is assigned degrees of membership in several clusters;
4. Deterministic versus stochastic: the former approach sees optimisation sought through traditional (well-defined) techniques, whilst the latter performs a random search of the state space consisting of all possible labellings; and
5. Incremental versus non-incremental: incremental approaches are concerned with minimising the number of scans through the data set, while non-incremental approaches are not (such efficient approaches have emerged as data sets have grown larger).
Now, onto the taxonomy.
A. Hierarchical Clustering Algorithms
Hierarchical clustering algorithms produce classifications that can be considered from multiple levels of abstraction. A dendrogram is a representation of this nested grouping.
i. Single-link algorithm: the distance between two clusters is the minimum of all pairwise distances between data points drawn from the two clusters.
ii. Complete-link algorithm: the distance between two clusters is the maximum of all pairwise distances between patterns in the two clusters.
Algorithm (i) has a tendency to produce elongated clusters (the chaining effect), whilst algorithm (ii) produces tightly bound clusters. This aside, (i) is overall thought to be more versatile than (ii).
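The two linkage criteria can be sketched as functions over the pairwise distances between two clusters (a generic illustration; the points are invented):

```python
from itertools import product
import math

def single_link(c1, c2):
    """Minimum of all pairwise distances between the two clusters."""
    return min(math.dist(a, b) for a, b in product(c1, c2))

def complete_link(c1, c2):
    """Maximum of all pairwise distances between the two clusters."""
    return max(math.dist(a, b) for a, b in product(c1, c2))

c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(3.0, 0.0), (5.0, 0.0)]
# single link: 2.0 (closest pair); complete link: 5.0 (farthest pair)
```

Because single link merges clusters via their closest members, a chain of pairwise-close points can be absorbed into one long cluster - the chaining effect noted above.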
Three hierarchical clustering algorithms are then explained:
1. Agglomerative Single-Link Clustering Algorithm;
2. Agglomerative Complete-Link Clustering Algorithm; and
3. Hierarchical Agglomerative Clustering Algorithm.
B. Partitional Algorithms
Partitional clustering algorithms obtain a single partition of the data set instead of a nested clustering structure. They are robust in the sense that they are suitable for analysing large data sets; however, the choice of the number of clusters may seem arbitrary.
These algorithms generally discover clusters by optimising some criterion function. The most intuitive and frequently used criterion is the squared error criterion. K-means is the simplest and most commonly used algorithm employing this criterion. It works by starting with some random initial partition and then iteratively reassigning data points to clusters "based on the similarity between the data point and the cluster centres until a convergence criterion is met" (p. 278). This convergence criterion may be no reassignment of data points between clusters or a stable error term. Noteworthy are the variations of the k-means algorithm. In particular, a robust form of this algorithm will split clusters whose distance between data points is above some threshold and merge clusters whose distance is below this threshold.
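The basic k-means loop just described can be sketched in numpy; the data, initialisation and convergence test here are illustrative only:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Basic k-means: iteratively reassign points to the nearest
    centroid until assignments stop changing."""
    rng = np.random.default_rng(seed)
    # Random initial centroids chosen from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):
        # Distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # convergence: no reassignment of data points
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels, centroids = kmeans(X, k=2)
# The two tight pairs end up in separate clusters
```

Each reassignment step cannot increase the squared error, which is why the loop terminates at a (possibly local) minimum of the criterion.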
Another type of partitional algorithm is graph-theoretic clustering. The best-known graph-theoretic clustering is based on the construction of the minimal spanning tree (MST) of the data, and then deleting the longest edges so as to form clusters.
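The MST-based approach above can be sketched as follows: build a minimal spanning tree (Prim's algorithm here), then delete the longest edges so that k connected components remain (a generic illustration on invented points):

```python
import math

def mst_clusters(points, k):
    """Cluster by building an MST (Prim's algorithm) and deleting
    the k-1 longest edges, leaving k connected components."""
    n = len(points)
    dist = lambda i, j: math.dist(points[i], points[j])
    in_tree = {0}
    edges = []  # (length, i, j)
    while len(in_tree) < n:
        best = min((dist(i, j), i, j)
                   for i in in_tree for j in range(n) if j not in in_tree)
        edges.append(best)
        in_tree.add(best[2])
    # Keep the n-k shortest edges, then find components via union-find
    edges.sort()
    kept = edges[: n - k]
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, i, j in kept:
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
labels = mst_clusters(pts, k=2)
# The two nearby pairs fall into separate components
```

Cutting the longest MST edges is closely related to single-link clustering, since both merge groups via their closest members.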
The taxonomy then broadly introduces nearest neighbour clustering in which an unlabeled data point is assigned the cluster of its nearest labeled neighbour (within some distance bounds).
The next part of the paper goes into the representation of clusters, that is, how the outcome of classification is in fact summarised and communicated. Three representations are suggested:
1. Represent a cluster of data points by their centroid (when clusters are compact or isotropic) or by distant points in the cluster (when clusters are elongated or non-isotropic);
2. Represent clusters using nodes in a classification tree; and
3. Represent clusters by using conjunctive logical expressions.
The taxonomy arrives next at Artificial Neural Networks (ANNs), which are said to be massively parallel distributed networks of simple processing units, i.e. neurons. ANNs process only numeric vectors and learn adaptively by adjusting their interconnection weights. Examples of ANNs used for clustering include Kohonen's learning vector quantisation (LVQ) and self-organising map (SOM).
The next group of algorithms introduced are evolutionary approaches to clustering. These algorithms employ the evolutionary operators of selection, recombination, and mutation in potential solution transformations. The best-known evolutionary techniques used in clustering are genetic algorithms (GAs), evolution strategies (ESs), and evolutionary programming (EP).
Some other unsupervised learning techniques brushed over include fuzzy clustering algorithms, mixture-resolving and mode-seeking algorithms as well as search-based approaches (like the branch-and-bound technique and clustering based on simulated annealing).
The taxonomy then moves into critical evaluation, providing a summary of previous studies comparing clustering techniques and their respective findings. The dimensions of these comparisons largely relate to effectiveness (how good the clustering outcomes were), efficiency (the speed of the classification) and other data-set-related concerns (dimensionality, size and distribution). Various results were attained through testing and empirical studies. Obviously, some algorithms performed better than others.
The framework is further explored through a discussion of the subjective nature of clustering - every clustering algorithm uses some type of knowledge either implicitly or explicitly.
Large data set type problems are then addressed with the authors providing a description of the Incremental Clustering Algorithm as a suggested solution.
Four applications of data mining are then explored in great detail. This exploration process provides a vehicle for the discussion and critical analysis of a lot of the concepts introduced throughout the paper.
A summary ties together some of the loose ends of previous sections and reiterates some of the important aspects of clustering.

Monday, July 23, 2007

The Lingo Algorithm

The Lingo Algorithm was developed by Dawid Weiss and Stanislaw Osinski at Poznan University of Technology, Poland. Their project Carrot2 is an open source framework for building search clustering engines.
Carrot2 is the project pertaining to the development of this open-source search results clustering engine (Weiss and Osinski, 2007a). The engine employs the Lingo algorithm to produce understandable and diverse thematic clusters. (Ibid) The project has spawned a number of applications utilising this approach to automatic document classification. The release of the Carrot2 open-source standalone graphical user interface ("GUI") clustering tool provides an API framework through which to replicate Lingo algorithm experiments.

Carrot2 is a Java-based application consisting of a number of well-defined modules. These modules cover each of the text mining processes described in a previous posting. The Carrot2 API specification provides a structured presentation of these modules, describing their Java interfaces, classes and methods. (Ibid) Moreover, the API makes available a number of well-explained demos and examples, assisting ease of use. The specification also houses a library of Carrot2 filters; this library is comprehensive and, for each algorithm, presents a hierarchy of its own composite interface and class summaries. Moreover, the Carrot2 application source code provides a basis for the replication of the Lingo-based clustering scheme, as well as a platform for implementing the framework necessary for my experiments.

A Concept-Driven Algorithm for Clustering Search Results

This paper is an interest-type piece from a 2005 IEEE publication. It is an interest-type piece inasmuch as it is written not in academic prose but in a more reader-friendly and informal manner. This being said, it is highly informative and provides great insight into the Lingo algorithm for clustering search results.
The authors (Osinski & Weiss, 2005) identify that in popular search engines, search results match the searcher's question rather than attempt to answer it. This notion forms the premise for the rest of the paper, in which they describe the algorithm (the Lingo algorithm) that facilitates the separation of search results into meaningful groups.
The Lingo Algorithm has a number of well-defined phases:
1. Preprocessing;
2. Phrase Extraction;
3. Cluster-Label Induction; and
4. Cluster-Content Allocation.
Finally, Osinski & Weiss evaluate the Lingo algorithm through empirical and analytical evaluation. They explain the experiments carried out and summarise their results.

Project Plan

On Thursday 5 July, I met with my project supervisor to discuss the coming semester's project. We talked at length about the parameters of the project, as well as my goals and his experience with undergraduate theses.

Jorge was able to point me toward a number of important publications in the area of clustering algorithms:

1. Jain, A., Murty, M., & Flynn, P. (1999). 'Data clustering: a review', ACM Computing Surveys, vol. 31, no. 3 (pp.264-324).

2. Landauer, T., & Dumais, S. (1997). 'A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge', Psychological Review, vol. 104, no. 2 (pp.211-240).

3. Landauer, T., Laham, D., Rehder, B., & Schreiner, M. (1997). 'How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans', M.G. Shafto & P. Langley (Eds.), Proceedings of the 19th annual meeting of the Cognitive Science Society (pp. 412-417).

4. Osinski, S. (2006). 'Improving quality of search results clustering with approximate matrix factorisations', M. Lalmas et al. (Eds.): ECIR 2006, LNCS 3936 (pp.167-178).

5. Osinski, S., & Weiss, D. (2005). 'A concept-driven algorithm for clustering search results', IEEE Intelligent Systems, vol. 20, no. 3 (pp.48-54).

We briefly discussed each of these papers, and identified experiments which could be possibly replicated within my project.

At the end of this meeting, both Jorge and I were excited about the coming semester and the opportunity to work together on something that we are both interested in.

Sunday, May 27, 2007

Project Confirmation

The project was confirmed on 2 May 2007.

I am to be supervised by Prof Jorge Villalon and Dr Rafael Calvo of the Web Engineering Group of the University of Sydney, Australia ("WEG") .

On 25 May, the Engineering students who will be working on thesis projects within WEG met for an introductory discussion of Semester 2's projects.

I am now to coordinate with Prof Villalon a time to meet at length to ascertain the detail of this particular thesis.