Tuesday, July 31, 2007

Clustering Algorithms General Background

Clustering is the process of abstracting patterns from data sets; more specifically, it is the labelling of a data point as belonging to a group of similarly identified data points.


Clustering is facilitated by algorithms, which are well-defined instructions for carrying out some task. Clustering algorithms encompass the preprocessing of the data, extracting clusters from the data, generating descriptions (labels) of the identified clusters, and any post-processing such as assessing cluster quality.


The use of clustering schemes has become somewhat ubiquitous across many industries. As an example, clustering is used by retailers to abstract customer types and related behaviours. Consider Joe Bloggs Clothes Shop. The managers of this shop can employ a range of clustering algorithms to isolate specific types of shoppers - perhaps within some customer loyalty program. For instance, a clustering algorithm may identify a group of female shoppers who made more than three high-value purchases in the previous summer season. Coming into the next summer, Joe Bloggs can send these customers specific catalogues and promotions that encourage them to return to the store and repeat their previous behaviour.


An area in which clustering and clustering algorithms alike were arguably forged is web applications. Web search is perhaps the most obvious of these. Most modern web search engines return results to search queries as ordered lists of site links, each with a small description of the related site. However, this can be a clumsy and cumbersome approach to search, especially in the case of a general query (see the "rugby" example in the review of Osinski's paper below). An alternative approach is search result clustering, through which results are grouped into clusters of similar results.


Clustering and clustering algorithms emerged with the advent of the technologies that facilitate them - namely, computers. Thus, as a field of science, clustering is in its relative infancy.

In recent years there have been significant developments in clustering algorithms for web applications, as well as an observable emergence of new applications for these algorithms. Recent advances in clustering have been due in part to broad multi-disciplinary contributions in the wake of wide adoption.

A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge

The subject of this paper, published in 1997, is perhaps at first impression not as directly relevant to my project as the other papers mentioned previously. It has an emphasis on learning and the acquisition of knowledge from the viewpoint of philosophical science; however, it employs Latent Semantic Analysis (LSA) for its experimental purposes. Thus, Landauer and Dumais provide a thorough review of LSA and its application in real experiments.

What is LSA? It is a "high-dimensional linear associative model that embodies no human knowledge beyond its general learning mechanism". (p.211) Alternatively, LSA can be described in its “bare mathematical formalism” in the “singular-value-decomposition matrix model”. (p.218)
The singular-value-decomposition (SVD) realization of LSA is a "general method for linear decomposition of a matrix into independent principal components". (p.218) There are a number of ways to weight the matrix before applying SVD; within the experiments conducted for this report, 'tf-idf' weighting was employed.
Term-frequency inverse-document-frequency (tf-idf) is a formula that reflects the relative importance of a term within a document, offset by how frequently that term occurs across the corpus. It should be noted that only terms that occurred in "at least two samples" are of interest. (p.218) Thus, for the purpose of SVD, the data is compressed in size.
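To make the above concrete, here is a minimal sketch of tf-idf weighting followed by SVD truncation, using only numpy. The tiny corpus, the exact weighting formula and the rank-2 truncation are my own illustrative assumptions, not details drawn from Landauer and Dumais's experiments.

```python
import numpy as np

docs = [
    "rugby league scores rugby",
    "rugby union world cup scores",
    "world cup final scores",
]
token_lists = [d.split() for d in docs]

# Keep only terms occurring in at least two documents, mirroring the
# paper's restriction to terms found in "at least two samples".
vocab = sorted({t for toks in token_lists for t in toks
                if sum(t in other for other in token_lists) >= 2})

# Term-document matrix of raw counts (terms are rows, documents columns).
A = np.array([[toks.count(t) for toks in token_lists] for t in vocab], dtype=float)

# tf-idf: term frequency scaled by the log of inverse document frequency.
df = (A > 0).sum(axis=1)
idf = np.log(len(docs) / df)
W = A * idf[:, None]

# SVD decomposes W into independent components; truncating to the k
# largest singular values gives the low-rank approximation used by LSA.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 2
W_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(vocab)
print(np.round(W_k, 2))
```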

LSA is then employed within numerous experiments so as to provide a means for mimicking human learning.

Improving quality of search results clustering with approximate matrix factorisations

In popular Internet search engines, results to search queries are typically returned as linear lists of document summaries - this is the query-list paradigm. A fundamental constraint of this type of scheme is that the results are not summarised by relevant topics, making Internet search sometimes inefficient.
Consider the search query "rugby". The query-list paradigm would return a linear list ordering documents by some criterion. This general query, however, could relate to rugby league, rugby union, the rugby world cup or any number of other topics.
An alternative approach to ordered lists is search results clustering, which would see the results returned in groups of similar documents - providing the searcher with well-structured search results.
Search results clustering employs a class of algorithms known as post-retrieval document clustering algorithms. These algorithms identify all of the major and outlier topics inherent in the search results (generate clusters), label the clusters and automatically group the results.
The author then identifies a number of search results clustering algorithms, but observes that none of them explicitly addresses the problem of cluster description quality. He proposes that the Lingo algorithm could lead to better cluster description quality.
The Lingo algorithm is said to be a description-comes-first approach to clustering, as meaningful cluster labels are identified prior to document assignment. Fundamental to this process is this algorithm's use of the matrix factorisation technique of Singular-Value Decomposition (SVD). This technique however is not ideal in all cases.
The purpose of this very interesting 2006 paper is to evaluate the performance of different matrix factorisation techniques as components of a description-comes-first search results clustering algorithm.
Matrix Factorisations
In this paper, the author analyses four matrix factorisation methods (a minimal NMF sketch follows the list):
1. Singular Value Decomposition (SVD);
2. Non-negative Matrix Factorisation (NMF);
3. Local Non-negative Matrix Factorisation (LNMF); and
4. Concept Decomposition (CD).
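For reference, the sketch below implements generic NMF with the standard Lee-Seung multiplicative updates; Osinski's paper should be consulted for the exact variants and parameters he evaluated, which this toy example does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)

def nmf(A, k, iters=200, eps=1e-9):
    """Factorise a non-negative matrix A (terms x docs) as A ~ U @ V."""
    m, n = A.shape
    U = rng.random((m, k))
    V = rng.random((k, n))
    for _ in range(iters):
        # Multiplicative updates keep every entry non-negative, which is
        # what lets base vectors match frequent phrases directly.
        V *= (U.T @ A) / (U.T @ U @ V + eps)
        U *= (A @ V.T) / (U @ V @ V.T + eps)
    return U, V

A = rng.random((6, 4))          # a toy non-negative term-document matrix
U, V = nmf(A, k=2)
print(np.round(U @ V - A, 2))   # reconstruction error
```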
Experiment Setup
The experiments carried out by Osinski were performed using data sourced from the Open Directory Project (ODP). The data comprised 77 data sets, each containing a mixture of documents originating from two to eight manually selected ODP categories. 63 of these data sets, used in the topic separation experiment, were represented by an equal number of documents per category, whilst the remaining 14 data sets, used in the outlier detection experiment, contained documents from four closely related ODP categories plus one or two categories dealing with a totally different subject.
Experiment Method
The tests were separated into three parts, namely:
1. Topic separation experiment;
2. Outlier detection experiment; and
3. Subjective label quality judgements.
Each of the previously mentioned factorisation techniques was employed. In addition, a number of alternative search results clustering algorithms were tested: Lingo, Suffix Tree Clustering (STC) and Tolerance Rough Set Clustering (TRC). The latter two algorithms do not follow the description-comes-first paradigm.
Experiment Result Discussion
Factorisation Method Comparison
The experiments found NMF to significantly outperform the other factorisation methods with respect to topic and snippet coverage, while maintaining similar levels of cluster contamination (grouping purity). Osinski provides as rationale that NMF, in contrast to SVD, produces non-negative base vectors, which allow for better matching of frequent phrases found in input snippets.
The experiments also found LNMF to perform worse with respect to cluster contamination. Osinski notes that the cluster labels generated by this factorisation method tend to be shorter and more general than for the other NMF methods.
Noteworthy is that the k-means-based factorisation was unable to detect any of the outliers, labelling only the major subjects. This was reasoned to be because k-means usually locates its centroids in the most dense areas of the input snippet space - away from the outliers.
SVD was the only factorisation technique to discover one of the smallest outliers.
Algorithm Comparison
The experiments revealed that the Lingo algorithm and its description-comes-first approach to search results clustering outperformed both STC and TRC in the topic separation and outlier detection tests.

Saturday, July 28, 2007

How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans

This 1997 paper uses a series of experiments to compare the perceived quality and quantity of knowledge conveyed within short student essays, as judged by both humans and a computer. The premise of these experiments is communicated in the paper title: a 'bag of words', with word order discarded, is used by a computer to assess knowledge.
The computer uses a corpus-based statistical model known as Latent Semantic Analysis (LSA) for "inducing and representing aspects of the meaning of words and passages reflected in their usage" (p. 412). LSA transforms a corpus (a collection of textual documents) into a term-document frequency matrix and relies on singular value decomposition (SVD) to reduce it to a lower-dimensional semantic space.
Landauer, Laham, Rehder and Schreiner undertake two well-defined experiments in which LSA and human experts were put to work evaluating scientific-type short essays. The result of these experiments was that "LSA-based measures were closely related to human judgments as the latter were to each other and LSA measures predicted external measures of the same knowledge as well or better than did the human judgements" (p.418).
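The core mechanism is easy to illustrate: an essay is scored by its cosine similarity to reference text in the reduced LSA space. The vectors below are invented stand-ins for real LSA document vectors, not data from the paper.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

reference = np.array([0.8, 0.1, 0.5])  # hypothetical LSA vector of a model answer
essay = np.array([0.7, 0.2, 0.4])      # hypothetical LSA vector of a student essay

# A higher cosine means the essay conveys content closer to the model
# answer, regardless of word order - the "bag of words" premise.
print(round(cosine(reference, essay), 3))
```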

Data Clustering: A Review

Jain, Murty and Flynn present a taxonomy of clustering techniques in their 1999 paper Data Clustering: A Review. The intention of the paper is to "survey the core concepts and techniques in the large subset of cluster analysis with its roots in statistics and decision theory" (p. 266).
In the introduction of the paper, the authors importantly define clustering as a means for unsupervised learning and distinguish it from supervised learning. Clustering facilitates the grouping of a given collection of unlabelled patterns into meaningful clusters (of similar data points). This is accomplished by measuring proximity (similarity) between data points through well-defined algorithms. In supervised learning schemes, on the other hand, the data set is already classified into labelled patterns; the problem is instead to label a newly encountered, as-yet-unlabelled pattern.
The paper then moves on to identify and describe the principal components of a clustering task. These can be separated into three primary steps and two additional as-needed steps (a sketch mapping the steps onto library calls follows the list):
1. pattern representation: this step is concerned with the preprocesses of feature selection and feature extraction, where the former is the process of identifying the most effective subset of the original features to use in clustering, while the latter applies transformation(s) to the original features to produce new, more salient features;
2. definition of a pattern proximity measure appropriate to the data domain e.g. Euclidean distance;
3. clustering or grouping stage: this is the crux of unsupervised learning. Clusters can be either hard (definite partitions between groups) or alternatively fuzzy (where a pattern has varying degree of membership in each of the output clusters);
4. data abstraction: this step relates to the process of extracting simple and compact descriptions of clustered patterns;
5. assessment of output: this post-process aims to use specific measures of optimality to review the validity of clustering outcomes. Typically, statistical techniques and hypothesis type tests are employed to facilitate this process.
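The following runnable sketch maps the five steps onto common scikit-learn calls. The toy data and the choices of k-means and the silhouette score are my own assumptions for illustration, not prescriptions from the paper.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

raw = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 8.1], [7.9, 8.3]])

# Steps 1-2: pattern representation and proximity - standardised features
# with Euclidean distance (the default metric in what follows).
X = StandardScaler().fit_transform(raw)

# Step 3: grouping - a hard partition into two clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Step 4: data abstraction - represent each cluster by its centroid.
print(km.cluster_centers_)

# Step 5: assessment - a cluster validity measure over the partition.
print(silhouette_score(X, km.labels_))
```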
The authors then delve into the motivation for their paper, stating that previous critical analyses of clustering algorithms had failed to provide a framework for assessing the tradeoffs of different techniques and for deciding which algorithm(s) to employ in a particular application. The authors make the observation that clustering algorithms contain implicit assumptions about the nature of the distribution of a data set; that is, the diversity of the clustering algorithms available to practitioners is a condition of the impossibly large number of possible unsupervised learning problems in existence. Hence the motivation for the paper: to provide the previously mentioned framework.
Data Clustering: A Review is loosely organised by preprocesses, similarity measures, a taxonomy of clustering methods and techniques, and clustering applications.
Preprocesses: pattern representation, feature selection and extraction
This short section of the paper presents a number of observations about these preprocesses. With respect to pattern representation, it is noted that a good representation can lead to simple and easily understood clustering, e.g. polar versus Cartesian coordinates for a curvilinear cluster.
Similarity Measures
Similarity is the fundamental concept of unsupervised learning. In practice, it is dissimilarity between two patterns that is typically employed within learning schemes. Dissimilarity is measured by a distance measure defined on the feature space.
The most popular distance measure for continuous features is the Euclidean distance (a special case of the Minkowski metric). This technique's popularity centres on its intuitive appeal: it is the measure used to calculate distance between objects in two- and three-dimensional space.
This section of the paper reviews the Euclidean distance technique and introduces and similarly analyses a number of other similarity measures. These include distances based on counts for nominal attributes, syntactic approaches and the mutual neighbour distance (MND).
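A small sketch of the Minkowski family of distances, of which the Euclidean distance is the p = 2 case:

```python
import numpy as np

def minkowski(a, b, p):
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(a, b, 1))  # Manhattan distance: 7.0
print(minkowski(a, b, 2))  # Euclidean distance: 5.0
```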
A Taxonomy of Clustering Methods and Techniques
For the purpose of my project, this in-depth section of the paper is by far the most relevant.
The taxonomy describes the different approaches to clustering within a well-defined hierarchy. However, before traversing this hierarchy and each node's clustering algorithms, the authors discuss five issues affecting the different approaches regardless of their placement in the taxonomy:
1. Agglomerative ("bottom-up") versus divisive ("top-down"): the former relates to algorithmic structure and operation that begins with each data point in a distinct cluster, and iteratively merges these clusters until some stop condition is met. The latter, on the other hand, begins with all data points in a single cluster and iteratively splits clusters until some stop condition is met;
2. Monothetic versus polythetic: in the former approach features are considered sequentially, one at a time, whilst in the latter all features (axes) enter into the calculation of distances between patterns simultaneously. Note that a monothetic approach can spawn 2^d clusters for d features;
3. Hard versus fuzzy: in a hard approach a data point is allocated a single cluster, whilst in a fuzzy approach a data point is assigned degrees of membership in several clusters;
4. Deterministic versus stochastic: the former approach sees optimization sought through traditional, well-defined techniques, whilst the latter performs a random search of the state space consisting of all possible labelings; and
5. Incremental versus non-incremental: incremental approaches to clustering are concerned with minimizing the number of scans through the data set, while non-incremental approaches are not (incremental approaches have emerged as data sets have grown larger).
Now, onto the taxonomy.
A. Hierarchical Clustering Algorithms
Hierarchical clustering algorithms produce classifications that can be considered at multiple levels of abstraction. A dendrogram is a representation of this nested grouping.
i. Single-link algorithm: the distance between two clusters is the minimum of all pairwise distances between patterns drawn from the two clusters.
ii. Complete-link algorithm: the distance between two clusters is the maximum of all pairwise distances between patterns in the two clusters.
Algorithm (i) has a tendency to produce elongated clusters (the chaining effect), whilst algorithm (ii) produces tightly bound clusters. This aside, (i) is overall thought to be more versatile than (ii). A code sketch of both distance definitions follows the list below.
Three hierarchical clustering algorithms are then explained:
1. Agglomerative Single-Link Clustering Algorithm;
2. Agglomerative Complete-Link Clustering Algorithm; and
3. Hierarchical Agglomerative Clustering Algorithm.
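Here is the promised sketch of the two inter-cluster distances: single-link takes the minimum pairwise distance between the clusters, complete-link the maximum. The toy points are my own.

```python
import numpy as np

def pairwise(c1, c2):
    return [np.linalg.norm(x - y) for x in c1 for y in c2]

def single_link(c1, c2):
    return min(pairwise(c1, c2))

def complete_link(c1, c2):
    return max(pairwise(c1, c2))

c1 = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
c2 = [np.array([4.0, 0.0]), np.array([9.0, 0.0])]
print(single_link(c1, c2))    # 3.0
print(complete_link(c1, c2))  # 9.0
```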
B. Partitional Algorithms
Partitional clustering algorithms obtain a single partition of the data set instead of a nested clustering structure. They are robust in the sense that they are suitable for analysing large data sets; however, the choice of the number of clusters may seem arbitrary.
These algorithms generally discover clusters by optimizing some criterion function. The most intuitive and frequently used criterion is the squared error criterion. k-means is the simplest and most commonly used algorithm employing this criterion. It works by starting with some random initial partition and then iteratively reassigning data points to clusters "based on the similarity between the data point and the cluster centres until a convergence criterion is met" (p.278). This convergence criterion may be no reassignment of data points between clusters or a stable error term. Noteworthy are the variations of the k-means algorithm; in particular, a robust form of the algorithm will split clusters whose distance between data points is above some threshold and merge clusters whose separation is below that threshold.
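A minimal from-scratch k-means, following the loop described above (random initial centres, then reassignment until labels stop changing); the data and the simple initialisation are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(iters):
        # Assign each point to its nearest centre (squared error criterion).
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # convergence: no reassignment of data points
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels, centres

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
print(kmeans(X, k=2))
```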
Another type of partitional algorithm is graph-theoretic clustering. The best-known graph-theoretic clustering is based on the construction of the minimal spanning tree (MST) of the data, and then deleting the longest edges so as to form clusters.
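A sketch of that idea: build the MST with Prim's algorithm, delete the longest edges, and read the clusters off the remaining connected components. This is a generic illustration, not the specific algorithm from the paper.

```python
import numpy as np

def mst_edges(X):
    """Prim's algorithm: grow the tree one nearest outside point at a time."""
    n, in_tree, edges = len(X), {0}, []
    while len(in_tree) < n:
        best = min((np.linalg.norm(X[i] - X[j]), i, j)
                   for i in in_tree for j in range(n) if j not in in_tree)
        edges.append(best)
        in_tree.add(best[2])
    return edges

def mst_clusters(X, n_clusters):
    edges = sorted(mst_edges(X))
    if n_clusters > 1:
        edges = edges[:-(n_clusters - 1)]  # delete the longest edges
    # Union-find over the remaining edges yields the connected components.
    parent = list(range(len(X)))
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    for _, i, j in edges:
        parent[find(i)] = find(j)
    return [find(i) for i in range(len(X))]

X = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 0.0], [10.5, 0.0]])
print(mst_clusters(X, n_clusters=2))  # two groups of two points
```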
The taxonomy then broadly introduces nearest neighbour clustering in which an unlabeled data point is assigned the cluster of its nearest labeled neighbour (within some distance bounds).
The next part of the paper goes into the representation of clusters, that is, how the outcome of classification is in fact summarised and communicated. Three representations are suggested:
1. Represent a cluster of data points by their centroid (when clusters are compact or isotropic) or by a set of distant points in the cluster (when clusters are elongated or non-isotropic);
2. Represent clusters using nodes in a classification tree; and
3. Represent clusters by using conjunctive logical expressions.
The taxonomy arrives next at Artificial Neural Networks (ANNs), which are said to be massively parallel distributed networks of simple processing units, i.e. neurons. ANNs operate only on numeric vectors and learn adaptively by shifting their interconnection weights. Examples of ANNs used for clustering include Kohonen's learning vector quantization (LVQ) and self-organising map (SOM).
The next group of algorithms introduced are evolutionary approaches to clustering. These algorithms employ the evolutionary operators of selection, recombination, and mutation in potential solution transformations. The best-known evolutionary techniques used in clustering are genetic algorithms (GAs), evolution strategies (ESs), and evolutionary programming (EP).
Some other unsupervised learning techniques touched on include fuzzy clustering algorithms, mixture-resolving and mode-seeking algorithms, as well as search-based approaches (like the branch-and-bound technique and clustering based on simulated annealing).
The taxonomy then moves into critical evaluation, providing a summary of previous studies comparing clustering techniques and their respective findings. The dimensions of these comparisons largely relate to effectiveness (how good the clustering outcomes were), efficiency (the speed of the classification) and other data set related concerns (dimensionality, size and distribution). Various results were attained through testing and empirical studies, with different algorithms performing better under different conditions.
The framework is further explored through a discussion of the subjective nature of clustering - every clustering algorithm uses some type of knowledge either implicitly or explicitly.
Large data set problems are then addressed, with the authors providing a description of the Incremental Clustering Algorithm as a suggested solution.
Four applications of clustering are then explored in great detail. This exploration provides a vehicle for the discussion and critical analysis of many of the concepts introduced throughout the paper.
A summary ties together some of the loose ends of previous sections and reiterates some of the important aspects of clustering.

Monday, July 23, 2007

The Lingo Algorithm

The Lingo Algorithm was developed by Dawid Weiss and Stanislaw Osinski at Poznan University of Technology, Poland. Their project, Carrot2, is an open-source framework for building search results clustering engines. (Weiss and Osinski, 2007a) The Carrot2 engine employs the Lingo algorithm for clustering in text mining, producing understandable and diverse thematic clusters. (Ibid) The project has spawned a number of applications utilising this approach to automatic document classification, and the release of the Carrot2 open-source standalone graphical user interface ("GUI") clustering tool provides an API framework through which to replicate Lingo algorithm experiments.

Carrot2 is a Java-based application consisting of a number of well-defined modules. These modules cover each of the text mining processes described in a previous posting. The Carrot2 API specification provides a structured presentation of the construct of these modules, describing their Java interfaces, classes and methods. (Ibid) The API also makes available a number of well-explained demos and examples that assist ease of use. This specification additionally houses a comprehensive library of Carrot2 filters, presenting for each algorithm a hierarchy of its own composite interface and class summaries. Moreover, the Carrot2 application source code provides a basis for the replication of the Lingo-based clustering scheme as well as a platform for the implementation of the framework necessary for my experiments.

A Concept-Driven Algorithm for Clustering Search Results

This paper is an interest-type piece from a 2005 IEEE publication, in as much as it is not written in academic prose but instead in a more reader-friendly and informal manner. This being said, it is highly informative and provides great insight into the Lingo algorithm for clustering search results.
The authors (Osinski & Weiss, 2005) identify that in popular search engines, search results match the searcher's question rather than attempt to answer the question themselves. This notion forms the premise for the rest of the paper, in which they describe the algorithm (the Lingo Algorithm) that facilitates the separation of search results into meaningful groups.
The Lingo Algorithm has a number of well-defined phases (a simplified sketch of phases 3 and 4 follows the list):
1. Preprocessing;
2. Phrase Extraction;
3. Cluster-Label Induction; and
4. Cluster-Content Allocation.
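Below is the promised simplified sketch of the label-induction and content-allocation phases: SVD of a term-document matrix yields abstract concept vectors, the candidate phrase closest to each concept becomes its label, and documents are scored against that label. The tiny matrix and candidate phrases are invented for illustration; the real preprocessing, phrase extraction and thresholds are defined in the paper.

```python
import numpy as np

terms = ["rugby", "league", "union", "world", "cup"]
# A toy pre-weighted term-document matrix (columns are documents).
A = np.array([
    [1.0, 1.0, 0.0, 0.0],  # rugby
    [1.0, 0.0, 0.0, 0.0],  # league
    [0.0, 1.0, 0.0, 0.0],  # union
    [0.0, 0.0, 1.0, 1.0],  # world
    [0.0, 0.0, 1.0, 1.0],  # cup
])

# Candidate cluster labels expressed as vectors in term space.
phrases = {
    "rugby league": np.array([1.0, 1.0, 0.0, 0.0, 0.0]),
    "rugby union":  np.array([1.0, 0.0, 1.0, 0.0, 0.0]),
    "world cup":    np.array([0.0, 0.0, 0.0, 1.0, 1.0]),
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Phase 3, cluster-label induction: each left singular vector is an
# abstract concept; the best-matching phrase becomes its readable label.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
for concept in U[:, :2].T:
    label = max(phrases, key=lambda p: abs(cos(phrases[p], concept)))
    # Phase 4, cluster-content allocation: score each document column
    # against the chosen label's term vector.
    scores = [abs(cos(A[:, d], phrases[label])) for d in range(A.shape[1])]
    print(label, np.round(scores, 2))
```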
Finally, Osinski & Weiss evaluate the Lingo Algorithm through empirical and analytical evaluation. They explain the experiments carried out and summarise their results.

Project Plan

On Thursday 5 July, I met with my project supervisor to discuss the coming semester's project. We talked at length about the parameters of the project, as well as my goals and his experience with undergraduate theses.

Jorge was able to point me toward a number of important publications in the area of clustering algorithms:

1. Jain, A., Murty, M., & Flynn, P. (1999). 'Data clustering: a review', ACM Computing Surveys, vol. 31, no. 3 (pp.264-324).

2. Landauer, T., & Dumais, S. (1997). 'A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge', Psychological Review, vol. 104, no. 2 (pp.211-240).

3. Landauer, T., Laham, D., Rehder, B., & Schreiner, M. (1997). 'How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans', M.G. Shafto & P. Langley (Eds.), Proceedings of the 19th annual meeting of the Cognitive Science Society (pp. 412-417).

4. Osinski, S. (2006). 'Improving quality of search results clustering with approximate matrix factorisations', M. Lalmas et al. (Eds.), ECIR 2006, LNCS 3936 (pp.167-178).

5. Osinski, S., & Weiss, D. (2005). 'A concept-driven algorithm for clustering search results', IEEE Intelligent Systems, vol. 20, no. 3 (pp.48-54).

We briefly discussed each of these papers, and identified experiments that could possibly be replicated within my project.

At the end of this meeting, both Jorge and I were excited about the coming semester and the opportunity to work together on something that we are both interested in.