Tuesday, July 31, 2007

Improving quality of search results clustering with approximate matrix factorisations

In popular Internet search engines, results for a search query are typically returned as a linear list of document summaries - this is the query-list paradigm. A fundamental limitation of this scheme is that the results are not summarised by relevant topics, which can make Internet search inefficient.
Consider the query "rugby". The query-list paradigm would return a linear list of documents ordered by some relevance criterion. This general query, however, could relate to rugby league, rugby union, the rugby world cup or any number of other topics.
An alternative approach to ordered lists is search results clustering, in which results are returned in groups of similar documents - providing the searcher with well-structured search results.
Search results clustering employs a class of algorithms known as post-retrieval document clustering algorithms. These algorithms identify the major and outlier topics inherent in the search results (generating clusters), label the clusters and automatically group the results.
The author then identifies a number of search results clustering algorithms, but observes that none of them explicitly addresses the problem of cluster description quality. He proposes that the Lingo algorithm could lead to better cluster description quality.
The Lingo algorithm is said to be a description-comes-first approach to clustering, as meaningful cluster labels are identified prior to document assignment. Fundamental to this process is this algorithm's use of the matrix factorisation technique of Singular-Value Decomposition (SVD). This technique however is not ideal in all cases.
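As a rough sketch of what the SVD step involves (the matrix, terms and dimensions here are hypothetical illustrations, not the paper's actual data or code):

```python
import numpy as np

# Hypothetical term-document matrix: rows are terms, columns are
# result snippets, values are simple term-frequency weights.
A = np.array([[2., 1., 0., 0.],   # "league"
              [1., 2., 0., 0.],   # "union"
              [0., 0., 2., 1.],   # "world"
              [0., 0., 1., 2.]])  # "cup"

# SVD factors A = U @ diag(s) @ Vt; the leading columns of U act as
# abstract topic vectors over the term space, which a
# description-comes-first algorithm can match against candidate labels.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                  # number of topics to keep (chosen here by hand)
topics = U[:, :k]

# The full factorisation reconstructs A exactly.
assert np.allclose(U @ np.diag(s) @ Vt, A)
```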
The aim of this very interesting 2006 paper is to evaluate the performance of different matrix factorisation techniques as part of a description-comes-first search results clustering algorithm.
Matrix Factorisations
In this paper, the author analyses four matrix factorisation methods:
1. Singular Value Decomposition (SVD);
2. Non-negative Matrix Factorisation (NMF);
3. Local Non-negative Matrix Factorisation (LNMF); and
4. Concept Decomposition (CD).
Experiment Setup
The experiments carried out by Osinski used data sourced from the Open Directory Project (ODP). The data comprised 77 data sets, each containing a mixture of documents originating from two to eight manually selected ODP categories. 63 of these data sets, used in the topic separation experiment, drew an equal number of documents from each category, whilst the remaining 14 data sets, used in the outlier detection experiment, contained documents from four closely related ODP categories plus one or two categories dealing with a totally different subject.
Experiment Method
The tests were separated into three parts, namely:
1. Topic separation experiment;
2. Outlier detection experiment; and
3. Subjective label quality judgements.
Each of the previously mentioned factorisation techniques was employed. In addition, a number of search results clustering algorithms were tested: Lingo, Suffix Tree Clustering (STC) and Tolerance Rough Set Clustering (TRC). The latter two algorithms do not follow the description-comes-first paradigm.
Experiment Result Discussion
Factorisation Method Comparison
The experiments found NMF to significantly outperform the other factorisation methods with respect to topic and snippet coverage, while maintaining similar levels of cluster contamination (grouping purity). Osinski's rationale is that NMF, in contrast to SVD, produces non-negative base vectors, which allow better matching of the frequent phrases found in the input snippets.
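The sign difference between the two factorisations can be seen directly. Below is a minimal sketch, with a hypothetical term-document matrix and a bare-bones NMF via Lee-Seung multiplicative updates (an illustration of the technique in general, not the paper's implementation):

```python
import numpy as np

def nmf(A, k, iters=300, eps=1e-9):
    """Minimal NMF sketch via Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(0)
    W = rng.random((A.shape[0], k)) + eps
    H = rng.random((k, A.shape[1])) + eps
    for _ in range(iters):
        # Updates multiply by non-negative ratios, so W, H stay >= 0.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical term-document matrix of positive term weights.
A = np.array([[3., 1., 1., 1.],
              [1., 3., 1., 1.],
              [1., 1., 3., 2.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
W, H = nmf(A, k=2)

# The second SVD basis vector must mix signs (it is orthogonal to the
# strictly one-signed leading vector)...
print((U[:, 1] > 0).any() and (U[:, 1] < 0).any())  # True
# ...whereas every entry of the NMF factors is non-negative.
print((W >= 0).all() and (H >= 0).all())            # True
```

Since frequent phrases are themselves non-negative term vectors, a non-negative basis is the more natural thing to match them against.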
The experiments also found LNMF to perform worse with respect to cluster contamination. Osinski notes that the cluster labels generated by this factorisation method were shorter and more general than those produced by the other NMF variants.
Noteworthy is that the k-means-based factorisation was unable to detect any of the outliers, labelling only the major subjects. The reasoning given is that k-means usually locates its centroids in the most dense areas of the input snippet space - away from the outliers.
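This behaviour is easy to reproduce on toy data. The sketch below runs a minimal Lloyd's k-means on a hypothetical 1-D "snippet space" with two dense topics and a single outlier; neither centroid ends up anywhere near the outlier:

```python
import numpy as np

# Toy 1-D snippet space (hypothetical): two dense major topics
# around 0 and 5, plus a single outlier document at 50.
x = np.concatenate([np.arange(0, 1, 0.1),
                    np.arange(5, 6, 0.1),
                    [50.0]])

def kmeans_1d(x, k, iters=25):
    """Minimal Lloyd's k-means sketch on 1-D data."""
    centroids = x[:k].copy()  # naive initialisation from the data
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = x[labels == j].mean()
    return centroids

c = kmeans_1d(x, k=2)
# Both centroids settle near the dense regions; neither lands on
# the outlier at 50, so no cluster is devoted to it.
assert np.abs(c - 50.0).min() > 20
```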
SVD was the only factorisation technique to discover one of the smallest outliers.
Algorithm Comparison
The experiments revealed that the Lingo algorithm, with its description-comes-first approach to search results clustering, outperformed both STC and TRC in the topic separation and outlier detection tests.
