Monday, October 29, 2007
Treatise Summary
1. What I did?
Sunday, October 14, 2007
Clustering Experiments
Sunday, October 7, 2007
Custom Lingo Document Clustering Application
Saturday, September 29, 2007
Removal of Fuzzy Documents
What am I left with?
How are these clusters distributed?
In the above chart, only clusters containing more than 50 documents are included; that is, clusters whose documents represent more than 0.6% of the corpus. These 'top 15' clusters together account for 91.4% of all documents. The remaining 774 documents are allocated to 'other'. The 'top two' clusters alone represent 68% of all documents, so the distribution of documents across clusters is heavily skewed rather than normal.
Tuesday, September 18, 2007
Text Mining Application
The above diagram summarises the information retrieval and document preprocessing stages of text mining as implemented by both KEA and Carrot2 (with Carrot2 specifically geared toward web search results). As alluded to in the post of September 2, Carrot2 explicitly uses Apache Lucene 2.2 (adjusted for web search results) to facilitate these processes. Each scheme broadly undertakes each of the points depicted in the central box, except for the point marked [], which relates to KEA's controlled indexing approach to text mining. The other marking within this box, *, flags variation between the two text mining schemes as well as options within the individual schemes. Each is now considered:
1. Stemming: a number of options are available to both schemes with regard to stemming:
a. Lovins Stemmer: the most aggressive stemmer, with some 294 nested rules;
b. Porter Stemmer: this more subdued approach has some 37 rules;
c. Partial Porter Stemmer: the Porter Stemmer has 5 multi-part stages, which allow the "miner" to be more or less conservative in the stemming process. The first-stage Porter stemmer is a popular methodology, handling basic plurals, e.g. horses becomes horse, processes becomes process, but men does not become man; and
d. No Stemmer.
2. Stop words: KEA defines some 499 stop words, Carrot2 some 324 (adjusted Lucene), Lucene itself only some 33, and the Brown Corpus some 425 (Sharma and Raman, 2003).
3. Vocabularies: this is only relevant to KEA. The "miner" is able to dictate a dictionary, thesaurus or list of terms when undertaking controlled indexing. In terms of implementation, these can be in either text form or Resource Description Framework ("RDF") form. A number of popular vocabularies are in circulation, including the Integrated Public Service Vocabulary ("IPSV") and the Agrovoc vocabulary.
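The preprocessing options above can be illustrated with a short sketch. This is a minimal, illustrative pipeline, not KEA's or Carrot2's actual implementation: the stop-word list is a tiny sample (nothing like KEA's 499 or Carrot2's 324 words), and the stemmer implements only the first-stage plural handling described in 1(c).

```python
# Minimal sketch of the shared preprocessing pipeline: tokenisation,
# stop-word removal, then first-stage (plural-handling) Porter-style stemming.
# Illustrative only; the stop-word list is a tiny sample.

STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "to", "is"}

def stem_step_1a(word):
    """Handle basic plurals, as in the first stage of the Porter stemmer."""
    if word.endswith("sses"):
        return word[:-2]      # processes -> process
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress
    if word.endswith("s"):
        return word[:-1]      # horses -> horse
    return word               # men stays men (no irregular plurals handled)

def preprocess(text):
    tokens = [t.lower() for t in text.split() if t.isalpha()]
    return [stem_step_1a(t) for t in tokens if t not in STOP_WORDS]
```

Swapping in the full Porter or Lovins stemmer, or a larger stop-word list, makes the pipeline correspondingly more or less aggressive.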
How do these options affect clustering?
It is essentially a case of horses for courses: a particular data set or text mining experiment may call for more aggressive or more conservative preprocessing.
How do these systems relate to my thesis?
I am now working on the specific implementation of clustering, LSA/SVD and the description-comes-first approach: replacing KEA's Naive Bayes classifier with Carrot2's Lingo algorithm. These frameworks are both Java-based and open-source.
KEA provides a very suitable fit for my experiments, as it is based not on an index of Internet search results but on a directory of text files. Carrot2's Lingo is an unsupervised learning approach and hence allows for clustering. A potential shortfall of this approach is that Lingo uses LSA/SVD and therefore forms a term-document matrix, making it unsuitable for scaling to the size of my proposed Reuters-21450 dataset (see also the post relating to preprocessing of this dataset). I have a number of papers which focus on such constraints (i.e. using smaller term dictionaries, more aggressive stemming [as alluded to above], restricting the number of identified phrases, either by increasing the minimum phrase length or decreasing the maximum phrase length, etc.). This leads to adapting Carrot2 for the purposes of larger datasets.
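To make the scalability concern concrete, here is a minimal sketch of the term-document matrix that an LSA/SVD-based scheme such as Lingo must form. The documents and the weighting (raw term frequency) are illustrative assumptions only; dense storage needs one row per dictionary term, which is why smaller term dictionaries and more aggressive stemming directly shrink the problem.

```python
from collections import Counter

def term_document_matrix(docs):
    """Build a dense term-document frequency matrix (terms x documents).

    Dense storage needs |terms| x |docs| cells: every term admitted to the
    dictionary adds a full row, which is the scalability concern for a
    corpus the size of Reuters-21450.
    """
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

# Three toy documents; any preprocessing (stemming, stop words) would be
# applied before this step and would reduce the number of rows.
docs = ["trade deficit widens", "trade talks stall", "crop yields rise"]
terms, m = term_document_matrix(docs)
```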
What are the limitations of such an approach?
Initial complexities associated with such an approach:
1. The purpose of Lingo is to create more understandable cluster labels; thus a tension exists between label quality (as well as the associated cluster quality under description-comes-first) and scalability;
2. Even with very aggressive stemming, scalability may still be an issue; that is, the dataset may simply be too large for such experiments. I could therefore reduce the number of concepts and associated documents.
Alternatively, I could adopt a different dataset. In particular, I could undertake experiments similar to those reported on the KEA website, in which the group uses the controlled Agrovoc vocabulary to extract key phrases from agricultural documents. I could extract phrases using the unsupervised KEA algorithm and compare them with the key phrases extracted using the adapted Carrot2 scheme; that is, using the same extraction and preprocessing techniques while allowing for different clustering and resultant key phrase extraction outcomes; and
3. Measuring cluster label quality and associated cluster quality: I will undertake further research into this issue beyond my previous research.
Monday, September 3, 2007
Preprocessing Reuters-21450 Dataset
Sunday, September 2, 2007
Proposed Carrot2 Framework
Using the Carrot2 API and source code to provide many of the peripheral text mining services allows for a stable environment in which to test the effect of changes to the assumed key phrases on the outcomes of clustering.
Clustering Experiments Dataset
Each document within the Reuters-21450 corpus has a title, a content section, one or more cluster assignments, and an indicator of length. The corpus is partitioned into these documents and their respective parts through specific text-based tags.
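As a sketch of this partitioning, the snippet below splits a corpus on paired text-based tags. The tag names (<DOC>, <TITLE>, <BODY>, <TOPICS>) are assumptions for illustration; the actual Reuters markup may use different tags.

```python
import re

# Illustrative corpus splitter. The tag names are hypothetical; the real
# Reuters tag vocabulary may differ.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)

def extract(tag, text):
    """Return the first <tag>...</tag> span's contents, or '' if absent."""
    m = re.search(r"<%s>(.*?)</%s>" % (tag, tag), text, re.DOTALL)
    return m.group(1).strip() if m else ""

def parse_corpus(raw):
    """Split the raw corpus into documents with title, content and topics."""
    docs = []
    for body in DOC_RE.findall(raw):
        docs.append({
            "title": extract("TITLE", body),
            "content": extract("BODY", body),
            "topics": extract("TOPICS", body).split(),  # cluster assignments
        })
    return docs
```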
Saturday, August 25, 2007
Paper References
2. AHONEN, H., HEINONEN, O., KLEMETTINEN, M. & VERKAMO, A. I. (1997) Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Document Collections. Department of Computer Science. Helsinki, University of Helsinki.
3. AMARASIRI, R., CEDDIA, J. & ALAHAKOON, D. (2005) Exploratory Data Mining Lead by Text Mining Using a Novel High Dimensional Clustering Algorithm. Fourth International Conference on Machine Learning and Applications (ICMLA'05). IEEE Computer Society.
4. ATKINSON-ABUTRIDY, J., MELLISH, C. & AITKEN, S. (2004) Combining Information Extraction with Genetic Algorithms for Text Mining. IEEE Intelligent Systems, 22-30.
5. AUMANN, Y., FELDMAN, R., YEHUDA, Y. B., LANDAU, D., LIPHSTAT, O. & SCHLER, Y. (1999) Circle Graphs: New Visualization Tools for Text-Mining. J.M. Zytkow and J. Rauch (Eds): PKDD'99, 277-282.
6. CALVO, R., JOSE, J. & ADEVA, G. (2006) Mining Text with Pimiento. IEEE Internet Computing, 27- 35.
7. CHANG, H., HSU, C. & DENG, Y. (2004) Unsupervised Document Clustering Based on Keyword Clusters. International Symposium on Communications and Information Technologies 2004 (ISCIT 2004). Sapporo, Japan.
8. CHEN, J., YAN, J., ZHANG, B., YANG, Q. & CHEN, Z. (2006) Diverse Topic Phrase Extraction through Latent Semantic Analysis. Sixth International Conference on Data Mining (ICDM'06). IEEE Computer Society.
9. CODY, W., KREULEN, J., KRISHNA, V. & SPANGLER, W. (2002) The integration of business intelligence and knowledge management. IBM Systems Journal, 41, 697-713.
11. EL-BELTAGY, S. R. (2006) KP-Miner: A Simple System for Effective Keyphrase Extraction. IEEE, 1-5.
12. FAN, W., WALLACE, L. & RICH, S. (2006) Tapping the Power of Text Mining. Communications of the ACM, 49, 77-82.
13. GLEICH, D. & ZHUKOV, L. (2003) SVD Subspace Projections for Term Suggestion Ranking and Clustering. Claremont, California, Harvey Mudd College, Yahoo! Research Labs.
14. GRIMES, S. (2003) Decision Support: The Word on Text Mining. Intelligent Enterprise, 6, 12-13.
15. HOFMANN, D. G. T. (2006) Non-redundant data clustering. Knowledge and Information Systems, 1-24.
16. HSU, H. C. C. (2005) Using Topic Keyword Clusters for Automatic Document Clustering. Third International Conference on Information Technology and Applications (ICITA'05). IEEE.
17. IIRITANO, S. & RUFFOLO, S. (2001) Managing the Knowledge Contained in Electronic Documents: a Clustering Method for Text Mining. IEEE, 454-458.
18. JAIN, A. K., MURTY, M. N. & FLYNN, P. J. (1999) Data Clustering: A Review. ACM Computing Surveys, 31, 264-323.
19. JENSEN, R., II, K. E. H., ERDOGMUS, D., PRINCIPE, J. C. & ELTOFT, T. (2003) Clustering using Renyi's Entropy. IEEE, 523-528.
20. LAN, M., SUNG, S., LOW, H. & TAN, C. (2001) A Comparative Study on Term Weighting Schemes for Text Categorization. Department of Computer Science. Singapore, National University of Singapore.
21. LANDAUER, T. K. & DUMAIS, S. T. (1997) A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104, 211-240.
22. LANDAUER, T. K., LAHAM, D., REHDER, B. & SCHREINER, M. E. (1997) How Well can Passage Meaning be Derived without Using Word Order? A Comparison of Latent Semantic Analysis and Humans.
23. MONTES-Y-GOMEZ, M., GELBUKH, A., LOPEZ-LOPEZ, A. & BAEZA-YATES, R. (2001) Text Mining with Conceptual Graphs. IEEE, 898-903.
24. ONG, S. A. J. (2000) A Data Mining Strategy for Inductive Data Clustering: A Synergy Between Self Organising Neural Networks and K-Means Clustering Techniques. IEEE.
25. OSINSKI, S. (2006) Improving Quality of Search Results Clustering with Approximate Matrix Factorisations. M. Lalmas et al. (Eds.): ECIR 2006, 167-178.
26. OSINSKI, S. & WEISS, D. (2005) A Concept-Driven Algorithm for Clustering Search Results. IEEE Intelligent Systems, 48-54.
27. QIAN, Y. & SUEN, C. Y. (2000) Clustering Combination Method. IEEE, 732-735.
28. SHARMA, R. & RAMAN, S. (2003) Phrase-based Text Representation for Managing Web Documents. International Conference on Information Technology: Computers and Communication (ITCC'03). IEEE Computer Society.
29. SHEHATA, S., KARRAY, F. & KAMEL, M. (2006) Enhancing Text Clustering using Concept-based Mining Model. Sixth International Conference on Data Mining. IEEE Computer Society.
30. STEINBACH, M., KARYPIS, G. & KUMAR, V. (2006) A Comparison of Document Clustering Techniques. IEEE, 1-2.
31. TJHI, W. & CHEN, L. (2006) Flexible Fuzzy Co-Clustering with Feature-cluster Weighting. IEEE.
32. TSUJII, J. & ANANIADOU, S. (2005) Thesaurus or Logical Ontology, Which One Do We Need for Text Mining. Language Resources and Evaluation, 39, 77-90.
34. WEISS, S. M., APTE, C., DAMERAU, F. J., JOHNSON, D. E., OLES, F. J., GOETZ, T. & HAMPP, T. (1999) Maximizing Text-Mining Performance. IEEE, 63-69.
35. WIEMER-HASTINGS, P. & ZIPITRIA, I. (2000) Rules for Syntax, Vectors for Semantics. Edinburgh, University of Edinburgh.
36. WITTEN, I. H. & FRANK, E. (2005) Data Mining: Practical Machine Learning Tools and Techniques, San Francisco, Morgan Kaufmann Publishers.
37. WONG, P., COWLEY, W., FOOTE, H., JURRUS, E. & THOMAS, J. (2000) Visualizing Sequential Patterns for Text Mining. IEEE Symposium on Information Visualization 2000 (InfoVis'00).
38. WU, H. & GUNOPULOS, D. (2002) Evaluating the Utility of Statistical Phrases and Latent Semantic Indexing for Text Classification. IEEE, 713-716.
39. YANG, H. & LEE, C. (2003) A Text Mining Approach on Automatic Generation of Web Directories and Hierarchies. IEEE/WIC International Conference on Web Intelligence (WI'03). IEEE.
40. YU, J. (2005) General C-Means Clustering Model. IEEE Computer Society, 1197-1211.
ZELIKOVITZ, S. & HIRSH, H. (2000) Using LSI for Text Classification in the Presence of Background Text. Piscataway, New Jersey, Rutgers University.
41. ZHONG, M., CHEN, Z. & LIN, Y. (2004) Using Classification and key phrase Extraction for information retrieval. IEEE, 3037-3041.
Paper Introduction
Data mining is the process of abstracting patterns from data so as to learn something interesting or to make nontrivial predictions on unseen data (Frank, 2005). There are two learning paradigms employed in data mining systems, namely supervised and unsupervised learning. The former learns from data labelled with some metadata, while the latter is concerned with unlabelled data within data mining applications.
Unsupervised learning employs clustering for the purposes of abstracting patterns from data. Clusters are groups of similar data points. Clustering algorithms are well-defined processes that facilitate this learning: they generally identify similar data points, group them into clusters and generate cluster labels.
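As a concrete example of these generic steps, the sketch below implements a minimal k-means procedure: points are grouped by similarity to centroids, and each final centroid serves as a rudimentary cluster "label". This is illustrative only and is not the clustering scheme used in my experiments.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: group similar points around k centroids.

    Each returned centroid acts as its cluster's rudimentary 'label'.
    Illustrative sketch only.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        # Update step: move each centroid to its cluster's mean.
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = tuple(sum(x) / len(c) for x in zip(*c))
    return centroids, clusters
```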
There is a dichotomy in data mining problems. Traditionally, data of a numerical or structured nature was employed within learning schemes. However, in the wake of the Internet there has been a shift toward unstructured or textual data problems, which has in turn led to the emergence of text mining as a sub-type of data mining. Mining text-based datasets requires additional consideration due to their apparent randomness and embedded semantic meanings.
Text mining requires that the acquired unstructured data undergo pre-processing so as to produce “some kind of semi-structured representation” for use in the discovery phase. (Baeza-Yates, 2001) Generally, pre-processing involves transforming text-based datasets (corpora) into matrices pertaining to documents, terms and document-term weightings. (Weiss, 2005) In recent years, there has been a great deal of research centred on the optimisation of these operations. A popular approach to text mining that employs this type of scheme is latent semantic analysis (“LSA”).
LSA is a text mining method for “inducing and representing aspects of the meaning of words and passages reflected in their usage”. (Schreiner, 1997) LSA converts a corpus into a term-document matrix where matrix cells are the frequencies of the term in a given document. Central to the formation of this matrix is the need to first identify a dictionary of relevant terms from the collection of documents. (Hampp, 1999) This approach is the “bag of words” method, which takes into account neither word order nor word context within documents. (Schreiner, 1997) Landauer et al found LSA to perform just as well as human judges in classification. (Ibid) Recent work in text mining has moved toward developing schemes that tend to use “more complete representations than just key words ... [as there is a belief] that these new representations will expand the kinds of discovered knowledge”. (Baeza-Yates, 2001) Karray et al proposed a concept-based mining model that captures semantics of text through analysis of both the sentence and document. (Kamel, 2006) They demonstrated that their method enhanced the automatic clustering of documents. (Ibid) Osinski and Weiss developed the Lingo algorithm in order to improve the quality of cluster labels. They employed a description-comes-first approach to clustering in which “good, conceptually varied cluster labels” are found prior to assigning “documents to the labels to form groups”. (Weiss, 2005) The labels are deduced through key phrase extraction. They found key phrases to be frequent phrases and performed extraction through a version of SHOC’s phrase extraction algorithm. (Ibid) In this paper, I intend to improve the automated extraction of key phrases from corpora using identified semantically important words for the purposes of forming better semi-structured data for clustering.
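Lingo's full SVD is beyond a short sketch, but the core LSA idea can be illustrated with power iteration, which recovers the dominant left singular vector of the term-document matrix A, i.e. the strongest latent "concept" in the corpus. This is a simplified stand-in for a full SVD, not the algorithm Osinski and Weiss actually use.

```python
import math

def dominant_concept(matrix, iters=100):
    """Approximate the dominant left singular vector of a term-document
    matrix A by power iteration on A * A^T.

    Large entries of the returned unit vector mark the terms that define
    the strongest latent 'concept'. Sketch only, not a full SVD.
    """
    n = len(matrix)
    m = len(matrix[0])
    v = [1.0] * n
    for _ in range(iters):
        # at_v = A^T * v  (length = number of documents)
        at_v = [sum(matrix[i][j] * v[i] for i in range(n)) for j in range(m)]
        # w = A * at_v    (length = number of terms)
        w = [sum(matrix[i][j] * at_v[j] for j in range(m)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```

On a toy matrix where two terms co-occur strongly across the same documents, the iteration converges to a vector weighting those two terms equally and the unrelated term near zero.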
Monday, August 20, 2007
Background Structure
1. Text Mining
1.1 Text Mining Introduction
1.2 Text Mining Process
1.2.1 Data Pre-Processing
1.2.2 Text Mining
1.2.3 Result Interpretation and Refinement
1.3 Text Mining Algorithms, Evolution and Evaluation
1.4 Text Mining Software Design
2. Clustering
2.1 Clustering Process
2.2 Clustering Theoretical Framework
2.3 Clustering Algorithms
2.4 Cluster Representation
3. Clustering in Text Mining
3.1 Clustering Algorithms in Text Mining
3.1.1 Clustering Algorithms in Text Mining Introduction
3.1.2 Clustering Algorithms in Text Mining Evolution and Evaluation
3.2 Clustering Algorithms for Web Applications
This section of the thesis is intended to set the context and provide a platform for the remaining sections. Considering the above structure provides a snapshot of where this paper sits in terms of previous contributions and illustrates my approach to the topic.
Saturday, August 18, 2007
Paper Abstract - as at 19 August 2007
Tuesday, July 31, 2007
Clustering Algorithms General Background
A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge
What is LSA? It is a "high-dimensional linear associative model that embodies no human knowledge beyond its general learning mechanism". (p.211) Alternatively, LSA can be described in its “bare mathematical formalism” in the “singular-value-decomposition matrix model”. (p.218)
LSA is then employed within numerous experiments so as to provide a means for mimicking human learning.
Improving quality of search results clustering with approximate matrix factorisations
Saturday, July 28, 2007
How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans
Data Clustering: A Review
Monday, July 23, 2007
The Lingo Algorithm
Carrot2 is a Java-based application consisting of a number of well-defined modules. These modules cover each of the text mining processes described in a previous posting. The Carrot2 API specification provides a structured presentation of the construction of these modules, describing their Java interfaces, classes and methods. (Ibid) Moreover, the API makes available a number of well-explained demos and examples assisting in ease of use. This specification also houses a library of Carrot2 filters. The library is comprehensive and, for each algorithm, presents a hierarchy of its own composite interface and class summaries. Moreover, the Carrot2 application source code provides a basis for the replication of the Lingo-based clustering scheme as well as presenting a platform for the implementation of the framework necessary for my experiments.
A Concept-Driven Algorithm for Clustering Search Results
Project Plan
Jorge was able to point me toward a number of important publications in the area of clustering algorithms:
1. Jain, A., Murty, M., & Flynn, P. (1999). 'Data clustering: a review', ACM Computing Surveys, vol. 31, no. 3 (pp.264-324).
2. Landauer, T., & Dumais, S. (1997). 'A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge', Psychological Review, vol. 104, no. 2 (pp.211-240).
3. Landauer, T., Laham, D., Rehder, B., & Schreiner, M. (1997). 'How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans', M.G. Shafto & P. Langley (Eds.), Proceedings of the 19th annual meeting of the Cognitive Science Society (pp. 412-417).
4. Osinski, S. 2006. 'Improving quality of search results clustering with approximate matrix factorisations', M. Lalmas et al. (Eds.): ECIR 2006, LNCS 3936 (pp.167-178).
5. Osinski, S., & Weiss, D. 2005. 'A concept-driven algorithm for clustering search results', IEEE Intelligent Systems, vol. 20, no. 3 (pp.48-54).
We briefly discussed each of these papers, and identified experiments which could be possibly replicated within my project.
At the end of this meeting, both Jorge and I were excited about the coming semester and the opportunity to work together on something that we are both interested in.
Sunday, May 27, 2007
Project Confirmation
I am to be supervised by Prof Jorge Villalon and Dr Rafael Calvo of the Web Engineering Group of the University of Sydney, Australia ("WEG").
On 25 May, Engineering students due to work on thesis projects within WEG met for an introductory discussion of Semester 2's projects.
I am now to coordinate with Prof Villalon a time to meet at length to ascertain the detail of this particular thesis.