Clustering algorithms for web applications: August 2007

Saturday, August 25, 2007

Paper References

I have received a number of enquiries about the research I have undertaken thus far in my project. In response, I have decided to post the resources I have included to date:

1. AGHAGOLZADEH, M., SOLTANIAN-ZADEH, H. & ARAABI, B. N. (2006) Finding the Number of Clusters in a Dataset Using an Information Theoretic Hierarchical Algorithm. IEEE, 1336-1339.
2. AHONEN, H., HEINONEN, O., KLEMETTINEN, M. & VERKAMO, A. I. (1997) Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Document Collections. Department of Computer Science. Helsinki, University of Helsinki.
3. AMARASIRI, R., CEDDIA, J. & ALAHAKOON, D. (2005) Exploratory Data Mining Lead by Text Mining Using a Novel High Dimensional Clustering Algorithm. Fourth International Conference on Machine Learning and Applications (ICMLA'05). IEEE Computer Society.
4. ATKINSON-ABUTRIDY, J., MELLISH, C. & AITKEN, S. (2004) Combining Information Extraction with Genetic Algorithms for Text Mining. IEEE Intelligent Systems, 22-30.
5. AUMANN, Y., FELDMAN, R., YEHUDA, Y. B., LANDAU, D., LIPHSTAT, O. & SCHLER, Y. (1999) Circle Graphs: New Visualization Tools for Text-Mining. J.M. Zytkow and J. Rauch (Eds): PKDD'99, 277-282.
6. CALVO, R., JOSE, J. & ADEVA, G. (2006) Mining Text with Pimiento. IEEE Internet Computing, 27-35.
7. CHANG, H., HSU, C. & DENG, Y. (2004) Unsupervised Document Clustering Based on Keyword Clusters. International Symposium on Communications and Information Technologies 2004 (ISCIT 2004). Sapporo, Japan.
8. CHEN, J., YAN, J., ZHANG, B., YANG, Q. & CHEN, Z. (2006) Diverse Topic Phrase Extraction through Latent Semantic Analysis. Sixth International Conference on Data Mining (ICDM'06). IEEE Computer Society.
9. CODY, W., KREULEN, J., KRISHNA, V. & SPANGLER, W. (2002) The integration of business intelligence and knowledge management. IBM Systems Journal, 41, 697-713.
11. EL-BELTAGY, S. R. (2006) KP-Miner: A Simple System for Effective Keyphrase Extraction. IEEE, 1-5.
12. FAN, W., WALLACE, L. & RICH, S. (2006) Tapping the Power of Text Mining. Communications of the ACM, 49, 77-82.
13. GLEICH, D. & ZHUKOV, L. (2003) SVD Subspace Projections for Term Suggestion Ranking and Clustering. Claremont, California, Harvey Mudd College, Yahoo! Research Labs.
14. GRIMES, S. (2003) Decision Support: The Word on Text Mining. Intelligent Enterprise, 6, 12-13.
15. HOFMANN, D. G. T. (2006) Non-redundant data clustering. Knowledge and Information Systems, 1-24.
16. HSU, H. C. C. (2005) Using Topic Keyword Clusters for Automatic Document Clustering. Third International Conference on Information Technology and Applications (ICITA'05). IEEE.
17. IIRITANO, S. & RUFFOLO, S. (2001) Managing the Knowledge Contained in Electronic Documents: a Clustering Method for Text Mining. IEEE, 454-458.
18. JAIN, A. K., MURTY, M. N. & FLYNN, P. J. (1999) Data Clustering: A Review. ACM Computing Surveys, 31, 264-323.
19. JENSSEN, R., HILD II, K. E., ERDOGMUS, D., PRINCIPE, J. C. & ELTOFT, T. (2003) Clustering using Renyi's Entropy. IEEE, 523-528.
20. LAN, M., SUNG, S., LOW, H. & TAN, C. (2001) A Comparative Study on Term Weighting Schemes for Text Categorization. Department of Computer Science. Singapore, National University of Singapore.
21. LANDAUER, T. K. & DUMAIS, S. T. (1997) A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104, 211-240.
22. LANDAUER, T. K., LAHAM, D., REHDER, B. & SCHREINER, M. E. (1997) How Well can Passage Meaning be Derived without Using Word Order? A Comparison of Latent Semantic Analysis and Humans.
23. MONTES-Y-GOMEZ, M., GELBUKH, A., LOPEZ-LOPEZ, A. & BAEZA-YATES, R. (2001) Text Mining with Conceptual Graphs. IEEE, 898-903.
24. ONG, S. A. J. (2000) A Data Mining Strategy for Inductive Data Clustering: A Synergy Between Self Organising Neural Networks and K-Means Clustering Techniques. IEEE.
25. OSINSKI, S. (2006) Improving Quality of Search Results Clustering with Approximate Matrix Factorisations. M. Lalmas et al. (Eds.): ECIR 2006, 167-178.
26. OSINSKI, S. & WEISS, D. (2005) A Concept-Driven Algorithm for Clustering Search Results. IEEE Intelligent Systems, 48-54.
27. QIAN, Y. & SUEN, C. Y. (2000) Clustering Combination Method. IEEE, 732-735.
28. SHARMA, R. & RAMAN, S. (2003) Phrase-based Text Representation for Managing Web Documents. International Conference on Information Technology: Computers and Communication (ITCC'03). IEEE Computer Society.
29. SHEHATA, S., KARRAY, F. & KAMEL, M. (2006) Enhancing Text Clustering using Concept-based Mining Model. Sixth International Conference on Data Mining. IEEE Computer Society.
30. STEINBACH, M., KARYPIS, G. & KUMAR, V. (2006) A Comparison of Document Clustering Techniques. IEEE, 1-2.
31. TJHI, W. & CHEN, L. (2006) Flexible Fuzzy Co-Clustering with Feature-cluster Weighting. IEEE.
32. TSUJII, J. & ANANIADOU, S. (2005) Thesaurus or Logical Ontology, Which One Do We Need for Text Mining. Language Resources and Evaluation, 39, 77-90.
34. WEISS, S. M., APTE, C., DAMERAU, F. J., JOHNSON, D. E., OLES, F. J., GOETZ, T. & HAMPP, T. (1999) Maximizing Text-Mining Performance. IEEE, 63-69.
35. WIEMER-HASTINGS, P. & ZIPITRIA, I. (2000) Rules for Syntax, Vectors for Semantics. Edinburgh, University of Edinburgh.
36. WITTEN, I. H. & FRANK, E. (2005) Data Mining: Practical Machine Learning Tools and Techniques, San Francisco, Morgan Kaufmann Publishers.
37. WONG, P., COWLEY, W., FOOTE, H., JURRUS, E. & THOMAS, J. (2000) Visualizing Sequential Patterns for Text Mining. IEEE Symposium on Information Visualisation 2000 (InfoVis'00).
38. WU, H. & GUNOPULOS, D. (2002) Evaluating the Utility of Statistical Phrases and Latent Semantic Indexing for Text Classification. IEEE, 713-716.
39. YANG, H. & LEE, C. (2003) A Text Mining Approach on Automatic Generation of Web Directories and Hierarchies. IEEE/WIC International Conference on Web Intelligence (WI'03). IEEE.
40. YU, J. (2005) General C-Means Clustering Model. IEEE Computer Society, 1197-1211.
41. ZELIKOVITZ, S. & HIRSH, H. (2000) Using LSI for Text Classification in the Presence of Background Text. Piscataway, New Jersey, Rutgers University.
42. ZHONG, M., CHEN, Z. & LIN, Y. (2004) Using Classification and key phrase Extraction for information retrieval. IEEE, 3037-3041.

Paper Introduction

The ubiquitous adoption of information systems has resulted in an explosion of data. This phenomenon has been largely supported by a continually evolving technological landscape, most notably the Internet. The need for businesses and organisations to understand this data, hand in hand with the enabling progress in computer processing, has led to the birth of data mining.

Data mining is the process of abstracting patterns from data so as to learn something interesting or to make nontrivial predictions on unseen data (Frank, 2005). There are two learning paradigms employed in data mining systems, namely supervised and unsupervised learning. The former relies on data that carries some metadata, such as class labels, while the latter works with data for which no such labels are available.

Unsupervised learning employs clustering for the purposes of abstracting patterns from data. Clusters are groups of similar data points. Clustering algorithms are well-defined processes that facilitate learning. They generally identify similar data points, label points and generate cluster labels.
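As a rough illustration of the steps just described (find similar points, then label them), here is a minimal sketch of k-means, one of the best-known clustering algorithms. It is only an example under stated assumptions: the function name `kmeans`, the toy two-dimensional points and the choice of Euclidean distance are mine, and it is not the algorithm developed in this paper.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Group points into k clusters using the basic k-means iteration.

    Illustrative sketch only: the name, parameters and data are assumptions.
    """
    rng = random.Random(seed)
    # Start from k distinct points chosen at random as initial centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


# Two well-separated groups of toy points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The labels here are plain cluster indices. Producing human-readable cluster labels, which is the harder problem for web applications, is a separate step.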

There is a dichotomy in data mining problems. Traditionally, data of a numerical or structured nature was employed within learning schemes. However, in the wake of the Internet there has been a shift toward unstructured or textual-type data problems, which has in turn led to the surfacing of text mining as a sub-type of data mining. Mining text-based datasets requires additional consideration due to their apparent randomness and embedded semantic meanings.

Text mining requires that the acquired unstructured data undergo pre-processing so as to produce “some kind of semi-structured representation” for use in the discovery phase (Baeza-Yates, 2001). Generally, pre-processing involves transforming text-based datasets (corpora) into matrices pertaining to documents, terms and document-term weightings (Weiss, 2005). In recent years, there has been a lot of research centring on the optimisation of these operations. A popular approach to text mining that employs this type of scheme is latent semantic analysis (“LSA”).
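To make the pre-processing step more concrete, the sketch below turns a tiny corpus into a document-term matrix of weights. The weighting used is tf-idf, which is an assumption on my part: it is one common scheme, and the cited papers may use others. The helper names `tokenize` and `tfidf_matrix` and the three-document corpus are invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real pre-processing."""
    return text.lower().split()


def tfidf_matrix(corpus):
    """Return (terms, matrix), where matrix[d][t] is the tf-idf weight of
    term t in document d. Illustrative sketch; names are assumptions."""
    docs = [Counter(tokenize(text)) for text in corpus]
    # The dictionary of terms: every distinct word seen in the corpus.
    terms = sorted(set().union(*docs))
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = {t: sum(1 for d in docs if t in d) for t in terms}
    # Weight = raw term frequency * log(inverse document frequency).
    matrix = [
        [d[t] * math.log(n_docs / df[t]) for t in terms]
        for d in docs
    ]
    return terms, matrix


corpus = ["web search clustering", "clustering web documents", "text mining"]
terms, matrix = tfidf_matrix(corpus)
for row in matrix:
    print([round(w, 2) for w in row])
```

Terms that occur in few documents receive higher weights than terms spread across the corpus, and a term that does not occur in a document gets a weight of zero.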

LSA is a text mining method for “inducing and representing aspects of the meaning of words and passages reflected in their usage” (Schreiner, 1997). LSA converts a corpus into a term-document matrix in which each cell holds the frequency of a term in a given document. Central to the formation of this matrix is the need to first identify a dictionary of relevant terms from the collection of documents (Hampp, 1999). This is the “bag of words” approach, which takes into account neither word order nor word context within documents (Schreiner, 1997). Landauer et al found LSA to perform just as well as human judges in classification (Ibid).

Recent work in text mining has moved toward schemes that use “more complete representations than just key words ... [as there is a belief] that these new representations will expand the kinds of discovered knowledge” (Baeza-Yates, 2001). Karray et al proposed a concept-based mining model that captures the semantics of text through analysis of both the sentence and the document (Kamel, 2006). They demonstrated that their method enhanced the automatic clustering of documents (Ibid). Osinski and Weiss developed the Lingo algorithm in order to improve the quality of cluster labels. They employed a description-comes-first approach to clustering in which “good, conceptually varied cluster labels” are found prior to assigning “documents to the labels to form groups” (Weiss, 2005). The labels are deduced through key phrase extraction. They found key phrases to be frequent phrases and performed extraction through a version of SHOC’s phrase extraction algorithm (Ibid).

In this paper, I intend to improve the automated extraction of key phrases from corpora using identified semantically important words for the purposes of forming better semi-structured data for clustering.
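To illustrate what treating key phrases as frequent phrases can look like, here is a deliberately naive sketch that counts word n-grams across a corpus and keeps those that recur. It is not SHOC's phrase extraction algorithm and not the method proposed in this paper. The function name `frequent_phrases`, its parameters `max_len` and `min_count`, and the sample corpus are all assumptions made for the example.

```python
from collections import Counter


def frequent_phrases(corpus, max_len=3, min_count=2):
    """Return (phrase, count) pairs for word n-grams of 2..max_len words
    that occur at least min_count times across the corpus.

    Naive illustrative sketch; names and thresholds are assumptions.
    """
    counts = Counter()
    for text in corpus:
        words = text.lower().split()
        for n in range(2, max_len + 1):
            # Slide a window of n words over the document.
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    frequent = [(p, c) for p, c in counts.items() if c >= min_count]
    # Most frequent first; on ties prefer longer phrases, then alphabetical order.
    return sorted(frequent, key=lambda pc: (-pc[1], -len(pc[0].split()), pc[0]))


corpus = [
    "latent semantic analysis of text",
    "text mining with latent semantic analysis",
    "clustering search results",
]
phrases = frequent_phrases(corpus)
print(phrases)
```

Counting every n-gram like this is simple, but it ignores which words are semantically important, which is the gap this paper aims to address.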

Chapter two of this paper provides a background of related research, setting context and providing a platform for the remaining chapters. It is organised into three primary and interrelated sections. Chapter three outlines the design of the clustering algorithms for web applications experiments. Chapter four presents the analysis and discussion of the experiment results. Chapter five presents the conclusions drawn from the previous chapters.

Monday, August 20, 2007

Background Structure

As alluded to in a previous post, the paper topic of "Clustering Algorithms for Web Applications" warrants the specific consideration of a number of interrelated areas of research. The background section of the paper has therefore been structured into three primary sections:

1. Text Mining
1.1 Text Mining Introduction
1.2 Text Mining Process
1.2.1 Data Pre-Processing
1.2.2 Text Mining
1.2.3 Result Interpretation and Refinement
1.3 Text Mining Algorithms, Evolution and Evaluation
1.4 Text Mining Software Design

2. Clustering
2.1 Clustering Process
2.2 Clustering Theoretical Framework
2.3 Clustering Algorithms
2.4 Cluster Representation
2.5 Result Interpretation and Refinement
2.6 Clustering Development and Evaluation

3. Clustering in Text Mining
3.1 Clustering Algorithms in Text Mining
3.1.1 Clustering Algorithms in Text Mining Introduction
3.1.2 Clustering Algorithms in Text Mining Evolution and Evaluation
3.2 Clustering Algorithms for Web Applications

This section of the thesis is intended to set the context and provide a platform for the remaining sections. Considering the above structure provides a snapshot of where this paper sits in terms of previous contributions and illustrates my approach to the topic.

Saturday, August 18, 2007

Paper Abstract - as at 19 August 2007

The entrenchment of the Internet into modern society has led to a proliferation of unstructured information. The amount of unstructured information has also been compounded by the broad adoption of information systems. Hence there is a growing need to extract the knowledge hidden within collections of documents, which has driven the emergence of text mining. Unsupervised learning schemes allow for the abstraction of patterns in data, and are typically facilitated by clustering algorithms. Recent progress in computational processing has provided greater opportunity to develop and employ clustering algorithms in text mining, leading to a myriad of contributions. Modern search engines continue to move towards automated web page retrieval that is more efficient, and provides the searcher with understandable and relevant search results. This paper focuses on clustering algorithms for web applications. A critical review of current algorithms, their evolution and an evaluation is presented. This analysis forms a framework through which to consider the nature of this progress and some of the limitations of the current clustering technologies. A number of text mining experiments are replicated and extended, with a comprehensive discussion to further illustrate these issues.
