Sunday, September 2, 2007

Clustering Experiments Dataset

Data acquisition in text mining is concerned with gathering the information to be used in the mining process itself. For the purposes of this paper, the Reuters-21578 news corpus was the natural choice for the data mining experiments. It is regarded as the benchmark standard for automated text categorization: it is “estimated that this corpus has provided data for over 100 published research papers, particularly in the fields of information retrieval, natural language processing and machine learning” (Rose et al., 2002; Weiss et al., 1999). The corpus as used here comprises 10,788 news stories containing some 24,240 unique terms after stemming and the removal of stop words. So that the outcomes of text mining algorithms can be benchmarked, each story or document belongs to at least one category (on average 1.3 categories per document). There are 90 identified categories. The data is split into two sub-datasets according to the ModApte split (Lan et al., 2001). Thus, roughly 72 per cent of the corpus, or 7,769 documents, is taken to be the training set, whilst the remaining 28 per cent or so, 3,019 documents, forms the testing set.
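As a quick sanity check, the proportions of the split can be recomputed from the document counts quoted above. The variable names below are illustrative only and are not part of any corpus tooling.

```python
# Document counts quoted above for the training and testing sets.
train_docs = 7769
test_docs = 3019
total_docs = train_docs + test_docs

# Share of the corpus that falls into each sub-dataset.
train_share = train_docs / total_docs
test_share = test_docs / total_docs

print(f"total documents: {total_docs}")
print(f"training share: {train_share:.1%}, testing share: {test_share:.1%}")
```

The two counts sum to the 10,788 stories mentioned above, and the shares come out at roughly 72 and 28 per cent.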

Each document within the Reuters-21578 corpus has a title, a content section, one or more cluster assignments and an indicator of length. The corpus is partitioned into these documents and their respective parts by specific text-based tags.
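A minimal sketch of how such tagged text might be broken into documents is given below. It assumes SGML-style tag names (REUTERS, TITLE, BODY, TOPICS, D) in the manner of the publicly distributed corpus files; the sample text is invented for illustration, the Document class and helper functions are hypothetical, and the length field is simply a word count of the body standing in for the length indicator.

```python
import re
from dataclasses import dataclass

# Invented, simplified sample in the style of the tagged corpus files (assumption).
SAMPLE = """<REUTERS LEWISSPLIT="TRAIN" NEWID="1">
<TOPICS><D>cocoa</D><D>coffee</D></TOPICS>
<TEXT>
<TITLE>EXAMPLE HEADLINE</TITLE>
<BODY>Example body text for illustration only.</BODY>
</TEXT>
</REUTERS>"""


@dataclass
class Document:
    title: str
    body: str
    categories: list[str]
    length: int  # word count of the body, used here as the length indicator


def first_match(pattern: str, text: str) -> str:
    """Return the first captured group for pattern, or an empty string."""
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else ""


def parse_documents(raw: str) -> list[Document]:
    """Split raw tagged text into documents and their parts."""
    docs = []
    for block in re.findall(r"<REUTERS[^>]*>(.*?)</REUTERS>", raw, re.DOTALL):
        title = first_match(r"<TITLE>(.*?)</TITLE>", block)
        body = first_match(r"<BODY>(.*?)</BODY>", block)
        topics = first_match(r"<TOPICS>(.*?)</TOPICS>", block)
        categories = re.findall(r"<D>(.*?)</D>", topics)
        docs.append(Document(title, body, categories, len(body.split())))
    return docs


docs = parse_documents(SAMPLE)
print(docs[0])
```

Regular expressions are enough for a sketch like this; a real loader would also need to handle character entities and documents that lack a body or category.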
