Monday, September 3, 2007

Preprocessing Reuters-21450 Dataset

The Reuters-21450 dataset is a text file with a number of tags identifying the individual parts of the each article. However, this quasi-structure is not separated by fixed part lengths - thus the need to apply logic to separate each article into individual text documents.
I was able to facilitate this process through Excel and VBA. I copied the Reuters document into Excel, and was able to process its structure via its unique tags. This allowed me to build a table with each documents unique parts across the rows of the table. I was then able to write the bodies of each off these documents to individual text documents, whose filename is its article ID number. This process is illustrated in the following picture.



The end result of the application of this process is a repository of 10,621 text files - each with its corresponding article ID as its filename.
This provides the basic data for the next part of preprocessing which includes parsing, the removal of stop words and the stemming of terms. Terms in my approach to text mining include both keywords and key phrases.
It is worth noting that the table created in this initial preprocessing phase will form the fundamental cornerstone for the clustering process - that is, it relates each document to its concept(s). It is used for the purpose of comparing clustering outcomes with human-defined reality.

No comments: