The statistical approach to "text mining" consists of transforming a collection of text documents into a matrix of numeric values to which we can apply machine learning algorithms.
The term "unstructured document" is often used when one talks about text documents. This does not mean that such documents lack organization (titles, chapters, paragraphs, questions and answers, etc.). It means, first of all, that we cannot directly express the collection as the kind of data table usually handled in data mining. To obtain this representation, a preprocessing phase is needed, after which we extract relevant features to define the data table. These steps can heavily influence the relevance of the results.
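The path from raw text to a data table can be illustrated with a minimal sketch. It is written in Python with the standard library only, purely for illustration; it is not the R, Knime or RapidMiner workflow of the tutorial. The stopword list, the crude suffix-stripping rule (a stand-in for a real stemmer) and the two sample sentences are assumptions made up for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "of", "in", "on", "and", "is", "to"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    counts = [Counter(crude_stem(t) for t in tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Two made-up sentences standing in for a document collection.
docs = [
    "Oil prices rise on the markets",
    "The markets and the price of oil",
]
vocab, dtm = document_term_matrix(docs)
print(vocab)  # ['market', 'oil', 'price', 'rise']
for row in dtm:
    print(row)  # [1, 1, 1, 1] then [1, 1, 1, 0]
```

Each row of the resulting table is a document and each column a term, which is the form a learning algorithm expects. Changing the stopword list or the stemming rule changes the columns, which is why the preprocessing choices weigh on the final results.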
In this tutorial, I take an exercise that I conduct with my students in my text mining course at the University. We perform the whole analysis in R with dedicated text mining packages such as “XML” and “tm”. The goal here is to carry out exactly the same study using other tools such as Knime 2.9.1 or RapidMiner 5.3 (note: these were the versions available when I wrote the French version of this tutorial in April 2014). We will see that these tools provide specialized libraries that make it possible to carry out a statistical text mining process efficiently.
Keywords: text mining, document classification, text categorization, decision tree, j48, linear svm, reuters collection, XML format, stemming, stopwords, document-term matrix
Tutorial: en_Tanagra_Text_Mining.pdf
Dataset: text_mining_tutorial.zip
References:
Wikipedia, "Document classification".
S. Weiss, N. Indurkhya, T. Zhang, "Fundamentals of Predictive Text Mining", Springer, 2010.