Saturday, September 17, 2016

Text mining - Document classification

The statistical approach of the "text mining" consists in to transform a collection of text documents in a matrix of numeric values on which we can apply machine learning algorithms.

The "unstructured document" designation is often used when one talks about text documents. This does not mean that he does not have a certain organization (titles, chapters, paragraphs, questions and answers, etc.). It shows first of all that we cannot express directly the collection in the form of a data table that is usually handled in data mining. To obtain this kind of data representation, a preprocessing phase is needed, then we extract relevant features to define the data table. These steps can influence heavily the relevance of the results.

In this tutorial, I take an exercise that I lead with my students for my text mining course at the University. We perform all the analysis under R with the dedicated packages for text mining such as “XML” or “tm”. The issue here is to perform exactly the study using other tools such as Knime 2.9.1 or RapidMiner 5.3 (Note: these are the versions available when I wrote the French version of this tutorial in April 2014). We will see that these tools provide specialized libraries which enable to perform efficiently a statistical text mining process.

Keywords: text mining, document classification, text categorization, decision tree, j48, lineat svm, reuters collection, XML format, stemming, stopwords, document-term matrix
Tutorial: en_Tanagra_Text_Mining.pdf
References :
Wikipedia, "Document classification".
S. Weiss, N. Indurkhya, T. Zhang, "Fundamentals of Predictive Text Mining", Springer, 2010.

Saturday, June 25, 2016

Image classification with Knime

The aim of image mining is to extract valuable knowledge from image data. In the context of supervised image classification, we want to assign automatically a label to image from their visual content. The whole process is identical to the standard data mining process. We learn a classifier from a set of classified images. Then, we can apply the classifier to a new image in order to predict its class membership. The particularity is that we must extract a vector of numerical features from the image before to launch the machine learning algorithm, and before to apply the classifier in the deployment phase.

We deal with an image classification task in this tutorial. The goal is to detect automatically the images which contain a car. The main result is that, even if I have a basic knowledge about the image processing, I can lead the analysis with a facility which is symptomatic of the usability of Knime in this context.

Keywords: image mining, image classification, image processing, feature extraction, decision tree, random forest, knime
Tutorial: en_Tanagra_Image_Mining_Knime.pdf
Dataset and program (Knime archive): image mining tutorial
Knime Image Processing,
S. Agarwal, A. Awan, D. Roth, « UIUC Image Database for Car Detection » ;

Sunday, June 19, 2016

Gradient boosting (slides)

The "gradient boosting" is an ensemble method that generalizes boosting by providing the opportunity of use other loss functions ("standard" boosting uses implicitly an exponential loss function).

These slides show the ins and outs of the method. Gradient boosting for regression is detailed initially. The classification problem is presented thereafter.

The solutions implemented in the packages for R and Python are studied.

Keywords: boosting, regression tree, package gbm, package mboost, package xgboost, R, Python, package scikit-learn, sklearn
Slides: Gradient Boosting
R. Rakotomalala, "Bagging, Random Forest, Boosting", December 2015.
Natekin A., Knoll A., "Gradient boosting machines, a tutorial", in Frontiers in Neurorobotics, December 2013. 

Monday, June 13, 2016

Tanagra and Sipina add-ins for Excel 2016

The add-ins “tangra.xla” and “sipina.xla” are greatly involved in popularity of Tanagra and Sipina software applications. They incorporate menus dedicated to data mining in Excel. They implement a simple bridge between the data into the spreadsheet and Tanagra or Sipina.

I developed and tested the latest add-ins versions for Excel 2007 and 2010. I had access recently to Excel 2016. I checked the add-ins. The conclusion is that the tools work without a hitch.

Keywords: data importation, excel data file, add-in, add-on, xls, xlsx
Lien : en_Tanagra_Add_In_Excel_2016.pdf
Tanagra, "Tanagra add-in for Excel 2007 and 2010", August 2010.
Tanagra, "Sipina add-in for Excel 2007 and 2010", June 2016.

Sunday, June 12, 2016

Sipina add-in for Excel 2007 and 2010

SIPINA is a Data Mining Software which implements various supervised learning paradigms. This is an old tool but it is still used because this is the only free tool which provides fully functional interactive decision tree capabilities.

This tutorial briefly describes the installation and the use of the add-in "sipina.xla" into Excel 2007. The approach is easily generalized to Excel 2010. A similar document exists for Tanagra . It seemed to me nevertheless necessary to clarify the procedure, especially because several users have made the request. Other tutorials exist for earlier versions of Excel (1997-2003)  and for Calc (Libre Office and Open Office).

A new tutorial will come soon. It shows that the add-in operates properly also under Excel 2016.

Keywords: data importation, excel data file, add-in, add-on, xls, xlsx
Tutorial: en_sipina_excel_addin.pdf
Dataset: heart.xls
Tanagra, "Tanagra add-in for Office 2007 and Office 2010", august 2010.
Tanagra, "Tanagra and Sipina add-ins for Excel 2016", June 2016.

Sunday, April 3, 2016

Categorical predictors in logistic regression

The aim of the logistic regression is to build a model for predicting a binary target attribute from a set of explanatory variables (predictors, independent variables), which are numeric or categorical. They are treated as such when they are numeric. We must recode them when they are categorical. The dummy coding is undeniably the most popular approach in this context.

The situation becomes more complicated when we perform a feature selection. The idea is to determine the predictors that contribute significantly to the explanation of the target attribute. There is no problem when we consider a numeric variable. It is either excluded or either kept in the model. But how to proceed when we handle a categorical explanatory variable? Should we treat the dichotomous variables associated to a categorical predictor as a whole that we must exclude or include into the model? Or should we treat the each dichotomous variable independently? How to interpret the coefficients of the selected dichotomous variables in this case?

In this tutorial, we study the approaches proposed by various tools: R 3.1.2, SAS 9.3, Tanagra 1.4.50 and SPAD 8.0. We will see that feature selection algorithms rely on specific criteria according to the software. We will see also that they use different approaches when we are in the presence of the categorical predictor variables.

Keywords: logistic regression, dummy coding, categorical predictor variables, feature selection
Tutorial: Feature selection - Categorical predictors - Logistic Regression
Dataset: heart-c.xlsx 
Wikipedia, "Logistic Regression"

Thursday, March 31, 2016

Dummy coding for categorical predictor variables

In this tutorial, we show how to perform a dummy coding for categorical predictor variables in the context of the logistic regression learning process.

In fact, this is an old tutorial that I was written a long time ago (2007), but it is not referenced in this blog (which was created in 2008). I found it in my archives because I plan to write soon a tutorial about the strategies for the selection of categorical variables in logistic regression. I was wondering if I had already written something that may be linked to this subject (the treatment of the categorical predictors in logistic regression) in the past. Obviously, I would have to check most often my archives.

We use Tanagra 1.4.50 in this tutorial.

Keywords: logistic regression, dummy coding, categorical predictor variables
Tutorial: Dummy coding - Logistic Regression
Dataset: heart-c.xlsx 
Wikipedia, "Logistic Regression"

Sunday, March 13, 2016

Cost-Sensitive Learning (slides)

This course material presents approaches for the consideration of misclassification costs in supervised learning. The baseline method is the one for which we do not take into account the costs.

Two issues are studied : the metric used for the evaluation of the classifier when a misclassification cost matrix is provided i.e. the expected cost of misclassification (ECM); some approaches which enable to guide the machine learning algorithm towards the minimization of the ECM.

Keywords: cost matrix, misclassification, expected cost of misclassification, bagging, metacost, multicost
Slides: Cost Sensitive Learning
Tanagra Tutorial, "Cost-senstive learning - Comparison of tools", March 2009.
Tanagra Tutorial, "Cost-sensitive decision tree", November 2008.

Thursday, March 3, 2016

Hyper-threading and solid-state drive

After more than 6 years of good and faithful service, I decided to change my computer. It must be said that the former (Intel Core 2 Quad Q9400 2.66 Ghz - 4 cores - running Windows 7 - 64 bit) began to make disturbing sounds. I am obliged to put music to cover the rumbling of the beast and be able to work quietly.

The choice of the new computer was another matter. I spent the age of the race to the power which is necessarily fruitless anyway, given the rapid evolution of PCs. Nevertheless, I was sensitive to two aspects that I could not evaluate previously: The hyper-threading  technology is effective in programming multithreaded algorithms of data mining? The use of temporary files to relieve the memory occupation takes advantage of SSD  disk technology?

The new PC runs under Windows 8.1 (I wrote the French version of this tutorial one year ago). The processor is a Core I7 4770S (3.1 Ghz). It has 4 physical cores but 8 logical cores with the hyper-threading technology. The system disk is a SSD. These characteristics allows evaluate to their influences on (1) the implementation of multithreaded version of the linear discriminant analysis described in a previous paper (“Load balanced multithreading for LDA”, September 2013), where the number of threads used can be specified by the user; (2) the use of temporary files for the induction of decision trees algorithm, which enables us to handle very large dataset (“Dealing with very large dataset in Sipina”, January 2010; up to 9,634,198 instances and 41 variables).

In this tutorial, we reproduce the two studies using the SIPINA software. Our goal is to evaluate the behavior of these solutions (multi-threaded implementation, copy of data into temporary files to alleviate the memory occupation) on our new machine which, due to its characteristics, should expressly take advantage of them.

Keywords:  hyper-threading, ssd disk, solid-state drive, multithread, multithreading, very large dataset, core i7, sipina, decision tree, linear discriminant analysis, lda
Tutorial: en_Tanagra_Hyperthreading.pdf
Tanagra Tutorial, "Load balanced multithreading for LDA", September 2013.
Tanagra Tutorial, "Dealing with very large dataset in Sipina", January 2010.
Tanagra Tutorial, "Multithreading for decision tree induction", November 2010.

Monday, January 4, 2016

Tanagra website statistics for 2015

The year 2015 ends, 2016 begins. I wish you all a very happy year 2016.

A small statistical report on the website statistics for the past year. All sites (Tanagra, course materials, e-books, tutorials) has been visited 255,386 times this year, 700 visits per day.

Since February, the 1st, 2008, the date from which I installed the Google Analytics counter, there are 1,848,033 visits (639 visits per day).

Who are you? The majority of visits come from France and Maghreb. Then there are a large part of French speaking countries, notably because some pages are exclusively in French. In terms of non-francophone countries, we observe mainly the United States, India, UK, Germany, Brazil, ...

Which pages are visited? The pages that are most successful are those that relate to documentation about the Data Science: course materials, tutorials, links to other documents available on line, etc.. This is not really surprising. I take more time myself to write booklets and tutorials, to study the behavior of different tools, of which Tanagra.

Happy New Year 2016 to all.

Slideshow: Website statistics for 2015