Saturday, August 27, 2011

Data Mining with R - The Rattle Package

R (http://www.r-project.org/) is one of the most exciting free data mining software projects of recent years. Its popularity is fully justified (see Kdnuggets Polls - Data Mining / Analytic Tools Used - 2011). Among the reasons which explain this success, two characteristics stand out: (1) we can extend the features of the tool almost indefinitely through packages; (2) we have a programming language which allows us to easily chain together complex sequences of operations.

But this second property can also be a drawback. Indeed, some users do not want to learn a new programming language before being able to carry out their projects. For this reason, tools which allow the user to define a sequence of commands with diagrams (such as Tanagra, Knime, RapidMiner, etc.) remain a valuable alternative for data miners.

In this tutorial, we present the "rattle" package, which allows data miners to use R without needing to know the underlying programming language. All the operations are performed with simple clicks, as in any menu-driven software. In addition, all the generated R commands are recorded: we can save them in a file and, in a new working session, easily repeat all the operations. We thus recover one of the important properties that menu-driven tools usually lack.

To describe the use of the rattle package, we perform an analysis similar to the one suggested by the package's author in his presentation paper (G.J. Williams, "Rattle: A Data Mining GUI for R", The R Journal, volume 1/2, pages 45-55, December 2009). We perform the following steps: loading the data file; partitioning the instances into training and test samples; specifying the types of the variables (target or input); computing some descriptive statistics; learning the predictive models from the training sample; assessing the models on the test sample (confusion matrix, error rate, various curves).
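As a minimal sketch, a working session boils down to three instructions in the R console; everything afterwards is done through the graphical interface (install.packages() is only needed the first time):

# Minimal sketch: install (once), load, then launch the Rattle GUI;
# all subsequent operations (loading the data file, partitioning,
# modeling, assessment) are performed through the interface
install.packages("rattle")
library(rattle)
rattle()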

Keywords: R software, R project, rpart, random forest, glm, decision tree, classification tree, logistic regression
Tutorial: en_Tanagra_Rattle_Package_for_R.pdf
Dataset: heart_for_rattle.txt
References:
Togaware, "Rattle"
CRAN, "Package rattle - Graphical user interface for data mining in R"
G.J. Williams, "Rattle: A Data Mining GUI for R", in The R Journal, Vol. 1/2, pages 45-55, December 2009.

Monday, August 22, 2011

Predictive model deployment with R (filehash)

Model deployment is the last step of the data mining process. It covers several aspects, e.g. generating a report about the data exploration process, highlighting the useful results; applying the model within the organization's decision-making process; etc.

In this tutorial, we consider the context of predictive data mining. We are concerned with the construction of the model from a labeled dataset; the storage of the model; the distribution of the model, without the dataset used for its construction; and the application of the model to new instances in order to assign them a class label based on their description (the values of the descriptors).

We describe the filehash package for R, which makes it easy to deploy a model. The main advantage of this solution is that R runs under various operating systems: we can create a model with R under Windows and apply it in another environment, for instance with R under Linux. The solution can easily be generalized on a large scale because R can be launched in batch mode. Future updates of the system will only concern the model file.

We write three R programs to distinguish the steps of the deployment process. The first one constructs a model from the dataset and stores it in a binary file (filehash format). The second one loads the model in another R session and uses it to label new instances from a second data file; the predictions are stored in a data file (CSV format). Finally, the third program loads the predictions and another data file containing the observed labels for these instances, and computes the confusion matrix and the generalization error rate.
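As a hedged sketch of the first two programs (the file names, the "diabete" target and the data frames are illustrative, not the exact ones of the tutorial; dbCreate(), dbInit(), dbInsert() and dbFetch() are the basic filehash primitives):

# --- Program 1: build the model and store it (filehash format) ---
library(rpart)
library(filehash)
train <- read.table("pima-train.txt", header = TRUE)         # illustrative file name
model <- rpart(diabete ~ ., data = train, method = "class")  # target name assumed
dbCreate("model.db")                  # create the key-value database on disk
db <- dbInit("model.db")
dbInsert(db, "model", model)          # store the model under the key "model"

# --- Program 2: in another session, load the model and label new instances ---
library(rpart)
library(filehash)
db <- dbInit("model.db")
model <- dbFetch(db, "model")         # retrieve the model, without the training data
unseen <- read.table("pima-unseen.txt", header = TRUE)       # unlabeled instances
pred <- predict(model, newdata = unseen, type = "class")
write.csv(data.frame(prediction = pred), file = "predictions.csv", row.names = FALSE)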

We use various predictive models in order to check the flexibility of the solution. We tried the following: decision tree (rpart); logistic regression (glm); linear discriminant analysis (lda); linear discriminant analysis on the factors of a principal component analysis (lda + pca). This last one allows us to check whether the system remains operational when we manipulate a combination of models, as sketched below.
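For this combination, both fitted objects can be stored together as a single filehash entry. A minimal sketch, reusing the database and data frames above (the target is assumed to be the last column of the training set):

# Learning: PCA on the descriptors, then LDA on the first factors
library(MASS)
pca <- princomp(train[ , -ncol(train)], cor = TRUE)   # target assumed in last column
factors <- as.data.frame(pca$scores[ , 1:2])          # keep the first two factors
factors$diabete <- train$diabete
model.lda <- lda(diabete ~ ., data = factors)
dbInsert(db, "combined", list(pca = pca, lda = model.lda))  # one entry, two objects

# Deployment: project the new instances, then classify them
combined <- dbFetch(db, "combined")
new.factors <- as.data.frame(predict(combined$pca, newdata = unseen)[ , 1:2])
pred <- predict(combined$lda, newdata = new.factors)$class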

Keywords: R software, filehash package, deployment, predictive model, rpart, lda, pca, glm, decision tree, linear discriminant analysis, logistic regression, principal component analysis, linear discriminant analysis on latent variables
Tutorial: en_Tanagra_Deploying_Predictive_Models_with_R.pdf
Dataset: pima-model-deployment.zip
References:
R package, "filehash: Simple key-value database"
Kdnuggets, "Data mining deployment Poll"

Thursday, August 18, 2011

REGRESS into the SIPINA package

Few people know it, but several tools are installed when we launch the SETUP file of SIPINA (setup_stat_package.exe). This is the case of REGRESS, which is dedicated to multiple linear regression.

Even if a multiple linear regression procedure is already incorporated into Tanagra, REGRESS remains useful, essentially because it is very easy to handle while covering the material of a standard econometrics course. As such, it may suit anyone wishing to learn about regression without investing too much effort in learning a new piece of software.

Keywords: regress, econometrics, multiple linear regression, outliers, influential points, normality tests, residuals, Jarque-Bera test, normal probability plot, sipina.xla, add-in
Tutorial: en_sipina_regress.pdf
Dataset: ventes-regression.xls
References:
R. Rakotomalala, "Econométrie - Régression Linéaire Simple et Multiple".
D. Garson, "Multiple regression".

Sunday, August 14, 2011

PLS Regression - Software comparison

Comparing the behavior of tools is always a good way to improve them.

To check and validate the implementation of methods. The validation of the implemented algorithms is an essential point for data mining tools. Even if two programmers use the same references (books, articles), programming choices can modify the behavior of the approach (for instance, the interpretation of the convergence conditions). Analyzing the source code is a possible solution. But while the source is often available for free software, this is not the case for commercial tools. Thus, the only way to check them is to compare the results provided by the tools on a benchmark dataset. If there are divergences, we must explain them by analyzing the formulas used.

To improve the presentation of results. There are certain standards to observe in the production of reports, a consensus initiated by reference books and/or the leading tools in the field. Some ratios should be presented in a certain way. Users need reference points.

Our implementation of the PLS approach is based on the Tenenhaus book (1998), which itself refers to the SIMCA-P tool. Having access to a limited version of this software (version 11), we checked the results provided by Tanagra on various datasets. We show here the results of the study on the CARS dataset. We also extend the comparison to other data mining tools.
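On the R side, the comparison relies on the pls package. A minimal sketch of such a run (the file name, the "price" target and the number of components are assumptions, not the exact settings of the tutorial):

library(pls)
cars <- read.table("cars.txt", header = TRUE, sep = "\t")   # assumed text export of the xls file
model <- plsr(price ~ ., data = cars, ncomp = 2, validation = "none")
summary(model)            # variance explained per component, to compare across tools
coef(model, ncomp = 2)    # regression coefficients with two components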

Keywords: pls regression, software comparison, simca-p, spad, sas, r software, pls package
Components: PLSR, VIEW DATASET, CORRELATION SCATTERPLOT, SCATTERPLOT WITH LABEL
Tutorial: en_Tanagra_PLSR_Software_Comparison.pdf
Dataset: cars_pls_regression.xls
References:
M. Tenenhaus, "La régression PLS – Théorie et pratique", Technip, 1998.
D. Garson, "Partial Least Squares Regression", from Statnotes: Topics in Multivariate Analysis.
UMETRICS. SIMCA-P for Multivariate Data Analysis.

Saturday, August 6, 2011

The CART method under Tanagra and R (rpart)

CART (Breiman et al., 1984) is a very popular classification tree (also called decision tree) learning algorithm. Rightly so: CART incorporates all the ingredients of a well-controlled learning process. The post-pruning process enables the trade-off between bias and variance; the cost-complexity mechanism "smooths" the exploration of the space of solutions; we can control the preference for simplicity with the standard error rule (SE-rule); etc. Thus, the data miner can adjust the settings according to the goal of the study and the characteristics of the data.

Breiman's algorithm is provided under different names in the free data mining tools. Tanagra uses the name "C-RT". R, through a dedicated package, provides the "rpart" function.

In this tutorial, we describe these implementations of the CART approach with reference to the original book (Breiman et al., 1984; chapters 3, 10 and 11). The main difference between them is the implementation of the post-pruning process: Tanagra uses a separate sample, called the "pruning set" (section 11.4), whereas rpart relies on cross-validation (section 11.5).
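A hedged sketch of the rpart side (the file name and the "class" target are assumptions for the waveform data; the cptable, printcp() and prune() are the documented rpart mechanisms):

library(rpart)
wave <- read.table("wave5300.txt", header = TRUE)   # assumed text export of wave5300.xls
fit <- rpart(class ~ ., data = wave, method = "class",
             control = rpart.control(xval = 10, cp = 0))   # full tree, 10-fold CV
printcp(fit)                        # cross-validated error for each candidate subtree
# 1-SE rule: smallest tree whose CV error is within one SE of the minimum
tab <- fit$cptable
best <- which.min(tab[ , "xerror"])
threshold <- tab[best, "xerror"] + tab[best, "xstd"]
cp.se <- tab[which(tab[ , "xerror"] <= threshold)[1], "CP"]
pruned <- prune(fit, cp = cp.se)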

Keywords: decision tree, classification tree, recursive partitioning, cart, R software, rpart package
Components: DISCRETE SELECT EXAMPLES, C-RT, SUPERVISED LEARNING, TEST
Tutorial: en_Tanagra_R_CART_algorithm.pdf
Dataset: wave5300.xls
References:
L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees, Chapman & Hall, 1984.
"The R project for Statistical Computing" - http://www.r-project.org/