Tanagra - Data Mining and Data Science Tutorials: September 2011

Sunday, September 25, 2011

A PRIORI PT updated

A PRIORI PT is a tool dedicated to the extraction of association rules. It is one of the few Tanagra components based on an external library: we use Borgelt's "apriori.exe" program. Up to Tanagra 1.4.40, we used version 4.31 of "apriori.exe". From Tanagra 1.4.41 onward, we use the latest update, version 5.57 (2011/09/02). Although the settings of the tool have changed slightly, the extracted rules and the way the results are read remain identical.

We revisit a former tutorial to describe the behavior of this component (Association Rule Learning using A PRIORI PT), so we do not detail the construction of the diagram here. Our main aim is to highlight the improvement of the library, especially in computation time, which is really impressive.
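
For readers who have not met association rules before, here is a minimal sketch of what such a tool computes: frequent itemsets found level by level, then rules filtered on confidence. It is a naive pure-Python illustration, not Borgelt's implementation and not the Tanagra component; the toy transactions and the function names are assumptions made up for the example.

```python
from itertools import combinations

# Toy transactions (hypothetical data, not the assoc_census dataset).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: size-k candidates come from frequent size-(k-1) sets."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    while level:
        kept = {}
        for candidate in level:
            s = support(candidate, transactions)
            if s >= min_support:
                kept[candidate] = s
        frequent.update(kept)
        # Join step: union pairs of frequent sets that differ by one item.
        level = list({a | b for a, b in combinations(kept, 2)
                      if len(a | b) == len(a) + 1})
    return frequent


def rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) for each rule kept."""
    out = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), r)):
                # Every subset of a frequent itemset is frequent, so the lookup is safe.
                conf = sup / frequent[antecedent]
                if conf >= min_confidence:
                    out.append((antecedent, itemset - antecedent, sup, conf))
    return out
```

Calling `rules(frequent_itemsets(TRANSACTIONS, 0.5), 0.7)` lists the rules whose support is at least 50% and confidence at least 70%. A production tool uses far more efficient data structures, which is where the computation-time gains discussed here come from.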

Keywords: association rule, large dataset
Components: A priori PT
Tutorial: en_Tanagra_AprioriPT_Updated.pdf
Dataset: assoc_census.zip
Reference: C. Borgelt, "A priori - Association Rule Induction / Frequent Item Set Mining"

Thursday, September 22, 2011

Tanagra - Version 1.4.41

A PRIORI PT. This component generates association rules. It is based on Borgelt's apriori.exe program, which was recently updated (version 5.57, 2011/09/02). The improvement in computation time with this new version is impressive.

FREQUENT ITEMSETS. Also based on Borgelt's apriori.exe program (version 5.57), this component generates frequent itemsets (or closed, maximal, or generator itemsets).
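
The difference between these kinds of itemsets is easier to see on a tiny example. The sketch below uses the usual textbook definitions (an assumption: the exact conventions of apriori.exe may differ, for instance in how the empty set is handled) and a brute-force enumeration that is only suitable for toy data. The transactions and function names are invented for the illustration.

```python
from itertools import combinations

# Toy transactions (hypothetical).
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b"},
    {"a"},
    {"d"},
]


def frequent_itemsets(transactions, min_count):
    """Brute force: count every non-empty item combination (toy data only)."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = frozenset(combo)
            count = sum(1 for t in transactions if s <= t)
            if count >= min_count:
                result[s] = count
    return result


def closed(frequent):
    """Closed: no proper superset has the same support count."""
    return {s for s, c in frequent.items()
            if not any(s < t and frequent[t] == c for t in frequent)}


def maximal(frequent):
    """Maximal: no proper superset is frequent at all."""
    return {s for s in frequent if not any(s < t for t in frequent)}


def generators(frequent):
    """Generator: no proper non-empty subset has the same support count."""
    return {s for s, c in frequent.items()
            if not any(t < s and frequent[t] == c for t in frequent)}
```

With a minimum count of 2, the seven frequent itemsets reduce to three closed sets ({a}, {a, b}, {a, b, c}), one maximal set ({a, b, c}) and three generators ({a}, {b}, {c}).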

Tutorials describing the use of these new tools are coming soon.

Download page: setup

Tuesday, September 20, 2011

New GUI for RapidMiner 5.0

RapidMiner is a very popular data mining tool. It is one of the most used by data miners according to the annual KDnuggets polls (2011, 2010, 2009, 2008, 2007). There are two versions. We describe here the Community Edition, which is freely downloadable from the publisher's website.

RapidMiner 5.0 has a new graphical user interface which is very similar to that of Knime. The organization of the workspace is the same. The sequence of data processing steps (using operators) is described by a diagram, called a "process" in the RapidMiner documentation. In fact, version 5.0 adopts the presentation used by the vast majority of data mining software. Some features are shared with many tools, among others: the connection to the R software; meta-nodes, which implement a loop or a standard sequence of operations; and the description of the method underlying each operator, which is permanently displayed in the right part of the main window.

RapidMiner 5.0 has evolved substantially compared with previous versions (e.g. version 4.6, described in one of our tutorials), so I thought it appropriate to study it in detail by evaluating its behavior in the context of a standard data mining analysis. We want to implement the following process: (1) creating a decision tree from a labeled dataset; (2) exporting the model (the classification tree) into an external file (PMML format) for later deployment; (3) assessing the model performance using a cross-validation resampling scheme; (4) applying the model to a set of unlabeled instances and exporting the results, i.e. the values of the descriptors and the assigned class, into a CSV file. These are standard data mining tasks, and we have described them in many tutorials. We want to check whether they are easy to implement with this new version of RapidMiner. Indeed, with the previous version, defining some sequences of operations was complicated. Implementing a cross-validation, for instance, was not really intuitive.
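
As a reminder of what step (3) involves, whatever the tool, here is a minimal k-fold cross-validation sketch in plain Python. It is not RapidMiner code. To keep it short, the learner is a stand-in that always predicts the majority class rather than a decision tree; the function names and the `fit`/`predict` calling convention are assumptions made for this illustration.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_majority(X, y):
    """Stand-in learner: remembers the most frequent class (not a decision tree)."""
    return Counter(y).most_common(1)[0][0]


def predict_majority(model, X):
    """Predict the remembered class for every row."""
    return [model for _ in X]


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Return the error rate pooled over the k held-out folds."""
    folds = k_fold_indices(len(X), k, seed)
    errors = 0
    for held_out in folds:
        test = set(held_out)
        train = [i for i in range(len(X)) if i not in test]
        # Fit on the k-1 other folds, then score on the held-out fold only.
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in held_out])
        errors += sum(p != y[i] for p, i in zip(preds, held_out))
    return errors / len(X)
```

Any learner exposing the same `fit(X, y)` and `predict(model, X)` pair could be plugged in. The key point is that each instance is predicted by a model that never saw it during training.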

Keywords: rapidminer, knime, cross-validation, decision tree, classification tree, deployment
Tutorial: en_Tanagra_RapidMiner_5.pdf
Dataset: adult_rapidminer.zip
References:
Rapid-I, "RapidMiner"
Knime, "Knime Desktop"

Monday, September 19, 2011

Regression model deployment

Model deployment is one of the main objectives of the data mining process. We want to apply a model learned on a training set to unseen cases, i.e. any individual from the population. In the classification framework, the aim is to assign each instance its class value from its description [e.g. Apply a classifier on a new dataset (Deployment)]. In the clustering framework, we try to detect the group which is most similar to the instance according to its characteristics (e.g. K-Means - Classification of a new instance).

We are concerned with the regression framework here. The aim is to predict the values of the dependent variable for unseen (or unlabeled) instances from the observed values of the independent variables. The process is rather basic for a linear regression model: we apply the estimated parameters to the unseen instances. But it becomes difficult when we want to handle more complex models, such as support vector regression with nonlinear kernels, or models built from a combination of techniques (e.g. regression on the factors of a principal component analysis). In this context, it is essential that the deployment process is handled directly by the data mining tool.
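
For the linear case, the deployment really is as basic as stated. The sketch below applies an intercept and coefficients to unseen rows and writes the predictions as CSV. The variable names ("rooms", "age") and the parameter values are hypothetical; they are not taken from housing.xls or from any fitted model.

```python
import csv
import io


def predict_linear(intercept, coefficients, row):
    """Apply fitted parameters: y_hat = intercept + sum(coef_j * x_j)."""
    return intercept + sum(coefficients[name] * row[name] for name in coefficients)


def deploy(intercept, coefficients, rows, out):
    """Write each unseen row plus its predicted value as CSV to file-like `out`."""
    fields = list(coefficients) + ["prediction"]
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        record = {name: row[name] for name in coefficients}
        record["prediction"] = predict_linear(intercept, coefficients, row)
        writer.writerow(record)


# Hypothetical parameters and unseen instances, for illustration only.
INTERCEPT = 10.0
COEFFICIENTS = {"rooms": 2.0, "age": -0.5}
UNSEEN = [{"rooms": 6, "age": 40}, {"rooms": 4, "age": 10}]

buffer = io.StringIO()
deploy(INTERCEPT, COEFFICIENTS, UNSEEN, buffer)
print(buffer.getvalue())
```

Nothing comparable exists for a kernel SVR or a PCA-based regression, where the prediction depends on support vectors or on a projection learned from the training data. That is why the tool itself has to carry out the deployment.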

With Tanagra, it is possible to easily deploy regression models, even when they result from a combination of techniques. We simply have to prepare the data file in a particular way. In this tutorial, we describe how to organize the data file in order to deploy various models in a unified framework: a linear regression model, a PLS regression model, a support vector regression model with an RBF (radial basis function) kernel, a regression tree model, and a regression model built on the factors of a principal component analysis. Then, we export the results (the predicted values of the dependent variable) to a new data file. Last, we check whether the values predicted by the various models are similar.
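
One simple way to carry out the last check is to compute the linear correlation between the prediction columns of each pair of models. The sketch below does this in plain Python. It illustrates the idea and is not the LINEAR CORRELATION component itself; the model names and values in the usage note are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def correlation_matrix(predictions):
    """Map each (model_a, model_b) pair to the correlation of their predictions."""
    names = list(predictions)
    return {(a, b): pearson(predictions[a], predictions[b])
            for a in names for b in names}
```

For example, `correlation_matrix({"linear": [1, 2, 3], "tree": [1, 2, 4]})` returns a value close to 1 for the ("linear", "tree") pair, meaning the two hypothetical models rank and scale the instances in nearly the same way.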

Keywords: model deployment, linear regression, pls regression, support vector regression, SVR, regression tree, cart, principal component analysis, pca, regression of factor scores
Components: MULTIPLE LINEAR REGRESSION, PLS REGRESSION, PLS SELECTION, C-RT REGRESSION TREE, EPSILON SVR, PRINCIPAL COMPONENT ANALYSIS, RECOVER EXAMPLES, EXPORT DATASET, LINEAR CORRELATION
Tutorial: en_Tanagra_Multiple_Regression_Deployment.pdf
Dataset: housing.xls
References:
R. Rakotomalala, Régression linéaire multiple - Diaporama (in French)