Tanagra - Data Mining and Data Science Tutorials: December 2008

Friday, December 26, 2008

Logistic regression - Software comparison

Logistic regression is a popular supervised learning method.

There are several reasons for this. The theoretical foundation of the method is attractive: it fits within the framework of the generalized linear model, so logistic regression is a well-identified variant that one chooses according to the type of the dependent variable (class attribute). Its predictive performance is comparable to that of other approaches. Furthermore, it provides indicators for interpreting the results. Among them, the well-known odds ratio makes it possible to quantify the contribution of each predictor.
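To illustrate the odds ratio mentioned above, here is a minimal sketch. The intercept and coefficient are made-up values, not results from the tutorial. It checks that, for a model of the form logit(p) = intercept + coef * x, increasing x by one unit multiplies the odds by exp(coef).

```python
import math

def logistic(z):
    """Logistic function: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds associated with a probability p (0 < p < 1)."""
    return p / (1.0 - p)

# Hypothetical fitted model (illustrative values only).
intercept, coef = -1.0, 0.7

# Predicted probabilities for x = 2 and x = 3 (a one-unit increase).
p_before = logistic(intercept + coef * 2.0)
p_after = logistic(intercept + coef * 3.0)

# The ratio of the two odds equals exp(coef), whatever the starting x.
odds_ratio_from_probs = odds(p_after) / odds(p_before)
odds_ratio_from_coef = math.exp(coef)
```

An odds ratio above 1 means the predictor increases the odds of the positive class; below 1, it decreases them.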

Logistic regression is available in many free tools. In this tutorial, we compare the implementations of this technique in Tanagra 1.4.27, R 2.7.2 (glm command), Orange 1.0b2, Weka 3.5.6, and the RWeka 0.3-13 package for R. Beyond the comparison, this tutorial is also an opportunity to show how to carry out the following sequence of operations with these tools: importing an ARFF file (Weka file format); splitting the data into a learning set and a test set; computing the predictive model on the learning set; testing the model on the test set; selecting the relevant variables using a criterion consistent with logistic regression; and evaluating the performance of the simplified model again.
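The core of this sequence (split, fit on the learning set, measure the error on the test set) can be sketched in plain Python. This is only an illustration under stated assumptions: the data are synthetic, the model is fitted by a simple batch gradient descent rather than by the algorithms used in the tools above, and the variable selection step is left out.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, xi):
    """Linear score: intercept w[0] plus weighted sum of the predictors."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))

def fit_logistic(X, y, lr=0.1, epochs=300):
    """Fit weights (intercept first) by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(score(w, xi)) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def error_rate(w, X, y):
    """Proportion of instances whose predicted class differs from the true one."""
    wrong = sum((1 if score(w, xi) > 0 else 0) != yi for xi, yi in zip(X, y))
    return wrong / len(y)

# Synthetic data (an assumption): x1 determines the class, x2 is irrelevant.
rng = random.Random(0)
X = [[rng.uniform(-3, 3), rng.uniform(-3, 3)] for _ in range(300)]
y = [1 if x1 > 0 else 0 for x1, _ in X]

# Random split: 200 instances for learning, 100 for testing.
indices = list(range(len(X)))
rng.shuffle(indices)
train_idx, test_idx = indices[:200], indices[200:]
X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
X_test, y_test = [X[i] for i in test_idx], [y[i] for i in test_idx]

# Fit on the learning set only, then evaluate on the unseen test set.
w = fit_logistic(X_train, y_train)
test_error = error_rate(w, X_test, y_test)
```

Keeping the test set out of the fitting step is what makes the measured error rate an honest estimate of the performance on new data.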

Keywords: logistic regression, supervised learning, software comparison
Components: BINARY LOGISTIC REGRESSION, SUPERVISED LEARNING, TEST, DISCRETE SELECT EXAMPLES
Tutorial: en_Tanagra_Perfs_Reg_Logistique.pdf
Dataset: wave_2_classes_with_irrelevant_attributes.zip
References:
D. Garson, "Logistic Regression"
Wikipedia, "Logistic Regression"

Tuesday, December 23, 2008

Association rule mining - Software comparison

This document extends a previous tutorial dedicated to the comparison of various implementations of association rule mining (http://data-mining-tutorials.blogspot.com/2008/10/association-rule-learning.html). There we analyzed Tanagra, Orange and Weka. Here we extend the comparison to R, RapidMiner and Knime.

We handle an attribute-value dataset. This is not the usual data format for association rule mining, where the "native" format is rather the transactional database. We see in this tutorial that some of the tools can automatically recode the data, while others require an explicit transformation. In the latter case, we must find the right components and the correct sequence of operations to produce the transactional data format. Depending on the software, the process is not always easy.
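As an illustration of this recoding, the sketch below turns attribute-value records into transactions whose items are "attribute=value" pairs, then computes the support of an itemset. The attribute names and values are hypothetical and are not taken from the tutorial's dataset; each tool has its own way of performing this step.

```python
def to_transactions(rows):
    """Recode attribute-value records as sets of 'attribute=value' items."""
    return [frozenset(f"{attr}={val}" for attr, val in row.items()) for row in rows]

def support(itemset, transactions):
    """Proportion of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Hypothetical attribute-value records (one dict per instance).
rows = [
    {"housing": "own", "job": "skilled"},
    {"housing": "rent", "job": "skilled"},
    {"housing": "own", "job": "unskilled"},
    {"housing": "own", "job": "skilled"},
]

transactions = to_transactions(rows)

# Support of the itemset, and confidence of the rule housing=own -> job=skilled.
support_both = support({"housing=own", "job=skilled"}, transactions)
confidence = support_both / support({"housing=own"}, transactions)
```

Because every instance has exactly one value per attribute, each transaction contains exactly as many items as there are attributes.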

The tools studied in this tutorial are: Tanagra 1.4.28, R 2.7.2 (arules package 0.6-6), Orange 1.0b2, RapidMiner Community Edition, Knime 1.3.5 and Weka 3.5.6. These programs load the data and perform the calculations in memory. When the size of the database increases, the real bottleneck is the memory available on our personal computer.

Keywords: association rule, frequent itemset
Components: A PRIORI, A PRIORI PT
Tutorial: en_Tanagra_Assoc_Rules_Comparison.pdf
Dataset: credit-german.zip
References:
R. Rakotomalala, « Règles d’association »
Wikipedia, "Association rule learning"

Saturday, December 20, 2008

K-Means - Classification of a new instance

Deployment is an important step of the data mining process. In the case of clustering, once the clusters have been built with a learning algorithm, we want to determine to which cluster (group) a new unlabelled instance belongs.

In this tutorial, we use the K-Means algorithm. We assign each new instance to the closest group, using the distance to the group centers. This approach is sound because the technique used to assign a group in the deployment phase is consistent with the learning algorithm. This would not hold with another learning algorithm, e.g. HAC (hierarchical agglomerative clustering) with the single linkage aggregation rule, for which the distance to the group centers is inadequate. Thus, the classification strategy must be consistent with the learning strategy.
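The assignment rule can be sketched as follows, assuming Euclidean distance and group centers that are already known. The coordinates are invented for the example; in the tutorial they would come from the K-Means run on the learning data.

```python
import math

def assign_cluster(instance, centers):
    """Return the index of the center closest to the instance (Euclidean distance)."""
    distances = [math.dist(instance, center) for center in centers]
    return min(range(len(centers)), key=distances.__getitem__)

# Hypothetical group centers obtained in the learning phase.
centers = [(0.0, 0.0), (5.0, 5.0), (-4.0, 3.0)]

# New unlabelled instances, described in the same space as the centers.
cluster_a = assign_cluster((4.2, 3.9), centers)
cluster_b = assign_cluster((-3.0, 2.0), centers)
```

The new instance must be expressed in the same coordinate system as the centers, so any transformation applied before learning has to be applied to it as well.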

All the descriptors in our dataset are discrete. The K-Means algorithm does not handle this kind of data directly, so we must transform them first. We use multiple correspondence analysis for this purpose.
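Multiple correspondence analysis starts from an indicator (complete disjunctive) table, with one 0/1 column per category of each attribute. The sketch below shows only this coding step, on made-up records. The factor extraction itself, which produces the continuous coordinates passed to K-Means, is not shown.

```python
def indicator_coding(rows, attributes):
    """Return (column_names, matrix) with one 0/1 column per attribute category."""
    columns = [(a, v) for a in attributes for v in sorted({r[a] for r in rows})]
    names = [f"{a}={v}" for a, v in columns]
    matrix = [[1 if r[a] == v else 0 for a, v in columns] for r in rows]
    return names, matrix

# Hypothetical discrete descriptors (not the tutorial's dataset).
rows = [
    {"age": "young", "income": "low"},
    {"age": "old", "income": "high"},
    {"age": "young", "income": "high"},
]

names, matrix = indicator_coding(rows, ["age", "income"])
```

Each row of the table sums to the number of attributes, since an instance takes exactly one category per attribute.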

In this tutorial, we compare the results of Tanagra 1.4.28 and R 2.7.2.

Keywords: data clustering, k-means, multiple correspondence analysis, factorial analysis, clusters interpretation, data exportation
Components: MULTIPLE CORRESPONDENCE ANALYSIS, K-MEANS, GROUP CHARACTERIZATION, CONTINGENCY CHI-SQUARE, EXPORT DATASET
Tutorial: en_Tanagra_KMeans_Deploiement.pdf
Dataset: banque_classif_deploiement.zip
References:
Wikipedia, "K-Means algorithm"
F. Husson, S. Lê, J. Josse, J. Mazet, "FactoMineR – A package dedicated to Factor Analysis and Data Mining with R"