Tanagra - Data Mining and Data Science Tutorials: November 2008

Thursday, November 13, 2008

Decision tree and large dataset

Dealing with large datasets is one of the most important challenges in data mining. In this context, it is interesting to analyze and compare the performance of various free implementations of the learning methods, especially their computation time and memory occupation. Most of the programs load the whole dataset into memory, so the main bottleneck is the available memory.

In this tutorial, we compare the performance of several implementations of the C4.5 algorithm (Quinlan, 1993) when processing a file containing 500,000 observations and 22 variables. The programs used are: Knime 1.3.5; Orange 1.0b2; R (rpart package) 2.6.0; RapidMiner Community Edition; Sipina Research; Tanagra 1.4.27; Weka 3.5.6.

Our data file is a well-known artificial dataset described in the CART book (Breiman et al., 1984). We have generated a dataset with 500,000 observations. The class attribute has 3 values, and there are 21 continuous predictors.

Keywords: c4.5, decision tree, classification tree, large dataset, knime, orange, r, rapidminer, sipina, tanagra, weka
Components: SUPERVISED LEARNING, C4.5
Tutorial: en_Tanagra_Perfs_Comp_Decision_Tree.pdf
Dataset: wave500k.zip
Reference: R. Quinlan, « C4.5 : Programs for Machine Learning », Morgan Kaufmann, 1993.

Tuesday, November 11, 2008

Decision tree and cross validation (continued)

In a previous tutorial, we compared the implementation of decision tree induction and the performance of cross validation evaluation in three programs: TANAGRA, ORANGE and WEKA.

In this paper, we extend the same framework to the comparison of three new programs: R 2.7.2, KNIME 1.3.51 and RAPIDMINER Community Edition.

Keywords: supervised learning, decision tree, classification tree, classifier assessment
Components: Supervised learning, C-RT, Cross validation
Tutorial: en_Tanagra_Validation_Croisee_Suite.pdf
Dataset: heart.zip

Monday, November 10, 2008

Interactive induction of decision tree

Interactive induction of decision trees with SIPINA.

Various functionalities of SIPINA are not documented. In this tutorial, we show how to explore the nodes of a decision tree, in order to obtain a better understanding of the characteristics of the subpopulation on a node. This is an important task, for instance when we want to validate the rules with a domain expert.

Keywords: decision tree, classification tree, interactive analysis
Tutorial: en_sipina_interactive.pdf
Dataset: blood_pressure_levels.xls

Decision tree and contextual descriptive statistics

SIPINA offers some descriptive statistics functionalities. In itself, this is not really exceptional; many free tools do the same.

It becomes more interesting when we combine these tools with the decision tree. The exploratory phase is improved. Indeed, every node of the tree corresponds to a subpopulation. The variables which do not appear in the tree are not necessarily irrelevant. Some of them may have been hidden during the tree learning, which selects the “best” variables. By computing contextual descriptive statistics for each node, we better understand the prediction rules highlighted during the induction process.

Keywords: descriptive statistics, decision tree, interactive exploration
Tutorial: en_sipina_descriptive_statistics.pdf
Dataset: heart_disease_male.xls

Cost-sensitive Decision Tree

Error rate evaluation is a key point of the induction process. A usual approach is to partition the dataset into a learning set, which is used for the induction of the classification model, and a test set, which is used for the performance evaluation.

The first subject of this tutorial is to show how to partition the dataset with SIPINA. Then, we build the tree on the first part of the dataset and classify the examples of the second part. By comparing the predicted values with the true values, we obtain an honest estimate of the error rate.

The second main subject of this document is to show how to take into account the misclassification costs during the learning process and the evaluation process. We use a slightly modified version of C4.5 (Quinlan, 1993).
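
As a rough illustration of the second point, the sketch below shows one common way to use a cost matrix: predict the class with the lowest expected cost, and evaluate a classifier by its average cost per example instead of its error rate. This is not the modified C4.5 used in the tutorial; the function names, the cost values and the confusion matrix are invented for the example.

```python
# Hypothetical cost matrix, indexed as COST[true_class][predicted_class].
# Here, flagging a legitimate message as spam is assumed 10 times more
# costly than letting a spam through.
COST = {
    "spam": {"spam": 0.0, "ham": 1.0},
    "ham": {"spam": 10.0, "ham": 0.0},
}


def min_cost_class(probabilities, cost):
    """Return the prediction with the lowest expected misclassification cost.

    probabilities maps each class to its estimated probability, for
    instance the class distribution observed on a leaf of the tree.
    """
    classes = list(probabilities)

    def expected_cost(predicted):
        return sum(probabilities[true] * cost[true][predicted] for true in classes)

    return min(classes, key=expected_cost)


def average_cost(confusion, cost):
    """Average cost per example, with confusion[true_class][predicted_class] counts."""
    total = sum(sum(row.values()) for row in confusion.values())
    weighted = sum(
        confusion[true][pred] * cost[true][pred]
        for true in confusion
        for pred in confusion[true]
    )
    return weighted / total


# Made-up test-set results: 10 spams missed, 2 legitimate messages flagged.
confusion = {
    "spam": {"spam": 40, "ham": 10},
    "ham": {"spam": 2, "ham": 48},
}
print(min_cost_class({"spam": 0.8, "ham": 0.2}, COST))
print(average_cost(confusion, COST))
```

With these assumed costs, a leaf where 80% of the examples are spam is still labeled "ham", because the expected cost of predicting "spam" (0.2 x 10) exceeds that of predicting "ham" (0.8 x 1).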

Keywords: decision trees, C4.5, classifier evaluation, cost-sensitive learning, F-Measure, spam detection
Tutorial: en_sipina_cost_sensitive.pdf
Dataset: spam.xls

Semi-partial correlation

The semi-partial correlation measures the additional information of an independent variable (X), compared with one or several control variables (Z1,..., Zp), that we can use for the explanation of a dependent variable (Y).

We can compute the semi-partial correlation in various ways. The square of the semi-partial correlation can be obtained as the difference between the squared multiple correlation coefficient of the regression Y / X, Z1,...,Zp (including X) and the same quantity for the regression Y / Z1,...,Zp (without X).

We can also obtain the semi-partial correlation by computing the residuals of the regression X / Z1,...,Zp; then, we compute the correlation between Y and these residuals. In other words, we seek to quantify the relationship between X and Y after removing the effect of Z from X only. This is why the semi-partial correlation is an asymmetrical measure.

In this tutorial, we show the different ways for computing the semi-partial correlation.
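
The two computations described above can be checked against each other with a minimal sketch. To stay within the Python standard library, it assumes a single control variable Z, for which the R² of the two regressions can be written with pairwise correlations; the data values and function names are invented for the illustration and do not come from the tutorial.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def residuals(target, control):
    """Residuals of the simple linear regression target / control."""
    n = len(target)
    mean_t, mean_c = sum(target) / n, sum(control) / n
    slope = sum((c - mean_c) * (t - mean_t) for c, t in zip(control, target))
    slope /= sum((c - mean_c) ** 2 for c in control)
    intercept = mean_t - slope * mean_c
    return [t - (intercept + slope * c) for t, c in zip(target, control)]


def semi_partial_by_residuals(y, x, z):
    """Correlation between Y and the part of X not explained by Z."""
    return pearson(y, residuals(x, z))


def semi_partial_squared_by_r2(y, x, z):
    """R2 of Y / X, Z minus R2 of Y / Z (one control variable only)."""
    r_yx, r_yz, r_xz = pearson(y, x), pearson(y, z), pearson(x, z)
    r2_full = (r_yx ** 2 + r_yz ** 2 - 2 * r_yx * r_yz * r_xz) / (1 - r_xz ** 2)
    r2_reduced = r_yz ** 2
    return r2_full - r2_reduced


# Made-up data, only to compare the two routes.
y = [1.0, 3.0, 2.0, 5.0, 4.0, 7.0, 8.0, 6.0]
x = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 9.0]
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(semi_partial_by_residuals(y, x, z) ** 2)
print(semi_partial_squared_by_r2(y, x, z))
```

Both printed values should agree up to rounding. Swapping the roles of X and Y generally gives a different result, which illustrates the asymmetry of the measure.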

Keywords: correlation, Pearson's correlation, semi-partial correlation, multiple linear regression
Components: LINEAR CORRELATION, MULTIPLE LINEAR REGRESSION, SEMI-PARTIAL CORRELATION
Tutorial: en_Tanagra_Semi_Partial_Correlation.pdf
Dataset: cars_semi_partial_correlation.xls
Reference: M. Brannick, « Partial and Semipartial Correlation », University of South Florida.

Partial correlation

Partial correlation measures the degree of association between two random variables, with the effect of a set of controlling variables removed.

In this tutorial, we show how to use the PARTIAL CORRELATION component of Tanagra. We reproduce the example described online (see Reference). Thus, in addition to the presentation of the theoretical method, we can follow the detail of all the calculations involved.
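
For readers who want to check the idea by hand, here is a minimal sketch of the partial correlation with a single controlling variable Z. It computes the coefficient with the usual first-order formula and, as a cross-check, as the correlation between the residuals of X / Z and Y / Z. The single control variable, the data values and the function names are assumptions made for the illustration; this is not the code of the Tanagra component.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def partial_correlation(x, y, z):
    """First-order partial correlation of X and Y, controlling for Z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def residuals(target, control):
    """Residuals of the simple linear regression target / control."""
    n = len(target)
    mean_t, mean_c = sum(target) / n, sum(control) / n
    slope = sum((c - mean_c) * (t - mean_t) for c, t in zip(control, target))
    slope /= sum((c - mean_c) ** 2 for c in control)
    return [(t - mean_t) - slope * (c - mean_c) for t, c in zip(target, control)]


def partial_correlation_by_residuals(x, y, z):
    """Same quantity: correlate what is left of X and of Y once Z is removed."""
    return pearson(residuals(x, z), residuals(y, z))


# Made-up data, only to compare the two computations.
x = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 9.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0, 7.0, 8.0, 6.0]
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(partial_correlation(x, y, z))
print(partial_correlation_by_residuals(x, y, z))
```

Unlike the semi-partial correlation, the effect of Z is removed from both variables here, so the measure is symmetrical in X and Y.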

Keywords: correlation, Pearson's correlation, rank correlation, Spearman's rho, partial correlation
Components: LINEAR CORRELATION, SPEARMAN’S RHO, PARTIAL CORRELATION
Tutorial: en_Tanagra_Partial_Correlation.pdf
Dataset: wechsler_adult_intelligence_scale.xls
Reference: S. Rathbun, A. Wiesner, « STAT 505 – Applied Multivariate Statistical Analysis », The Pennsylvania State University, Lesson 7 : Partial Correlations

Friedman Anova by Ranks

In this tutorial, we show how to use the FRIEDMAN’S ANOVA BY RANKS component. We use this test when we want to check the null hypothesis that K related (matched) samples come from the same population. The matching can be achieved by studying the same group of individuals under each of the K conditions (repeated measures under various conditions).
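
To make the mechanics of the test concrete, here is a minimal sketch of the Friedman statistic: the K measurements of each individual are ranked, the ranks are summed per condition, and the sums are combined into a chi-square-like statistic. The sketch uses average ranks for ties but applies no tie correction, and it does not compute the p-value; the function names and the data are invented for the illustration.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = average
        i = j + 1
    return ranks


def friedman_statistic(blocks):
    """Friedman statistic for n individuals (rows) measured under k conditions."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Made-up data: 4 individuals, 3 conditions, identical ordering for everyone.
blocks = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [2.0, 3.0, 4.0],
]
print(friedman_statistic(blocks))
```

Under the null hypothesis, the statistic approximately follows a chi-square distribution with K - 1 degrees of freedom when the number of individuals is large enough.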

Keywords: comparison of population, matched samples, analysis of variance, ranks, nonparametric, ANOVA
Components: Friedman’s ANOVA by Rank, One-way ANOVA, Kruskal-Wallis 1-way ANOVA
Tutorial: en_Tanagra_Friedman_Anova.pdf
Dataset: howell_book_friedman_anova_dataset.zip
Reference: Wikipedia, « Friedman test »

Manova - Multivariate analysis of variance

In this tutorial, we show how to use the ONE WAY MANOVA component (Multivariate Analysis of Variance): unlike classical ANOVA, there is more than one dependent variable.

We will see that a multivariate test and a combination of univariate tests can lead to different conclusions.

Keywords: multivariate analysis of variance, variance covariance matrix
Components: One-way ANOVA, One-Way MANOVA
Tutorial: en_Tanagra_Manova.pdf
Dataset: tomassone_p_29.xls
Reference:
S. Rathbun, A. Wiesner, "STAT 505 - Applied Multivariate Statistical Analysis", Penn State University, Department of Statistics.

Normality test

A goodness-of-fit test is used to decide whether a sample comes from a population with a specific distribution. TANAGRA has a new component, which uses several tests in order to check the normality assumption.

We use an artificial dataset in this tutorial. We have generated it from 3 distributions: uniform, normal and log-normal.

Keywords: Shapiro-Wilk's test, Lilliefors' test, Anderson-Darling's test, d’Agostino's test
Components: More Univariate cont stat, Normality Test
Tutorial: en_Tanagra_Normality_Test.pdf
Dataset: normality_test_simulation.xls
Reference:
Wikipedia - "Normality test"

Two-sample t-test

In this tutorial, we show how to use TANAGRA to determine whether two population means are equal. The conditional variances may be assumed equal or unequal, i.e. we use a pooled or a separate variance estimate.
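
The difference between the two variants can be sketched in a few lines. The example below computes the t statistic and its degrees of freedom with a pooled variance (equal variances assumed) and with separate variances (Welch's approximation). It is only an illustration with invented data and function names, not the code of the T-Test components, and it leaves out the p-value.

```python
import math
from statistics import mean, variance


def pooled_t(a, b):
    """t statistic and degrees of freedom, assuming equal variances."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2


def welch_t(a, b):
    """t statistic and approximate degrees of freedom, unequal variances."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Made-up samples with clearly different spreads.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
print(pooled_t(group_a, group_b))
print(welch_t(group_a, group_b))
```

When both samples have the same size, the two t statistics coincide and only the degrees of freedom differ; with unbalanced samples the statistics themselves diverge.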

Keywords: test for mean comparison, Student's t-test
Components: T-Test, T-Test Unequal Variance
Tutorial: en_Tanagra_Two_Sample_T_Test_For_Equal_Means.pdf
Dataset: auto83b.xls
Reference: NIST/SEMATECH, « e-Handbook of Statistical Methods », Section 7.3.1 « Do two processes have the same mean ? ».

ANOVA and test for equality of variances

In this tutorial, we show how to use TANAGRA in an analysis of variance problem. We also test the homogeneity of variances assumption on the same dataset.

We use the GEAR dataset (NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/). We consider 10 machine tools which produce gears. We have 10 batches of 10 observations. We want to test two assumptions: (1) is the average diameter of the gears the same for all the machines? (2) Is the variability of the gear diameter the same for all the machines?

Keywords: analysis of variance, test for equality of variances, bartlett's test, levene's test, brown-forsythe's test
Components: One-way ANOVA, Bartlett’s test, Levene’s test, Brown-Forsythe test
Tutorial: Anova and Tests for Equality of Variances
Dataset: gear_data_from_nist.xls
Reference: NIST/SEMATECH, « e-Handbook of Statistical Methods », Chapter 7 « Product and Process Comparisons ».

Sunday, November 9, 2008

Nonparametric statistics

In this tutorial, we show how to apply some nonparametric statistical techniques with Tanagra.

Various approaches are available: difference between populations for independent and related samples, nonparametric correlations, measures of association between nominal variables, etc.

Keywords: tests for independent samples, tests for related samples, analysis of variance, measures of association, wald and wolfowitz runs test, mann and whitney test, kruskal and wallis, spearman's rho, kendall's tau, sign test, wilcoxon signed rank test
Components: Mann-Whitney Comparison, Wald-Wolfowitz Runs Test, Kruskal-Wallis 1-way ANOVA, One-way ANOVA, Spearman’s rho, Kendall’s tau, Sign Test, Wilcoxon Ranks Test, Paired T-Test
Tutorial: en_Tanagra_Nonparametric_Statistics.pdf
Dataset: nonpametric_statistics_dataset.xls
References:
S. Siegel, J. Castellan, « Nonparametric Statistics for the Behavioral Sciences », McGraw-Hill, 1988.
D. Sheskin, "Handbook of parametric and nonparametric statistical procedures", Chapman & Hall, 2007.

Measures of association for ordinal variables

In this tutorial, we show how to use TANAGRA for measuring the association between ordinal variables.

All the measures that we present here rely on the concept of concordant and discordant pairs. From a theoretical point of view, measures intended for continuous attributes, such as the correlation coefficient, are not well suited to this context. From a practical point of view, however, we show in this tutorial that they nevertheless give interesting results for studying the dependence between ordinal variables.
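
The notion of pairs can be made concrete with a minimal sketch of Goodman and Kruskal's gamma. Every pair of individuals is examined: it is concordant when both variables order the two individuals in the same way, discordant when they order them in opposite ways, and ignored when it is tied on either variable. The integer coding of the ordinal levels, the data and the function name are assumptions made for the illustration.

```python
from itertools import combinations


def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D), with C concordant and D discordant pairs.

    x and y hold ordinal levels coded as increasing integers.
    Pairs tied on x or on y are ignored.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Made-up ordinal codes (1 = low, 2 = medium, 3 = high) for six individuals.
levels_a = [1, 1, 2, 2, 3, 3]
levels_b = [1, 2, 1, 3, 2, 3]
print(goodman_kruskal_gamma(levels_a, levels_b))
```

The pairwise loop is quadratic in the number of individuals; on large files the same counts are usually derived from the contingency table instead.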

Keywords: concordant and discordant pairs, contingency table, goodman and kruskal's gamma, kendall's tau-c, somers' d
Components: Goodman Kruskal Gamma, Kendall Tau-c, Sommers d, Linear Correlation
Tutorial: en_Tanagra_Measures_of_Association_Ordinal_Variables.pdf
Dataset: blood_pressure_ordinal_association.xls
Reference:
D. Garson, « Measures of association », in Statnotes : Topics in Multivariate Analysis.

Measures of association for nominal variables

To measure the association between two continuous variables, we generally use the correlation coefficient. Its drawbacks and its qualities are well known.

When we want to characterize the association between nominal variables, the correlation coefficient is not suitable. We must use other indicators. The most widespread is certainly the chi-square test, which makes it possible to test the absence of a relationship. We see in this tutorial that other measures are available, and we show how to use them with TANAGRA.
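
As a minimal sketch of the symmetrical measures listed in the keywords, the code below computes the chi-square statistic of a contingency table, then Cramer's v and Tschuprow's t, which rescale it to the [0, 1] range. The tables and function names are invented for the illustration, and the sketch assumes that no row or column total is zero.

```python
import math


def chi_square(table):
    """Chi-square statistic and total count of a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic, n


def cramers_v(table):
    """Cramer's v: chi-square normalized by n * min(rows - 1, columns - 1)."""
    statistic, n = chi_square(table)
    rows, cols = len(table), len(table[0])
    return math.sqrt(statistic / (n * min(rows - 1, cols - 1)))


def tschuprows_t(table):
    """Tschuprow's t: chi-square normalized by n * sqrt((rows - 1)(columns - 1))."""
    statistic, n = chi_square(table)
    rows, cols = len(table), len(table[0])
    return math.sqrt(statistic / (n * math.sqrt((rows - 1) * (cols - 1))))


# Made-up tables: no association, then a perfect association.
independent = [[10, 20], [20, 40]]
perfect = [[10, 0], [0, 10]]
print(chi_square(independent)[0], cramers_v(independent))
print(chi_square(perfect)[0], cramers_v(perfect), tschuprows_t(perfect))
```

Both coefficients are equal on square tables; on rectangular tables Tschuprow's t cannot reach 1, which is one reason Cramer's v is often preferred.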

Keywords: association between nominal variables, contingency table, chi-square test, tschuprow's t, cramer's v, asymmetrical association, pre measures (proportional reduction in error), goodman and kruskal's tau, theil's u, partial association, partial theil's u
Components: Contingency Chi-Square, Goodman-Kruskal Tau, Theil U, Partial Theil U, Discrete select examples
Tutorial: en_Tanagra_Measures_of_Association_Nominal_Variables.pdf
Dataset: fuel_consumption.xls
Reference:
D. Garson, « Measures of association », in Statnotes : Topics in Multivariate Analysis.

Correlation coefficient

Computing the correlation coefficient and sorting the results according to this indicator is a recurring task for the data miner. In this tutorial, we show how to do it.

We show how to quickly set up the calculation of the linear correlation (1) of an endogenous variable with exogenous variables in order to detect relevant attributes; (2) between exogenous variables in order to detect collinearities.

Keywords: linear correlation coefficient, partial correlation
Components: Linear correlation, Residual scores
Tutorial: en_Tanagra_Linear_Correlation.pdf
Dataset: cars_acceleration.xls
References:
D. Garson, « Correlation », in Statnotes : Topics in Multivariate Analysis.
D. Garson, « Partial correlation », in Statnotes : Topics in Multivariate Analysis.

Descriptive statistics

In this tutorial, we show how to compute univariate descriptive statistics for continuous and discrete variables.

We use mainly tabular descriptions and summary statistics.

Keywords: descriptive statistics
Components: View dataset, Univariate continuous stat, Univariate discrete stat, Group characterization
Tutorial: enBasics.pdf
Dataset: breast.txt
References:
Wikipedia - "Descriptive Statistics"
M. Chow, L. Strauss, "STAT 500 - Applied Statistics", Penn State University, Department of Statistics.

Saturday, November 8, 2008

PLS Regression - Number of factors

In this tutorial, we show how to detect the right number of factors for a PLS regression using a resampling approach.

Standard criteria such as PRESS or Q2 are used. The upstream component (PLS Regression and derived components) can be automatically updated.

Keywords: pls regression, factor analysis
Components: PLS Factorial, PLS Selection
Tutorial: en_Tanagra_PLS_Selecting_Factors.pdf
Dataset: protien.txt

Clustering - The EM algorithm

In the Gaussian mixture model-based clustering, each cluster is represented by a Gaussian distribution. The entire dataset is modeled by a mixture (a linear combination) of these distributions.

The EM (Expectation Maximization) algorithm is used in practice to find the “optimal” parameters of the distributions that maximize the likelihood function.

The number of clusters is a parameter of the algorithm. But we can also detect the “optimal” number of clusters by evaluating several values, i.e. testing 1 cluster, 2 clusters, etc. and choosing the best one (which maximizes the likelihood or another criterion such as AIC or BIC).
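
The alternation between the two steps of the algorithm can be illustrated with a deliberately small sketch: a one-dimensional Gaussian mixture fitted by EM with the Python standard library only. The data, the starting means, the fixed number of iterations and the function names are assumptions made for the example; it is not the implementation of the EM-Clustering component.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, initial_means, n_iter=50):
    """Fit a 1-D Gaussian mixture by EM, one component per initial mean.

    Returns the weights, means, variances and the log-likelihood recorded
    at each iteration (before the parameters are updated).
    """
    k, n = len(initial_means), len(data)
    means = list(initial_means)
    overall = sum(data) / n
    start_var = sum((x - overall) ** 2 for x in data) / n
    variances = [start_var] * k
    weights = [1.0 / k] * k
    history = []
    for _ in range(n_iter):
        # E-step: probability that each point belongs to each component.
        resp, loglik = [], 0.0
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            loglik += math.log(total)
            resp.append([d / total for d in dens])
        history.append(loglik)
        # M-step: re-estimate the parameters from the weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # guard against a collapsing component
    return weights, means, variances, history


# Made-up data: two well separated groups around 0 and 10.
data = [-1.0, -0.5, 0.0, 0.5, 1.0, 9.0, 9.5, 10.0, 10.5, 11.0]
weights, means, variances, history = em_gmm_1d(data, [2.0, 8.0])
print(weights, means, variances)
```

The recorded log-likelihood never decreases from one iteration to the next, which is the property that makes it usable as a convergence check. To compare several numbers of clusters, one would run the function with 1, 2, ... starting means and penalize the final log-likelihood by the number of parameters, as AIC or BIC do.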

Keywords: clustering, expectation maximization algorithm, gaussian mixture model
Components: EM-Clustering, K-Means, EM-Selection, scatterplot
Tutorial: en_Tanagra_EM_Clustering.pdf
Dataset: two_gaussians.xls
Reference:
Wikipédia (en) -- Expectation-maximization algorithm

Combining clustering and graphical approaches

In this tutorial, we show the complementarity between a clustering method (HAC - Hierarchical agglomerative clustering) and a factor analysis approach for multivariate data visualization (PCA - Principal Component Analysis).

The aim is to obtain a better understanding of the underlying concepts that organize the data.

Keywords: HAC, clustering, PCA, factor analysis, statistical graphics, visualizing multivariate data
Components: HAC, Group characterization, Principal component analysis, correlation scatterplot, scatterplot
Tutorial: en_Tanagra_hac_pca.pdf
Dataset: cars.xls

Clustering trees

The aim of clustering is to build groups of individuals so that the examples in the same group are similar and the examples in different groups are dissimilar.

Top-down induction of clustering trees adapts the supervised decision/regression tree framework to clustering. The groups are built by recursive partitioning of the dataset; the internal nodes of the tree are classically split with input attributes. The resulting model, the clustering tree, describes the groups; the learning algorithm automatically selects the relevant attributes.

The clustering tree approach is not well known; we show in this tutorial the interesting properties of this method. Our main references are the papers of Chavent (1998) and Blockeel (1998).

Keywords: clustering algorithm, clustering tree, groups characterization
Components: Multiple Correspondance Analysis, CTP, Contingency Chi-Square, K-Means
Tutorial: en_Tanagra_Clustering_Tree.pdf
Dataset: zoo.xls
References:
M. Chavent (1998), « A monothetic clustering method », Pattern Recognition Letters, 19, 989-996.
H. Blockeel, L. De Raedt, J. Ramon (1998), « Top-Down Induction of Clustering Trees », ICML, 55-63.

Interactive Group Exploration

Most of the time, the statistician must build groups of individuals and wants to characterize them. The main interest of this very simple approach is that the results are easy to read and understand.

In this tutorial, we show how to build groups with some (target) attributes, and describe them with other (input) attributes. This component can be useful when we want to describe groups produced by a clustering algorithm, for instance.

Keywords: visual group exploration, group characterization
Components: Group characterization, Group Exploration
Tutorial: en_Tanagra_Group_Exploration.pdf
Dataset: autos.xls

Canonical discriminant analysis

We show how to use the CANONICAL DISCRIMINANT ANALYSIS component.

One of the goals of this method is to produce new variables (“latent” variables) from a set of examples classified into predefined classes. These new variables optimize the separation between groups.

This approach can be seen as a sophisticated graphical method. We mainly show its graphical capabilities in this tutorial.

Keywords: canonical discriminant analysis, latent variables, visualization technique
Components: Canonical discriminant analysis, Scatterplot
Tutorial: en_Tanagra_Canonical_Discriminant_Analysis.pdf
Dataset: wine_quality.xls
Reference: D. Garson, "Statnotes: Topics in Multivariate Analysis - Discriminant Function Analysis".

Variable clustering (VARCLUS)

Variable clustering can be viewed as a clustering of the individuals where we would have transposed the dataset. But, instead of using the Euclidean distance to compute the similarities between examples, we use the correlation coefficient (or the squared correlation coefficient).

Variable clustering may be useful in several situations. It can be used in order to detect the main dimensionality in the dataset; it may be used also in a feature selection process, in order to select the most relevant attributes for the subsequent analysis. The synthesized variable which represents a group, the main factor of PCA (Principal Component Analysis), may be used also.

Keywords: variable clustering, latent variables
Components: VARHCA, VARKMeans, VARCLUS
Tutorial: en_Tanagra_VarClus.pdf
Dataset: crime_dataset_from_DASL.xls
References:
E. Vigneau and E. Qannari, « Clustering of variables around latent components », Simulation and Computation, 32(4), 1131-1150, 2003.
SAS OnlineDoc – Version 8, « The VARCLUS Procedure ».

K-Means algorithm on discrete attributes

In this tutorial, we show how to perform a K-Means clustering. We validate the results by comparing the clusters with a predefined classification.

We address an additional problem in this tutorial: the descriptors are categorical, so we cannot directly launch the K-Means with the usual Euclidean distance. We propose a two-step approach: (1) transform the original dataset using a correspondence analysis; (2) launch the K-Means on the first X latent variables. In this second step, we can then use the standard K-Means algorithm based on the Euclidean distance.

Keywords: clustering, k-means, correspondence analysis, cluster description
Components: Multiple Correspondance Analysis, K-Means, Group characterization, Cross Tabulation
Tutorial: en_dr_clustering_validation_externe.pdf
Dataset: dr_vote.bdm
References:
Wikipédia, « K-means algorithm ».
Statsoft Inc., "Correspondence Analysis".

Friday, November 7, 2008

HAC and Hybrid Clustering

HAC (Hierarchical Agglomerative Clustering) is a clustering method that produces “natural” groups of examples characterized by attributes. The clustering process is described by a tree, called a dendrogram, which shows the successive agglomerations, starting from one example per cluster and ending when the whole dataset belongs to a single cluster.

The main advantage of HAC is that the user can guess the right partitioning by visualizing the tree; he usually prunes the tree between nodes presenting an important variation. The main disadvantage is that it requires the computation of the distances between each pair of examples, which is very time consuming when the dataset size increases.

TANAGRA implements the standard HAC, but it also implements a variation of HAC called HYBRID CLUSTERING. Since we often need only a small number of clusters, the construction of the lower part of the tree is delegated to a fast method.

There are two steps in the new algorithm:
• First, low-level clusters are built with a fast clustering method such as K-MEANS or SOM;
• HAC then starts from these clusters and builds the dendrogram.

Note that any clustering algorithm can provide the low-level clusters; users can also specify them. Lastly, rather than the tree itself, it is the gap between the nodes which is important; these values are provided in a table.

Keywords: clustering, unsupervised learning, HAC, K-Means
Components: HAC, K-Means, Group characterization
Tutorial: enHAC_IRIS.pdf
Dataset: iris_hac.bdm
References:
L. Lebart, A. Morineau, M. Piron, " Statistique exploratoire multidimensionnelle ", Dunod, 2000 ; pp. 177 - 184.
Matteo Matteucci, "A tutorial on Clustering Algorithms - Hierarchical Clustering Algorithms".

Correspondence analysis

Correspondence analysis is a visualization technique. It makes it possible to see the association between rows and columns in a large contingency table. It belongs to the "factorial analysis" family of approaches. The method computes some axes, which are latent variables that we interpret in order to understand the proximities between rows and/or columns.

TANAGRA is not really intended for contingency tables, so we use a workaround. The rows are specified from a discrete attribute, and the columns correspond to several continuous attributes in our dataset. We cannot treat a contingency table with more than 255 rows.

This tutorial was inspired by the presentation of Lebart, Morineau and Piron in their book, « Statistique Exploratoire Multidimensionnelle », Dunod, 2000. Unfortunately, I don't think there is an English translation of this very good teaching book, which is really popular in France. However, I hope this tutorial is understandable without the book. If you read French, the description of correspondence analysis is available in section 1.3 (pp. 67-107).

Keywords: correspondence analysis, contingency table, chi-square, factorial plane, contributions, squared cosines
Components: Correspondence analysis
Tutorial: en_Tanagra_Afc.pdf
Dataset: media_prof_afc.xls
Reference:
L. Lebart, A. Morineau, M. Piron, " Statistique exploratoire multidimensionnelle ", Dunod, 2000.
Statsoft Inc., "Correspondence Analysis".
D. Garson, "Statnotes - Correspondence Analysis".

Thursday, November 6, 2008

Regression trees

The aim of regression analysis is to produce a model that can predict or explain the values of a continuous variable (endogenous) from the values of a list of predictors (exogenous), continuous or discrete.

Multiple linear regression is certainly the best-known method, but other methods such as regression trees can perform this task. In this case, the predictive model has the appearance of a tree. We use "if-then" conditions in order to predict the value of the target variable for an individual from its description.

Keywords: regression trees, CART
Components: Regression tree
Tutorial: en_Tanagra_Regression_Tree.pdf
Dataset: housign.arff
Reference:
L. Breiman, J. Friedman, R. Olshen, C. Stone, « Classification and Regression Trees », Wadsworth International, 1984.
Statsoft Inc., "Classification and Regression Trees (C&RT)"

Forward selection for regression analysis

In this tutorial, we show how to use the FORWARD ENTRY REGRESSION component: it performs a multiple linear regression with a forward variable selection based on partial correlation.

We use CRIME_DATASET_FROM_DASL.XLS from the DASL website. It contains various characteristics of 47 US states. We want to explain the crime rate from unemployment, education level, …

Keywords: linear multiple regression, variable selection, forward selection, stepwise, colinearity, partial correlation
Components: View Dataset, Multiple linear regression
Tutorial: en_Tanagra_Forward_Selection_Regression.pdf
Dataset: crime_dataset_from_DASL.xls
References:
Wikipedia - "Stepwise regression"

Multiple linear regression

In this tutorial, we show how to perform a regression analysis with Tanagra.

Our dataset consists of descriptions of cars. We want to predict the “mpg” consumption from car characteristics such as weight, horsepower, …

Keywords: linear regression, endogenous variable, exogenous variables
Components: View Dataset, Multiple linear regression
Tutorial: enRegression.pdf
Dataset: autompg.bdm
References:
Wikipedia - "Linear regression"

PLS Regression

PLS (Partial Least Squares) Regression can be viewed as a multivariate regression framework where we want to predict the values of several target variables (Y1, Y2, …) from the values of several input variables (X1, X2, …).

Roughly speaking, the algorithm is the following: “The components of X are used to predict the scores on the Y components, and the predicted Y component scores are used to predict the actual values of the Y variables. In constructing the principal components of X, the PLS algorithm iteratively maximizes the strength of the relation of successive pairs of X and Y component scores by maximizing the covariance of each X-score with the Y variables. This strategy means that while the original X variables may be multicollinear, the X components used to predict Y will be orthogonal”.

The dataset used corresponds to 6 orange juices described by 16 physicochemical descriptors and evaluated by 96 judges [Source : Tenenhaus, M., Pagès, J., Ambroisine L. and Guinot, C. (2005). PLS methodology for studying relationships between hedonic judgements and product characteristics. Food Quality and Preference. 16, 4, pp 315-325].

Keywords: pls regression, factorial analysis, multiple linear regression
Components: PLS Regression
Tutorial: en_Tanagra_PLS.pdf
Dataset: orange.bdm
References:
M. Tenenhaus, « La régression PLS – Théorie et pratique », Technip, 1998.
H. Abdi, "Partial Least Square Regression".
Garson, « Partial Least Squares Regression (PLS) », http://www2.chass.ncsu.edu/garson/PA765/pls.htm

PLS Regression for Classification Task

PLS (Partial Least Squares) Regression can be viewed as a multivariate regression framework where we want to predict the values of several target variables (Y1, Y2, …) from the values of several input variables (X1, X2, …).

Roughly speaking, the algorithm is the following: “The components of X are used to predict the scores on the Y components, and the predicted Y component scores are used to predict the actual values of the Y variables. In constructing the principal components of X, the PLS algorithm iteratively maximizes the strength of the relation of successive pairs of X and Y component scores by maximizing the covariance of each X-score with the Y variables. This strategy means that while the original X variables may be multicollinear, the X components used to predict Y will be orthogonal”.

PLS Regression was initially defined for the prediction of a continuous target variable. But it seems it can also be useful in supervised learning problems where we want to predict the values of discrete attributes. In this tutorial, we propose a few variants of PLS Regression adapted to the prediction of a discrete variable. The generic name "PLS-DA" (Partial Least Squares Discriminant Analysis) is often used in the literature.

Keywords: pls regression, discriminant analysis, supervised learning
Components: C-PLS, PLS-DA, PLS-LDA
Tutorial: en_Tanagra_PLS_DA.pdf
Dataset: breast-cancer-pls-da.xls
References:
S. Chevallier, D. Bertrand, A. Kohler, P. Courcoux, « Application of PLS-DA in multivariate image analysis », in J. Chemometrics, 20 : 221-229, 2006.
Garson, « Partial Least Squares Regression (PLS) », http://www2.chass.ncsu.edu/garson/PA765/pls.htm

Multinomial logistic regression

In this tutorial, we show how to implement a multinomial logistic regression with TANAGRA.

Logistic regression is a technique for making predictions when the dependent variable is a dichotomy, and the independent variables are continuous and/or discrete. The technique can be modified to handle a dependent variable with several (K > 2) levels.

When the response categories are unordered, we have the multinomial logistic regression. Roughly speaking, we compute the logit function for each of the (K-1) categories relative to a reference category.
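
The role of the reference category can be shown with a minimal sketch of the prediction step: each of the K-1 non-reference categories has its own logit (intercept plus weighted sum of the predictors), the reference category has an implicit logit of zero, and the K probabilities are obtained by normalization. The coefficient values, the category names and the function name are invented for the illustration; estimating the coefficients is not covered here.

```python
import math


def multinomial_probabilities(x, coefficients, reference):
    """Class probabilities of a baseline-category logit model.

    coefficients maps each non-reference category to a pair
    (intercept, list of weights); the reference category has none.
    """
    logits = {
        category: intercept + sum(w * v for w, v in zip(weights, x))
        for category, (intercept, weights) in coefficients.items()
    }
    denominator = 1.0 + sum(math.exp(value) for value in logits.values())
    probabilities = {c: math.exp(value) / denominator for c, value in logits.items()}
    probabilities[reference] = 1.0 / denominator
    return probabilities


# Hypothetical coefficients for 3 brands, with "brand_1" as the reference.
coefficients = {
    "brand_2": (-1.0, [0.5, 0.1]),
    "brand_3": (-2.0, [0.3, 0.4]),
}
print(multinomial_probabilities([1.0, 2.0], coefficients, "brand_1"))
```

Changing the reference category changes the coefficients but not the predicted probabilities, which is why the choice is mostly a matter of interpretation.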

Keywords: multinomial logistic regression
Components: Supervised Learning, Multinomial Logistic Regression
Tutorial: en_Tanagra_Multinomial_Logistic_Regression.pdf
Dataset: brand_multinomial_logit_dataset.xls
References:
A. Slavkovic, « Multinomial Logistic Regression Models – Baseline-Category Logit Model », in « STAT 504 – Analysis of Discrete Data », Pennsylvania State University, 2007.

Feature selection for logistic regression

In some circumstances, the goal of supervised learning is not to classify examples but rather to rank them in order to highlight the most interesting individuals. For instance, in a direct marketing campaign, we want to detect the customers who are the most likely to respond to the solicitation. In this context, the confusion matrix is not really suitable for the evaluation of the predictive model. It is more valuable to use another tool, which relates the proportion of respondents reached to the proportion of individuals contacted: this is the “lift curve” (“gain chart”).

In this tutorial, we use the binary logistic regression for the construction of the gain chart. We also show that variable selection is really useful when dealing with a large number of predictive variables.
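
The construction of a gain chart from scores can be sketched as follows: the individuals are sorted by decreasing score, and for each fraction of the ranked file we compute the share of all respondents found in it. The scores, labels and function name are invented for the illustration; in the tutorial the scores come from the logistic regression.

```python
def gain_chart(scores, labels, steps=10):
    """Points (fraction of individuals contacted, fraction of respondents reached).

    scores: estimated propensity to respond, higher means more likely.
    labels: 1 for an actual respondent, 0 otherwise.
    """
    ranked = [
        label
        for _, label in sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    ]
    n = len(ranked)
    total_positives = sum(ranked)
    points = [(0.0, 0.0)]
    for step in range(1, steps + 1):
        cutoff = (step * n) // steps
        points.append((cutoff / n, sum(ranked[:cutoff]) / total_positives))
    return points


# Made-up scores for 10 customers, 3 of whom actually responded.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
for contacted, reached in gain_chart(scores, labels, steps=5):
    print(contacted, reached)
```

A model without any predictive value would follow the diagonal, reaching about x% of the respondents when x% of the file is contacted; the further the curve rises above the diagonal, the more useful the scoring is.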

We use a real/realistic dataset from a website (see Reference below). It contains 2158 examples and 200 predictive attributes. The objective variable is a response variable indicating whether or not a consumer responded to a direct mail campaign for a specific product.

Keywords: scoring, marketing campaign, logistic regression, feature selection, backward, forward, gain chart, lift curve
Components: Supervised learning, Binary logistic regression, Select examples, Scoring, Lift curve, Forward-logit, Backward-logit
Tutorial: en_Tanagra_Variable_Selection_Binary_Logistic_Regression.pdf
Dataset: dataset_scoring_bank.xls
References:
Statistical Society of Canada, "Data Mining - Case Studies - 2000"

STEPDISC - Feature selection for LDA

In this tutorial, we use the stepwise discriminant analysis (STEPDISC) in order to determine relevant variables for a classification task.

STEPDISC (Stepwise Discriminant Analysis) is always associated with discriminant analysis because it relies on the same criterion, i.e. WILKS’ LAMBDA. So it is often presented as a method especially intended for discriminant analysis. In fact, it can be useful for various linear models because they are based upon the same representation bias (e.g. logistic regression, linear SVM, etc.). However, it is not really suited to non-linear models such as nearest neighbor or multilayer perceptron.

We implement the FORWARD and the BACKWARD strategies in TANAGRA. In the FORWARD approach, at each step, we determine the variable that contributes the most to the discrimination between the groups. We add this variable if its contribution is significant. The process stops when there is no attribute left to add to the model. In the BACKWARD approach, we begin with the complete model with all descriptors. We search for the least relevant variable. We remove this variable if its removal does not significantly deteriorate the discrimination between groups. The process stops when there is no variable left to remove.
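The FORWARD strategy can be sketched as a generic greedy loop. This is only a skeleton under stated assumptions: `contribution` is a hypothetical placeholder for the statistic derived from Wilks' lambda, and `threshold` stands in for the significance test; neither reproduces the actual computation done by TANAGRA.

```python
def forward_selection(candidates, contribution, threshold):
    """Greedy forward selection.

    contribution(selected, v): placeholder score of adding v to `selected`
    (in STEPDISC this would be derived from Wilks' lambda).
    threshold: minimal score for a variable to be considered significant.
    """
    selected = []
    remaining = list(candidates)
    while remaining:
        best = max(remaining, key=lambda v: contribution(selected, v))
        if contribution(selected, best) < threshold:
            break                      # no remaining variable is significant
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy contribution that ignores the variables already selected (hypothetical).
toy_scores = {"v1": 5.0, "v2": 3.0, "v3": 0.5}
chosen = forward_selection(toy_scores, lambda sel, v: toy_scores[v], threshold=1.0)
```

The BACKWARD strategy follows the same pattern in reverse: start from all variables and repeatedly drop the least relevant one while the loss stays non-significant.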

Keywords: stepdisc, feature selection, linear discriminant analysis
Components: Supervised Learning, Linear discriminant analysis, Bootstrap, Stepdisc
Tutorial: en_Tanagra_Stepdisc.pdf
Dataset: sonar_for_stepdisc.xls
Reference: SAS/STAT User’s Guide, « The STEPDISC Procedure »

Random forest

RANDOM FOREST combines an ensemble method (BAGGING) with a particular decision tree algorithm (“Random Tree” in TANAGRA).

In this tutorial, we use the HEART dataset (UCI Machine Learning Repository). We aim to predict heart disease from various descriptors such as the age of the patient, etc. We have already used this dataset in other tutorials (see http://data-mining-tutorials.blogspot.com/search?q=heart).

Keywords: random forest, ensemble methods, decision tree, cross-validation
Components: Bagging, Rnd Tree, Supervised Learning, Cross-validation, C4.5
Tutorial: en_Tanagra_Random_Forest.pdf
Dataset: dr_heart.bdm
Reference: L. Breiman, A. Cutler, « Random Forests ».

SVM using the LIBSVM library

The LIBSVM library contains various support vector algorithms for classification, regression, etc. The implementation is particularly efficient, especially in terms of processing time, as we will see below. Documentation is available on the authors' website.

We have compiled the C source code into a DLL to which we connect TANAGRA. For now, only C-SVC, a multi-class support vector machine for classification, is available. We will add the other components in the future.

Keywords: SVM, support vector machine, multi-class SVM
Components: Supervised Learning, C-SVC, Bootstrap
Tutorial: en_Tanagra_CSVC_LIBSVM.pdf
Dataset: Tanagra_Nipals.zip
Reference: C.C. Chang, C.J. Lin, « LIBSVM – A Library for Support Vector Machines »

Monday, November 3, 2008

Association rule learning using APRIORI PT

In this tutorial, we show how to build association rules on a large dataset using an external program.

Our implementation of A PRIORI is fast but needs a lot of memory, which limits its performance when we process a big dataset or generate numerous rules. I have discovered Christian BORGELT’s work: he proposes a very powerful association rule generator, which can handle huge datasets and is very fast.

To run his implementation, we integrated a new approach in TANAGRA: the launching and control of an external program. At execution time, we create a temporary file, which we pass to his program (APRIORI.EXE). Then the rules are automatically loaded and displayed.

Keywords: association rule, large dataset
Components: A priori PT
Tutorial: en_Tanagra_A_Priori_Prefix_Tree.pdf
Dataset: assoc_census.zip
Reference: C. Borgelt, "A priori - Association Rule Induction / Frequent Item Set Mining"

"Supervised" Association Rules

In many situations, we want to characterize a subset of the dataset. The GROUP CHARACTERIZATION component allows comparing several subgroups: it computes and compares descriptive statistics on the subsets. But this component performs a univariate analysis. It uses the attributes individually and does not analyze the possible interactions between two or more variables.

In this tutorial, we show a new component, SPV ASSOC TREE, that allows characterizing a subset of examples with a conjunction of variables. In fact, it is a “supervised-like” association rule algorithm where we define the consequent of the rule.

Keywords: association rules, clusters characterization, clustering
Components: Group Characterization, Spv Assoc Tree
Tutorial: en_Tanagra_Spv_Assoc_Tree.pdf
Dataset: vote.txt

Association rule learning from transaction dataset

Association rules can be built from an attribute-value dataset, which is re-coded as a binary table. In certain cases, we have a transaction dataset, which is already a binary table. It is not necessary to re-code it. How do we handle this kind of dataset?

TANAGRA can handle only attribute-value datasets: the absence of an item in a transaction is coded as 0, and other values are seen as a presence (value 1 if the file is correctly encoded).
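The coding convention described above can be pictured with a small sketch that turns each row of a binary table into the set of items it contains. The item names and the helper `row_to_itemset` are hypothetical; the sketch only mirrors the rule "0 means absent, any other value means present".

```python
def row_to_itemset(row, item_names):
    """Return the set of items present in one transaction row.

    A value of 0 means the item is absent; any other value counts as present.
    """
    return {name for name, value in zip(item_names, row) if value != 0}

item_names = ["bread", "milk", "butter"]   # hypothetical items
table = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
itemsets = [row_to_itemset(row, item_names) for row in table]
```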

Keywords: association rules, a priori algorithm
Components: A priori
Tutorial: enBinary_A_Priori.pdf
Dataset: transactions.bdm
References: P.N. Tan, M. Steinbach, V. Kumar, « Introduction to Data Mining », Addison Wesley, 2006; chapter 6, « Association analysis: Basic Concepts and Algorithms ».
Wikipedia - "Association rule learning"

Association rule learning from tabular dataset

In this tutorial, we learn association rules from a tabular dataset (examples x attributes).

We evaluate the influence of SUPPORT_MIN and CONFIANCE_MIN parameters on the number and the quality of computed rules.
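To make the role of the two thresholds concrete, here is a minimal sketch that computes the support and the confidence of a few candidate rules and filters them. It does not enumerate rules as A PRIORI does; the toy transactions, the candidate rules and the helper names are assumptions for the illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    s_ante = support(transactions, antecedent)
    if s_ante == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / s_ante

def keep_rules(transactions, rules, support_min, confidence_min):
    """Keep the rules that reach both the minimal support and confidence."""
    kept = []
    for antecedent, consequent in rules:
        s = support(transactions, set(antecedent) | set(consequent))
        c = confidence(transactions, antecedent, consequent)
        if s >= support_min and c >= confidence_min:
            kept.append((antecedent, consequent))
    return kept

# Toy data (hypothetical).
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
rules = [(("bread",), ("milk",)),
         (("butter",), ("bread",)),
         (("butter",), ("milk",))]
loose = keep_rules(transactions, rules, support_min=0.25, confidence_min=0.5)
strict = keep_rules(transactions, rules, support_min=0.5, confidence_min=0.9)
```

On this toy data, the loose thresholds keep the three rules while the strict ones keep only "butter => bread", which shows how raising the thresholds reduces the number of rules.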

Keywords: association rules, a priori algorithm
Components: A priori
Tutorial: enA_priori.pdf
Dataset: banque.bdm
References:
P.N. Tan, M. Steinbach, V. Kumar, « Introduction to Data Mining », Addison Wesley, 2006; chapter 6, « Association analysis: Basic Concepts and Algorithms ».
Wikipedia - "Association rule learning"

Decision lists and Decision Trees

Decision Lists were popular methods in machine learning publications in the 90’s. They produce an ordered list of production rules such as “IF condition_1 THEN conclusion_1 ELSE IF condition_2 THEN conclusion_2 ELSE IF…”.

Decision Lists and Decision Trees have a similar representation bias but not the same learning bias: DL use the “separate-and-conquer” principle instead of the “divide-and-conquer” principle. They can produce more specialized rules but they can also lead to overfitting: setting the right learning parameters is very important for decision list algorithms.

The algorithm that we have implemented in TANAGRA is inspired by CN2 (Clark & Niblett, ML-1989). We have introduced two main modifications: (1) we use a hill-climbing algorithm instead of a best-first search; (2) a new parameter, the minimal support of a rule, can be adjusted to avoid non-significant rules.
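How such an ordered list is applied to an example can be sketched in a few lines: the rules are tried in order and the first one whose condition holds gives the conclusion. The rules and attribute names below are hypothetical; they are not the rules learned in the tutorial.

```python
def decision_list_predict(rules, default, example):
    """Return the conclusion of the first rule whose condition is satisfied."""
    for condition, conclusion in rules:
        if condition(example):
            return conclusion
    return default                     # final ELSE branch

# Hypothetical rules on heart-like attributes.
rules = [
    (lambda e: e["chest_pain"] == "asymptomatic" and e["age"] > 55, "disease"),
    (lambda e: e["max_heart_rate"] > 150, "healthy"),
]

patient = {"chest_pain": "asymptomatic", "age": 60, "max_heart_rate": 160}
prediction = decision_list_predict(rules, "healthy", patient)
```

The order matters: this patient satisfies both conditions, but only the first rule fires.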

Keywords: CN2, decision list, decision tree, CART, discretization
Components: Supervised Learning, MDLPC, Decision List, C-RT, Bootstrap
Tutorial: en_Tanagra_DL.pdf
Dataset: dr_heart.bdm
Reference: P. Clark, « CN2 – Rule induction from examples ».

SVM - Support vector machine

SVM (Support Vector Machine) is a supervised learning algorithm which is well suited to high-dimensional problems. We implement John Platt's SMO (sequential minimal optimization) algorithm in Tanagra.

In this tutorial, we show how to implement SVM with TANAGRA. We compare the classifier performance with that of linear discriminant analysis (LDA). The error rate is measured using the bootstrap resampling approach.

Keywords: SVM, support vector machine, large margin machine, linear discriminant analysis, kernel function
Components: Supervised Learning, SVM, Linear discriminant analysis, Bootstrap
Tutorial: en_Tanagra_SVM.pdf
Dataset: sonar.xls
References: Wikipedia – « Support vector machine »

NIPALS for dimensionality reduction

In this tutorial, we use the NIPALS (Non-linear Iterative Partial Least Squares) algorithm for dimensionality reduction in a protein discrimination problem. The latent variables produced by NIPALS become the input variables of a nearest neighbor algorithm. The accuracy of the subsequent classifier is dramatically improved.

NIPALS is a possible implementation of singular value decomposition (SVD); it makes it possible to compute the factors (latent variables) of principal component analysis (PCA) without diagonalizing a correlation matrix. The computing time is reduced, especially when the dataset has many descriptors.
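The iterative principle can be sketched in plain Python for the first factor only. This is an illustrative sketch, not TANAGRA's implementation: it works on centered (not standardized) data, so it corresponds to a covariance-based PCA, and it assumes the data are not constant. Further factors would be obtained by subtracting the extracted component from the data (deflation) and repeating.

```python
import math

def nipals_first_component(X, tol=1e-12, max_iter=500):
    """Return (scores, loadings) of the first principal component of X."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    # Start from the column with the largest sum of squares.
    ss = [sum(r[j] ** 2 for r in Xc) for j in range(p)]
    t = [r[ss.index(max(ss))] for r in Xc]
    load = [0.0] * p
    for _ in range(max_iter):
        tt = sum(v * v for v in t)
        # Loadings: regress each column on the current scores, then normalize.
        load = [sum(Xc[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        norm = math.sqrt(sum(v * v for v in load))
        load = [v / norm for v in load]
        # Scores: project the rows on the loadings.
        t_new = [sum(Xc[i][j] * load[j] for j in range(p)) for i in range(n)]
        diff = sum((a - b) ** 2 for a, b in zip(t_new, t))
        t = t_new
        if diff <= tol * sum(v * v for v in t):
            break
    return t, load

# Toy data lying exactly on the line y = 2x.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
scores, loadings = nipals_first_component(X)
```

No covariance or correlation matrix is ever formed or diagonalized: each iteration only needs products between the data table and a vector.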

Keywords: NIPALS, principal component analysis, PCA, K-NN, nearest neighbor, bootstrap
Components: Supervised Learning, NIPALS, K-NN, Bootstrap
Tutorial: en_Tanagra_NIPALS.pdf
Dataset: Tanagra_Nipals.zip

Classifier comparison - Cross validation

Comparing accuracy is often used to select the most interesting classifier. To do this, we must produce a reliable error rate estimate.

Most of the time, we use a test set, a part of the dataset that is not used during the learning phase. We obtain an unbiased measure of the error rate. But this strategy is not feasible when we have a small dataset. Reserving a part of the dataset for the classifier evaluation penalizes the learning process.

In the context of a small dataset, it is more judicious to use resampling approaches such as cross validation. In this tutorial, we show how to implement cross validation when we compare two classifiers.
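The principle of cross validation can be sketched as follows: the examples are split into k folds, each fold is used once as a test set, and the errors are pooled. The helper names, the toy data and the two toy classifiers (1-nearest neighbor and majority class) are assumptions for this illustration; using the same seed gives both classifiers the same folds.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for a k-fold cross validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def cross_validation_error(X, y, fit, k=10, seed=0):
    """Pooled error rate; fit(X_train, y_train) must return a predictor."""
    errors = 0
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        errors += sum(1 for i in test if model(X[i]) != y[i])
    return errors / len(X)

def fit_1nn(X_train, y_train):
    """Toy 1-nearest-neighbor classifier (squared Euclidean distance)."""
    def predict(x):
        d = [sum((a - b) ** 2 for a, b in zip(x, xt)) for xt in X_train]
        return y_train[d.index(min(d))]
    return predict

def fit_majority(X_train, y_train):
    """Toy classifier that always predicts the majority class."""
    majority = max(sorted(set(y_train)), key=y_train.count)
    return lambda x: majority

X = [[0.0], [0.1], [0.2], [0.3], [1.0], [1.1], [1.2], [1.3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
err_1nn = cross_validation_error(X, y, fit_1nn, k=4)
err_majority = cross_validation_error(X, y, fit_majority, k=4)
```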

Keywords: cross validation, resampling method, classifier comparison, classifier assessment, nearest neighbor, k-nn, decision tree, id3
Components: Supervised Learning, K-NN, Cross-validation
Tutorial: en_dr_comparer_spv_learning.pdf
Dataset: dr_heart.bdm

Classifier comparison - Using a predefined test set

In order to evaluate a supervised learning algorithm, we often split the dataset into a training set, which is used in the training process, and a test set, which is used to obtain an unbiased error rate evaluation.

There are sampling components in TANAGRA which randomly subdivide the dataset, but in some circumstances, users want to use a predefined test set for their comparisons. It is especially useful when we want to compare the performances of classifiers implemented in different software packages.

Keywords: supervised learning, classifier comparison, train and test set, error rate, confusion matrix, linear discriminant analysis, support vector machine, nearest neighbor classifier
Components: Select examples, Supervised learning, Linear discriminant analysis, SVM, K-NN
Tutorial: Classifier comparison
Dataset: sonar_with_test_set.xls

ROC Curve for classifier comparison

ROC graphs make it possible to compare two or more supervised learning algorithms. They have properties that make them especially useful for domains with skewed class distributions and unequal classification error costs.

A ROC graph depicts the relative trade-offs between the true positive rate and the false positive rate. It needs a continuous output from the classifier, i.e. an estimate of an instance’s class membership probability. In fact, a “score”, a numeric value that represents the degree to which an instance is a member of a class, is sufficient.

AUC (Area Under Curve) reduces ROC performance to a single scalar value, which makes it possible to compare several classifiers: this area is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.

In this tutorial, we compare linear discriminant analysis (LDA) and support vector machine (SVM) on a heart disease detection problem.
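The probabilistic interpretation of the AUC leads to a very direct computation, sketched below: count the proportion of (positive, negative) pairs in which the positive instance gets the higher score, a tie counting for one half. The function name and the toy scores are hypothetical.

```python
def auc_by_ranking(scores, labels):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5            # ties count for one half
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
auc = auc_by_ranking(scores, labels)   # 8 winning pairs out of 9
```

This pairwise loop is quadratic in the number of examples; it is meant to show the definition, not to be efficient.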

Keywords: roc curve, roc graphs, auc, area under curve, classifier performance comparison, linear discriminant analysis, svm, support vector machine, scoring
Components: Sampling, 0_1_Binarize, Supervised Learning, Scoring, Roc curve, SVM, Linear discriminant analysis
Tutorial: en_Tanagra_Roc_Curve.pdf
Dataset: dr_heart.bdm
References:
T. Fawcett, « ROC Graphs: Notes and Practical Considerations for Researchers »
Wikipedia - "Receiver operating characteristic"

Sunday, November 2, 2008

Lift Curve - CoIL Challenge 2000

The detection of potential customers is an essential task for data miners. TANAGRA now has new tools to perform this kind of task.

We use the dataset of the CoIL Challenge 2000: targeting customers who will subscribe to a particular insurance policy.

There were 2 datasets: (1) A learning set with 5822 examples. The target attribute is CLASS; there are 85 other descriptors, and 43 of them are socio-demographic attributes derived from the zip code of the customer. (2) An unlabeled validation set of 4000 examples. We know that there are 238 positive examples in this dataset.

The challenge is to return to the organizers a file with 800 examples that contains the largest possible number of positive customers.

Keywords: scoring, marketing targeting, discriminant analysis, lift curve, gain chart
Components: Supervised learning, Linear discriminant analysis, Select examples, Scoring, Lift curve
Tutorial: en_Tanagra_Scoring.pdf
Dataset: tcidata.zip

Apply a classifier on a new dataset (Deployment)

How do we apply a classifier to a new dataset?

This functionality goes beyond the TANAGRA framework, which is intended only to evaluate and compare data mining algorithms. But users often ask for it; in this tutorial we show how to proceed.

Data preparation is an essential step. Indeed, TANAGRA can handle only one data source. It is in theory not possible to manipulate two datasets, and therefore to apply a classifier to a new dataset. The trick is in the dataset preparation.

Keywords: deployment, CART algorithm, dataset exportation
Components: Supervised learning, C-RT, Select examples, View dataset, Export dataset
Tutorial: en_Tanagra_Deployment.pdf
Dataset: tanagra_deployment_files.zip

Feature selection using MIFS algorithm

Variable selection is a crucial step of the Data Mining process. In a supervised learning context, the detection of the relevant variables is essential. Furthermore, according to the Occam's razor principle, we should always prefer the simplest model.

This tutorial describes the implementation of the component MIFS (Battiti, 1994) in a naive bayes learning context. It is also interesting because the selection phase is preceded by a feature transformation step where continuous descriptors are discretized using the MDLPC algorithm (Fayyad and Irani, 1993).

Keywords: feature selection, discretization, conditional independence model
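The MIFS criterion can be sketched for discrete (already discretized) features: at each step, we pick the feature that maximizes its mutual information with the class minus beta times the sum of its mutual information with the features already selected. The helper names, the value of `beta` and the toy data are assumptions for this example and do not reproduce the TANAGRA component.

```python
import math
from collections import Counter

def mutual_information(a, b):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def mifs(features, target, n_select, beta=0.5):
    """Greedy MIFS selection; features maps a name to a list of values."""
    selected = []
    remaining = list(features)

    def score(name):
        relevance = mutual_information(features[name], target)
        redundancy = sum(mutual_information(features[name], features[s])
                         for s in selected)
        return relevance - beta * redundancy

    while remaining and len(selected) < n_select:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: "b" duplicates "a"; the class depends on both "a" and "c".
features = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": [0, 0, 1, 1, 0, 0, 1, 1],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
}
target = [2 * x + y for x, y in zip(features["a"], features["c"])]
chosen = mifs(features, target, n_select=2)
```

The redundant copy "b" is penalized once "a" is selected, so the second choice is "c".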
Components: Supervised learning, Naive Bayes, MDLPC, MIFS filtering, Cross validation
Tutorial: enFeature_Selection_For_Naive_Bayes.pdf
Dataset: iris.bdm
References:
R. Battiti, « Using mutual information for selecting features in supervised neural net learning », IEEE Transactions on Neural Networks, 5, pp.537-550, 1994.
U. Fayyad and K. Irani, « Multi-interval discretization of continuous-valued attributes for classification learning », in Proc. of IJCAI, pp.1022-1027, 1993.

Discretization and Naive Bayes Classifier

Build a naive bayes classifier on continuous descriptors. The TANAGRA implementation of the naive bayes classifier handles only discrete attributes; we need to discretize continuous descriptors before using them.

Because we are in a supervised learning context, we must use a supervised discretization algorithm such as Fayyad and Irani’s state-of-the-art MDLPC algorithm.

Keywords: contextual discretization, naive bayes classifier, cross-validation
Components: MDLPC, Supervised Learning, Naive bayes, Cross-validation
Tutorial: enSupervisedDiscretisation.pdf
Dataset: breast.bdm
References:
U. Fayyad and K. Irani, « Multi-interval discretization of continuous-valued attributes for classification learning », in Proc. of IJCAI, pp.1022-1027, 1993.

Decision Tree - ID3 algorithm

This tutorial shows how to implement the ID3 induction tree algorithm (supervised learning) on a dataset. We analyze the famous "breast cancer wisconsin" dataset.
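ID3 chooses the split attribute with an entropy-based criterion, the information gain (Quinlan, 1986). The sketch below computes it for a discrete attribute; the function names and the toy values are hypothetical and are not taken from the breast cancer dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction obtained by splitting on a discrete attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy example (hypothetical values).
labels = ["benign", "benign", "malignant", "malignant"]
informative = ["low", "low", "high", "high"]
uninformative = ["a", "b", "a", "b"]
gain_good = information_gain(informative, labels)
gain_bad = information_gain(uninformative, labels)
```

An attribute that separates the classes perfectly recovers the whole entropy of the class, while an attribute unrelated to the class yields a gain of zero.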

Keywords: decision tree, classification tree, ID3, supervised learning
Components: Supervised Learning, ID3
Tutorial: enDecisionTree.pdf
Dataset: breast.bdm
References:
R. Quinlan, " Induction of Decision Trees ", Machine Learning, 1, 81-106, 1986.

Saving and loading a sub-diagram

We can save a part of the stream diagram. The goal is to perform some sequence of treatments on several similar datasets.

Keywords: save a diagram, copy paste, classifier comparison, supervised learning, cross validation, naive bayes classifier, feature selection, fcbf
Components: Supervised learning, Naive Bayes, Cross validation
Tutorial: en_Tanagra_Diagram_Save_Subdiagram.pdf
Dataset: congressvote_zoo.zip

Saturday, November 1, 2008

Multilayer perceptron - Software comparison

A Multilayer Perceptron for a classification task (neural network): comparison of TANAGRA, SIPINA and WEKA.

When we want to train a neural network, we have to follow these steps:
· Import the dataset;
· Select the discrete target attribute and the continuous input attributes;
· Split the dataset into learning and test set;
· Choose and parameterize the learning algorithm;
· Execute the learning process;
· Evaluate the performance of the model on the test set.

Keywords: neural network, multilayer perceptron, classifier assessment, sipina, weka
Components: Sampling, Supervised learning, Log-Reg TRIRLS, Linear Discriminant Analysis, C-SVC, Test
Tutorial: en_Tanagra_TSW_MLP.pdf
Dataset: ionosphere.arff
References:
Wikipedia - "Neural network"

Association rule learning - Software comparison

Computing association rules with TANAGRA, ORANGE and WEKA.

We must follow these steps if we want to compute association rules from a dataset:
• Import the dataset;
• Select the descriptors;
• Set the parameters of the association rule algorithm i.e. the minimal support and the minimal confidence;
• Execute the algorithm and visualize the rules.

Our three packages use attribute-based datasets. Each attribute-value pair becomes an item which is used for generating rules.

Keywords: association rules, a priori
Components: A priori
Tutorial: en_Tanagra_TOW_Association_Rule.pdf
Dataset: vote.txt
References:
Wikipedia - "Association rule learning"

Interactive tree builder

One of the main advantages of decision trees is the possibility, for the users, to interactively build the prediction model. In this tutorial, we show how, using SIPINA and ORANGE, we build and manually modify a decision tree; in particular, we select the split attribute and prune the tree.

SIPINA is one of my old projects. It was very useful but it had some limitations, which have been rectified in TANAGRA: it was intended only for supervised learning, and we cannot define and save sequences of treatments in a diagram. Nevertheless, I still use this version for my courses, in particular for its functionalities for the interactive construction of decision trees.

SIPINA uses a graphical representation of the tree; ORANGE uses a standard treeview component. We will see that they offer very similar functionalities and provide the same results.

Keywords: decision tree, classification tree, interactive analysis, classifier assessment, orange
Tutorial: en_Tanagra_Interactive_Tree_Builder.pdf
Dataset: iris_tree.txt
References:
Orange - "Interactive tree builder"