Thursday, December 24, 2009

VARIMAX rotation in Principal Component Analysis

A VARIMAX rotation is a change of coordinates used in principal component analysis (PCA) that maximizes the sum of the variances of the squared loadings. Thus, the coefficients (squared correlations with the factors) will be either large or near zero, with few intermediate values.

The goal is to associate each variable with at most one factor, which simplifies the interpretation of the PCA results. Each variable is then tied to a single factor, and the variables are split (as far as possible) into disjoint sets.

In this tutorial, we show how to perform this kind of rotation from the results of a standard PCA in Tanagra.
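As a point of comparison (not the Tanagra workflow described here), a minimal R sketch using the base functions prcomp() and varimax() on a built-in dataset:

pca <- prcomp(USArrests, scale. = TRUE)                    # standardized PCA
loadings <- pca$rotation[, 1:2] %*% diag(pca$sdev[1:2])    # variable-factor correlations
rot <- varimax(loadings)                                   # VARIMAX rotation of the loadings
round(unclass(rot$loadings), 3)                            # polarized loadings: large or near zero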

Keywords: PCA, principal component analysis, VARIMAX, QUARTIMAX
Components: Principal Component Analysis, Factor Rotation
Tutorial: en_Tanagra_Pca_Varimax.pdf 
Dataset: crime_dataset_from_DASL.xls
References:
Tanagra, "New features for PCA in Tanagra"
Tanagra, "Principal Component Analysis (PCA)"
Wikipedia, "Varimax rotation"
H. Abdi, "Factor rotations in Factor Analyses"

Sunday, December 20, 2009

Kruskal–Wallis one-way analysis of variance

Tests for comparing populations try to determine whether K (K >= 2) samples come from the same underlying population with respect to a dependent variable (X). In other words, we try to determine whether the underlying distribution of X is the same in every group.

We speak of nonparametric tests when we make no assumption about the shape of the distribution of the dependent variable. They are considered "distribution free" methods, as opposed to parametric approaches.

In this tutorial, we implement various tests for differences in location. The Kruskal-Wallis test is certainly the most widely used when we want to determine whether the scores among groups are stochastically the same, but other tests exist, and we compare the results they provide. We complete the analysis with multiple comparisons in order to identify the groups that differ significantly from each other.
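For readers who prefer R, a minimal base-R sketch of the same kind of analysis on artificial data (the tutorial itself works with Tanagra on the wine dataset):

set.seed(1)
score <- c(rnorm(10, 5), rnorm(10, 6), rnorm(10, 6.5))          # artificial scores
judge <- factor(rep(c("A", "B", "C"), each = 10))               # three groups
kruskal.test(score ~ judge)                                     # global Kruskal-Wallis test
pairwise.wilcox.test(score, judge, p.adjust.method = "holm")    # multiple comparisons between groups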

Keywords: non parametric test, independent samples, Kruskal-Wallis, Van der Waerden, Fisher-Yates-Terry-Hoeffding, median test, tests for differences in location
Components: KRUSKAL-WALLIS 1-WAY ANOVA, MEDIAN TEST, VAN DER WAERDEN 1-WAY ANOVA, FYTH 1-WAY ANOVA
Tutorial: en_Tanagra_Nonparametric_Test_KW_and_related.pdf
Dataset: wine_evaluation_nonparametric.xls
References:
R. Lowry, « Concepts and Applications of Inferential Statistics », SubChapter 14a. The Kruskal-Wallis Test for 3 or More Independent Samples.
Wikipedia. Kruskal–Wallis one-way analysis of variance.

Thursday, December 17, 2009

Tests for differences in scale

Parametric and non parametric tests for differences in scale.

Tests of equal variability (or dispersion, or scale, or simply variance) are often presented as a preliminary step before the comparison of means, in order to verify the homoscedasticity assumption. But this is not their only purpose: comparing dispersions can be an end in itself. For example, suppose we wish to compare the performance of two heating systems. The average temperature at the center of the room is the same; however, we may want to compare how the heat is spread across the different parts of the room.

Parametric tests are based primarily on the Gaussian assumption, so the problem becomes a test of homogeneity of variance. We highlight the Levene test in this tutorial; other tests exist (the Bartlett test, for instance) and we mention them as well.

When the normality assumption is questionable, when the sample size is small, or when the variable is ordinal rather than continuous, it is more appropriate to use nonparametric tests. These are called tests for equality of scale or dispersion; the procedures are in fact not based on estimated variances. We use well-known techniques such as the Ansari-Bradley, Mood and Klotz tests. Being nonparametric, they have a broader scope. Some of these tests have a drawback: they are not applicable when the conditional distributions do not share the same parameter of central tendency (usually the median, although the values can be adjusted by centering them on their group median).

In this tutorial, we show how to implement these various tests with Tanagra.
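As a rough base-R counterpart of these tests (illustrative data, not the tutorial's Tanagra workflow):

set.seed(1)
x <- rnorm(30, mean = 20, sd = 1)      # heating system 1
y <- rnorm(30, mean = 20, sd = 2)      # heating system 2: same mean, wider spread
bartlett.test(list(x, y))              # parametric test, assumes normality
ansari.test(x, y)                      # nonparametric Ansari-Bradley test
mood.test(x, y)                        # nonparametric Mood test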

Keywords: parametric test, non parametric test, independent samples, Levene test, Bartlett test, Brown-Forsythe test, Mood test, Klotz test, Ansari-Bradley test
Components: LEVENE’S TEST, ANSARI-BRADLEY SCALE TEST, MOOD SCALE TEST, KLOTZ SCALE TEST
Tutorial: en_Tanagra_Nonparametric_Test_for_Scale_Differences.pdf
Dataset: tests_for_scale_differences.xls
References:
NIST, "Quantitative techniques", section 1.3.5 - http://www.itl.nist.gov/div898/handbook/eda/section3/eda35.htm

Wednesday, December 9, 2009

Outliers and influential points in regression

The analysis of outliers and influential points is an important step of regression diagnostics. The goal is to detect (1) points that are very different from the others (outliers), i.e. points that do not seem to belong to the analyzed population, and (2) points whose removal leads to a different model (influential points). The distinction between these two kinds of points is not always obvious.

In this tutorial, we implement several indicators for the analysis of outliers and influential points. To avoid confusion about their definitions (some indicators are calculated differently from one tool to another), we compare our results with state-of-the-art tools such as SAS and R. First, we reproduce the results described in the SAS documentation; then we describe the process and the results with Tanagra and R. In conclusion, we note that these tools give the same results.
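As an illustration of the indicators involved, a base-R sketch on a built-in dataset (definitions may differ slightly from one tool to another, as noted above):

fit <- lm(dist ~ speed, data = cars)        # simple linear regression
measures <- data.frame(
  rstandard = rstandard(fit),               # standardized residuals
  rstudent  = rstudent(fit),                # studentized residuals
  leverage  = hatvalues(fit),               # leverage
  dffits    = dffits(fit),
  cook      = cooks.distance(fit),
  covratio  = covratio(fit))
head(round(measures, 3))
head(dfbetas(fit))                          # influence of each point on each coefficient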

Keywords: linear regression, outliers, influential points, standardized residuals, studentized residuals, leverage, dffits, cook's distance, covratio, dfbetas, R software
Components: Multiple linear regression, Outlier detection, DfBetas
Tutorial: en_Tanagra_Outlier_Influential_Points_for_Regression.pdf
Dataset: USPopulation.xls
References:
SAS STAT User’s Guide, « The REG Procedure – Predicted and Residual Values »

Monday, December 7, 2009

Tests for comparing two related samples

Dependent samples, also called related samples or correlated samples, occur when the response of the nth person in the second sample is partly a function of the response of the nth person in the first sample. There are several common forms of sample dependency: (1) before-after and other studies in which the same people are surveyed at different points in time, including panel studies; (2) matched-pairs studies in which each subject is paired with a subject in a comparison group on the basis of matching factors (e.g. age, sex, income); (3) pairs that are simply inherent in the situation we want to analyze. For instance, if we compare the time spent watching television by the man and the woman within a couple, the blocks are naturally the households: the men and the women should not be treated as independent observations.

The aim of tests for related samples is to exclude the within-group variation from the analysis: the differences are computed within each pair of subjects. In this tutorial, we show how to implement three tests for two related samples. Two of them are nonparametric (the sign test and the Wilcoxon matched-pairs signed-rank test); the last one is the parametric t-test for related samples.
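A base-R sketch of the three tests on small illustrative paired data:

before <- c(12, 15, 11, 14, 13, 16, 10, 12)
after  <- c(14, 16, 12, 15, 15, 17, 11, 14)
t.test(after, before, paired = TRUE)                    # parametric paired t-test
wilcox.test(after, before, paired = TRUE)               # Wilcoxon matched-pairs signed-rank test
binom.test(sum(after > before), sum(after != before))   # sign test on the signs of the differences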

Keywords: parametric test, non-parametric test, paired samples, sign test, wilcoxon signed rank test, paired samples t-test, normality test
Components: SIGN TEST, WILCOXON SIGNED RANK TEST, PAIRED T-TEST, FORMULA, NORMALITY TEST
Tutorial: en_Tanagra_Nonparametric_Test_for_Two_Related_Samples.pdf
Dataset: comparison_2_related_samples.xls
References:
R. Lowry, « Concepts and Applications of Inferential Statistics », SubChapter 12a. The Wilcoxon Signed-Rank Test.

Wednesday, December 2, 2009

Multivariate tests for comparing populations

Multivariate parametric hypothesis testing for comparing populations.

A multivariate test for comparing populations tries to determine whether K (K >= 2) samples come from the same underlying population with respect to a set of variables of interest (X1,…,Xp).

We speak of a parametric test when we assume that the data come from a given type of probability distribution; the inference then relies on the parameters of that distribution. For instance, if we assume that the data are drawn from a multivariate Gaussian distribution, the hypothesis testing relies on the mean vector or on the covariance matrix.
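For illustration, a base-R sketch of a one-way MANOVA with Wilks' Lambda (Hotelling's T2 is the two-group special case); the tutorial itself uses the Tanagra components listed below:

fit <- manova(cbind(Sepal.Length, Sepal.Width, Petal.Length) ~ Species, data = iris)
summary(fit, test = "Wilks")     # Wilks' Lambda test on the mean vectors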

Keywords: Hotelling's T2, Wilks' Lambda, Box’s M test, Bartlett's test, mean vector, covariance matrix, MANOVA
Components: UNIVARIATE CONTINUOUS STAT, HOTELLING’S T2, HOTELLING’S T2 HETEROSCEDASTIC, BOX’S M TEST, ONE-WAY MANOVA
Tutorial: en_Tanagra_Multivariate_Parametric_Tests.pdf
Dataset: credit_approval.xls
References:
S. Rathburn, A. Wiesner, "STAT 505: Applied Multivariate Statistical Analysis", The Pennsylvania State University.

Monday, November 30, 2009

Parametric tests for comparing populations

Parametric hypothesis testing for comparison of two or more populations. Independent and dependent samples.

Tests for comparing populations try to determine whether K (K >= 2) samples come from the same underlying population with respect to a variable of interest (X). We speak of a parametric test when we assume that the data come from a given type of probability distribution; the inference then relies on the parameters of the distribution. For instance, if we assume that the distribution of the data is Gaussian, the hypothesis testing relies on the mean or on the variance.

We handle univariate tests in this tutorial, i.e. there is only one variable of interest. When we want to analyze several variables simultaneously, we speak of multivariate tests.
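Base-R analogues of some of the tests listed below, on a built-in dataset (illustrative only):

bartlett.test(weight ~ feed, data = chickwts)    # homogeneity of variances
summary(aov(weight ~ feed, data = chickwts))     # classical one-way ANOVA
oneway.test(weight ~ feed, data = chickwts)      # Welch ANOVA (unequal variances)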

Keywords: t-test, F-Test, Bartlett's test, Levene's test, Brown-Forsythe's test, independent samples, dependent samples, paired samples, matched-pairs samples, anova, welch's anova, randomized complete blocks
Components: MORE UNIVARIATE CONT STAT, NORMALITY TEST, T-TEST, T-TEST UNEQUAL VARIANCE, ONE-WAY ANOVA, WELCH ANOVA, FISHER’S TEST, BARTLETT’S TEST, LEVENE’S TEST, BROWN-FORSYTHE TEST, PAIRED T-TEST, PAIRED V-TEST, ANOVA RANDOMIZED BLOCKS
Tutorial: en_Tanagra_Univariate_Parametric_Tests.pdf
Dataset: credit_approval.xls
References:
NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/ (Chapter 7, Product and Process Comparisons)

Thursday, November 26, 2009

Three curves for classifier assessment

Evaluation of classifiers is an important step of the supervised learning process: we want to measure the performance of the classifier. On the one hand, we have the confusion matrix and its associated indicators, very popular in academic publications. On the other hand, in real applications, users prefer curves that may seem mysterious to people outside the domain (e.g. the ROC curve for epidemiologists, the gain chart or cumulative lift curve in marketing, the precision-recall curve in information retrieval, etc.).

In this tutorial, we first give the details of the calculation of these curves by building them by hand in a spreadsheet. Then we use Tanagra 1.4.33 and R 2.9.2 to obtain them. We use these curves to compare the performance of logistic regression and of a support vector machine (radial basis function kernel).
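To make the "by hand" construction concrete, here is a small R sketch of a gain chart from logistic regression scores (the dataset and variables are illustrative, not those of the tutorial):

fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)   # logistic regression
score <- predict(fit, type = "response")                     # scores in [0, 1]
pos <- mtcars$am[order(score, decreasing = TRUE)]            # positives sorted by decreasing score
gain <- cumsum(pos) / sum(pos)                               # % of positives recovered
size <- seq_along(pos) / length(pos)                         # % of population targeted
plot(size, gain, type = "l", xlab = "target size", ylab = "gain")
abline(0, 1, lty = 2)                                        # random targeting baseline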

Keywords: roc curve, gain chart, precision recall curve, lift curve, logistic regression, support vector machine, svm, radial basis function kernel, rbf kernel, e1071 package, R software, glm
Components: DISCRETE SELECT EXAMPLES, BINARY LOGISTIC REGRESSION, SCORING, C-SVC, ROC CURVE, LIFT CURVE, PRECISION-RECALL CURVE
Tutorial: en_Tanagra_Spv_Learning_Curves.pdf
Dataset: heart_disease_for_curves.zip

Sunday, November 22, 2009

Tanagra - Version 1.4.34

A component for the induction of predictive rules (RULE INDUCTION) has been added under the "Supervised Learning" tab. Its use is described in a tutorial available online (to be translated soon).

The DECISION LIST component has been improved: we changed the test performed during the pre-pruning process. The formula is described in the tutorial mentioned above.

The SAMPLING and STRATIFIED SAMPLING components (Instance Selection tab) have been slightly modified. It is now possible to set the seed of the pseudorandom number generator ourselves.

Following a remark from Anne Viallefont, the calculation of degrees of freedom in tests on contingency tables is now more generic. Indeed, the calculation was wrong when the database was filtered and some margins (row or column) contained a count equal to zero. Anne, thank you for this information. More generally, thank you to everyone who sends me comments. Programming has always been a kind of leisure for me; the real work starts when it is necessary to check the results, compare them with the available references, cross-check them with other data mining tools (free or not), and understand the possible differences. At this step, your help is really valuable.

Monday, November 9, 2009

Handling Missing values in SIPINA

Dealing with missing values is a difficult problem. The programming itself is not an issue: we simply represent a missing value with a specific code. In contrast, handling missing values before or during the data analysis is much more complicated.

Various techniques are available for handling missing values in SIPINA. In this tutorial, we show how to implement them and what their consequences are in the decision tree learning context (C4.5 algorithm; Quinlan, 1993).
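Two of the elementary strategies studied here, listwise deletion and mean imputation, can be sketched in a few lines of base R on a built-in dataset with missing values:

df <- airquality                                                      # base dataset containing NAs
complete <- na.omit(df)                                               # listwise (casewise) deletion
imputed <- df
imputed$Ozone[is.na(imputed$Ozone)] <- mean(df$Ozone, na.rm = TRUE)   # mean imputation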

Keywords: missing value, missing data, listwise deletion, casewise deletion, data imputation, C4.5, decision tree
Tutorial: en_Sipina_Missing_Data.pdf
Dataset: ronflement_missing_data.zip
References:
P.D. Allison, « Missing Data », in Quantitative Applications in the Social Sciences Series n°136, Sage University Paper, 2002.
J. Bernier, D. Haziza, K. Nobrega, P. Whitridge, « Handling Missing Data – Case Study », Statistical Society of Canada.
D. Garson, "Data Imputation for Missing Values"

Wednesday, November 4, 2009

Model deployment with Sipina

Model deployment is the last step of the data mining process. In its simplest form, in a supervised learning task, it consists in applying a predictive model to unlabeled cases.

Applying the model to unseen cases is a very useful functionality, but it is even more interesting if we can also state its accuracy. Indeed, a misclassification can have dramatic consequences. We must measure the risk we take when we make decisions based on a predictive model: an indication of the classifier's performance is important when deciding whether or not to deploy it.

In this tutorial, we show how to apply a classifier to an unlabeled sample with Sipina. We also show how to estimate the generalization error rate using a resampling scheme such as the bootstrap.
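The deployment step itself is generic; a minimal R sketch with linear discriminant analysis (MASS package), where the unlabeled sample is simulated by dropping the class column:

library(MASS)
train     <- iris[1:120, ]
unlabeled <- iris[121:150, names(iris) != "Species"]   # cases without the class attribute
fit  <- lda(Species ~ ., data = train)
pred <- predict(fit, unlabeled)$class                  # predicted labels to deploy
head(pred)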

Keywords: model deployment, unseen cases, unlabeled instances, decision tree, sipina, linear discriminant analysis

Tuesday, November 3, 2009

Sipina - Supported file format

Data access is the first step of the data mining process and a crucial one; it is one of the main criteria used to assess the quality of a tool. If we are not able to load a dataset, we cannot perform any kind of analysis and the software is unusable. If data access is not easy and requires complicated operations, we will devote less time to the other steps of the data exploration.

The first goal of this tutorial is to describe the various file formats supported by Sipina. Some of the solutions are described in more depth in other tutorials; we indicate the appropriate reference in those cases. The second goal is to describe the behavior of these formats when we handle a large dataset with 4,817,099 instances and 42 variables.

Last, we learn a decision tree on this dataset in order to evaluate the behavior of Sipina when we process a large data file.

Keywords: file format, data file importation, decision tree, large dataset, csv, arff, fdm, fdz, zdm
Tutorial: en_Sipina_File_Format.pdf
Dataset: weather.txt and kdd-cup-discretized-descriptors.txt.zip

Saturday, October 31, 2009

Importing Weka file (.arff) into Sipina

WEKA is a very popular data mining tool. It supplies a very large set of machine learning methods. WEKA can handle various file formats, but it has a native format (.ARFF), a text file with additional specifications.

The text file format is very simple and very easy to manipulate but, on the other hand, processing this kind of file is often slower than with a binary format. When we deal with a moderately sized file, the text format is efficient enough and the differences in processing time are not noticeable.

In this tutorial, we show how to import the ARFF file format into Sipina. We subdivide the dataset into train and test samples, then we learn and assess a decision tree.

Keywords: decision tree, c4.5, file format, data file importation, weka, arff
Tutorial: en_sipina_weka_file_format.pdf
Dataset: ionosphere.arff
References:
M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutmann, I. Witten, "The Weka Data Mining Software: An Update", SIGKDD Explorations, Vol. 11, Issue 1, 2009.

Wednesday, October 28, 2009

Local sampling for decision tree learning

During the decision tree learning process, the algorithm selects the best variable according to a goodness-of-fit measure when it tries to split a node. The calculation can take a long time, particularly with continuous descriptors, for which the optimal cut point must also be detected.

For all the decision tree algorithms, Sipina can use a local sampling option when it searches for the best splitting attribute on a node. The idea is the following: on a node, it draws a random sample of size n, and all the computations are then made on this sample. Of course, if n is larger than the number of examples present on the node, Sipina uses all the available examples; this occurs when the tree is very large, with a high number of nodes.

We described this approach in a paper (Chauchat and Rakotomalala, IFCS-2000). In this tutorial, we show how to implement it with Sipina, and we observe that sampling at each node dramatically reduces the execution time without loss of accuracy.

We use a version of the WAVEFORM dataset with 21 continuous descriptors and 2,000,000 instances. We obtain the tree in 3 seconds on our computer.

Keywords: decision tree, sampling, large dataset
Components: SAMPLING, ID3, TEST
Tutorial: en_Sipina_Sampling.pdf
Dataset: wave2M.zip
References:
J.H. Chauchat, R. Rakotomalala, « A new sampling strategy for building decision trees from large databases », Proc. of IFCS-2000, pp. 199-204, 2000.

Saturday, October 3, 2009

Tanagra - Version 1.4.33

Several logistic regression diagnostics and evaluation tools have been implemented; one of them (the reliability diagram) can be applied to any supervised learning method:

1. The estimated covariance matrix
2. Hosmer-Lemeshow test
3. Reliability diagram (also called a calibration plot)
4. Analysis of residuals, outliers and influential points (Pearson residuals, deviance residuals, difchisq, difdev, leverage, Cook's distance, dfbeta, dfbetas)

A tutorial describing the utilization of these tools will be available soon.

Monday, September 28, 2009

Using batch mode for Tanagra

For large simulations, it is more convenient to use the BATCH mode capabilities of Tanagra than to open an interactive session. This is the case, for instance, when we compare the performance of various algorithms on the same dataset, when we try to find the best parameters of a learning method automatically, or when we repeat the same treatment on different datasets. In these contexts, it is more convenient to save the diagrams in text mode (.TDM file format); they are then easier to handle outside Tanagra, with a text editor for instance.

In this tutorial, we want to compare the performance of the naive Bayes classifier with and without a feature selection step. We know that the naive Bayes classifier is highly sensitive to irrelevant features. The goal of this tutorial is to evaluate the efficiency of the FCBF feature selection method in this context.

Keywords: batch mode, supervised learning, naive bayes, feature selection, experiments
Components: NAIVE BAYES, FCBF, CROSS VALIDATION
Tutorial: english_dr_utiliser_tanagra_en_mode_batch.pdf
Dataset: tanagra_batch_execution.zip

Wednesday, July 15, 2009

Nonparametric tests for groups comparison - Independent samples - Differences in location

The aim of a homogeneity test (or test for differences between groups) is to check whether K (K >= 2) samples are drawn from the same population with respect to a variable of interest. In other words, we check whether the probability distribution is the same in each sample.

Nonparametric tests make no assumption about the distribution of the data; they are also called "distribution free" tests.

In this tutorial, we show how to implement nonparametric homogeneity tests for differences in location for K = 2 populations, i.e. the distributions of the populations are the same except for a shift in location (central tendency). The Kolmogorov-Smirnov test is the most general one: it detects any kind of difference between the cumulative distribution functions (CDF). Afterwards, we can implement other tests that characterize the difference more precisely. The Wilcoxon-Mann-Whitney test is certainly the most popular one; we will see in this tutorial that other tests can also be implemented.

Some of the tests introduced here are also usable when the number of groups is greater than 2 (K > 2).
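Base-R counterparts of the two main tests discussed, on artificial data with a location shift:

set.seed(1)
x <- rnorm(25, mean = 10)
y <- rnorm(25, mean = 11)      # same shape, shifted location
ks.test(x, y)                  # Kolmogorov-Smirnov: any difference between the CDFs
wilcox.test(x, y)              # Wilcoxon-Mann-Whitney: difference in location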

Keywords: nonparametric test, Kolmogorov-Smirnov test, Wilcoxon-Mann-Whitney test, Van der Waerden test, Fisher-Yates-Terry-Hoeffding test, median test, location model
Components: FYTH 1-WAY ANOVA, K-S 2-SAMPLE TEST, MANN-WHITNEY COMPARISON, MEDIAN TEST, VAN DER WAERDEN 1-WAY ANOVA
Tutorial: en_Tanagra_Nonparametric_Test_MW_and_related.pdf
Dataset: machine_packs_cartons.xls
References:
R. Rakotomalala, « Comparaison de populations. Tests non paramétriques », Université Lyon 2 (in French).
Wikipedia, « Non-parametric statistics ».

Thursday, July 9, 2009

Resampling methods for error estimation

The ability to predict correctly is one of the most important criteria for evaluating classifiers in supervised learning. The preferred indicator is the error rate (1 - accuracy rate), which estimates the probability of misclassifying a case. In most situations we do not know the true error rate, because we have neither the whole population nor the probability distribution of the data, so we need to compute an estimate from the available dataset.

In the small-sample context, it is preferable to use resampling approaches for error rate estimation. In this tutorial, we study the behavior of cross-validation (cv), leave-one-out (loo) and the bootstrap. All of them are based on repeated train-test runs, but in different configurations. We keep in mind that the aim is to evaluate the error rate of the classifier built on the whole sample; the intermediate classifiers computed in each run are not really interesting in themselves, which is why they are rarely provided by data mining tools.

The main supervised learning method used is the linear discriminant analysis (LDA). We will see at the end of this tutorial that the behavior observed for this learning approach is not the same if we use another approach such as a decision tree learner (C4.5).
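A minimal R sketch of the cross-validation scheme with linear discriminant analysis (MASS package), on a built-in dataset:

library(MASS)
set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(iris)))         # 10-fold partition
err <- sapply(1:10, function(k) {
  fit  <- lda(Species ~ ., data = iris[folds != k, ])       # learn on 9 folds
  pred <- predict(fit, iris[folds == k, ])$class            # predict the held-out fold
  mean(pred != iris$Species[folds == k])                    # fold error rate
})
mean(err)                                                   # cross-validation error estimate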

Keywords: resampling, generalization error rate, cross validation, bootstrap, leave one out, linear discriminant analysis, C4.5
Components: Supervised Learning, Cross-validation, Bootstrap, Test, Leave-one-out, Linear discriminant analysis, C4.5
Tutorial: en_Tanagra_Resampling_Error_Estimation.pdf
Dataset: wave_ab_err_rate.zip
Reference:
"What are cross validation and bootstrapping?"

Sunday, July 5, 2009

Implementing SVM on large dataset

Support vector machines (SVM) are a set of related supervised learning methods used for classification and regression. Our aim is to compare various free implementations of SVM in terms of accuracy and computation time. Indeed, because of the heuristic nature of the algorithm, we can obtain different results with different tools on the same dataset. Publications describing the performance of SVM should therefore specify not only the parameters of the algorithm but also the tool used, since the latter can influence the results.

SVM is effective in domains with a very high number of predictive variables, when the ratio between the number of variables and the number of observations is unfavorable. This tutorial deals with a domain that is particularly favorable to SVM: we want to discriminate two families of proteins from their amino-acid description, using sequences of 4 characters (4-grams) as descriptors. We thus have a large number of descriptors (31,809) compared with the number of examples (135 instances).

We compare Tanagra 1.4.27, Orange 1.0b2, Rapidminer Community Edition 4.2 and Weka 3.5.6.
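For reference, a minimal sketch with the e1071 package (which, like Tanagra's C-SVC, wraps the LIBSVM library); the dataset and parameters are illustrative:

library(e1071)
set.seed(1)
idx  <- sample(nrow(iris), 100)                                        # train sample
fit  <- svm(Species ~ ., data = iris[idx, ], kernel = "linear", cost = 1)
pred <- predict(fit, iris[-idx, ])
table(predicted = pred, observed = iris$Species[-idx])                 # confusion matrix on the test sample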

Keywords: svm, support vector machine
Components: C-SVC, SVM, SUPERVISED LEARNING, CROSS-VALIDATION
Tutorial: en_Tanagra_Perfs_Comp_SVM.pdf
Dataset: wide_protein_classification.zip
Reference:
Wikipedia (en), « Support vector machine »

Wednesday, July 1, 2009

Self-organizing map (SOM)

A self-organizing map (SOM) or self-organizing feature map (SOFM) is a kind of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map. Self-organizing maps differ from other artificial neural networks in that they use a neighborhood function to preserve the topological properties of the input space.

In this tutorial, we show how to implement the Kohonen's SOM algorithm with Tanagra. We try to assess the properties of this approach by comparing the results with those of the PCA algorithm. Then, we compare the results to those of K-Means, which is a clustering algorithm. Finally, we implement the Two-step Clustering process by combining the SOM algorithm with the HAC process (Hierarchical Agglomerative Clustering). It is a variant of the Two-Step Clustering where we combine K-Means and HAC. We observe that the HAC primarily merges the adjacent cells.

Keywords: Kohonen, self organizing map, SOM, clustering, dimensionality reduction, k-means, hierarchical agglomerative clustering, hac, two-step clustering
Components: UNIVARIATE CONTINUOUS STAT, UNIVARIATE OUTLIER DETECTION, KOHONEN-SOM, PRINCIPAL COMPONENT ANALYSIS, SCATTERPLOT, K-MEANS, CONTINGENCY CHI-SQUARE, HAC
Tutorial: en_Tanagra_Kohonen_SOM.pdf
Dataset: waveform_unsupervised.xls
Reference:
Wikipedia, « Self organizing map », http://en.wikipedia.org/wiki/Self-organizing_map

Monday, June 29, 2009

Univariate outlier detection methods

The detection and treatment of outliers (individuals with unusual values) are an important task of data preparation. Unusual values can mislead the results of subsequent analyses.

Outliers can be detected on a single variable (a 158-year-old man) or on a combination of variables (a 12-year-old boy who runs the 100 yards in 10 seconds). In this tutorial, we show how to use the UNIVARIATE OUTLIER DETECTION component. It is intended for univariate outlier detection, i.e. it considers each variable individually.

The approaches implemented in the component come from the NIST website (see reference). We also use an additional rule based on an x-sigma deviation from the mean of the variable.
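Two simple univariate rules sketched in base R, the 1.5 x IQR box-plot rule and the x-sigma rule mentioned above:

x <- c(rnorm(100, mean = 25, sd = 3), 158)        # ages, with one aberrant value
boxplot.stats(x)$out                              # points beyond 1.5 times the IQR
x[abs(x - mean(x)) > 3 * sd(x)]                   # 3-sigma deviation from the mean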

Keywords: outlier, influential point
Components: MORE UNIVARIATE CONT STAT, SCATTERPLOT WITH LABEL, UNIVARIATE OUTLIER DETECTION, UNIVARIATE CONT STAT
Tutorial: en_Tanagra_Outliers_Detection.pdf
Dataset: body_mass_index.xls
References:
NIST/SEMATECH, « e-Handbook of Statistical Methods », Section 7.1.6, « What are outliers in the data ? »
R. High, "Dealing with 'Outliers': How to Maintain Your Data's Integrity"

Copy paste feature into the diagram

When we define a data analysis process in Tanagra, it is possible to copy components (or entire branches of components) to another location in the diagram. This feature is very helpful when we have to repeat sequences of treatments in different parts of the diagram. The settings are also duplicated.

In this tutorial, we show how to copy a component or a branch. We will see that this feature is helpful when, for instance, we compare the performance of supervised learning algorithms on the same dataset. In this context, the processing sequence is always the same; only the method we want to evaluate changes.

Here we work within a single project: we cannot copy-paste components between two open projects. But, in another tutorial, we show how to save part of the diagram in an external file, so that the same processing sequence can be applied to multiple datasets.

Keywords: copy paste, diagram management, comparison of classifiers, supervised learning, cross validation, dimensionality reduction
Components: Supervised learning, Binary logistic regression, C-PLS, C-SVC, Linear discriminant analysis, K-NN, Principal Component Analysis
Tutorial: en_Tanagra_Diagram_New_Features.pdf
Dataset: sonar.xls

Saturday, June 27, 2009

The A PRIORI MR component

Association rule learning is a popular method for discovering interesting relations between variables in large databases. It is often used in the market basket analysis domain, e.g. if a customer buys onions and potatoes, then he also buys beef. But it can be applied in many areas where we want to discover associations between variables.

We have already described the association rule mining tools of Tanagra in several tutorials. The A PRIORI approach is certainly the most popular but, despite its good properties, it has a drawback: the number of rules obtained can be very high. The ability to highlight the most interesting, relevant rules becomes a major challenge.

In this tutorial, we show how to implement the A PRIORI MR component. It differs from the others by offering additional tools for exploring and assessing the mined rules: original measures based on the "test value" principle evaluate the rules differently; copying the results into a spreadsheet allows a more detailed exploration of the rule base; and by subdividing the dataset into train and test sets, we obtain more reliable values for the interestingness measures of the rules.
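For comparison, a standard A PRIORI run with the arules package in R (the test-value based measures of A PRIORI MR are specific to Tanagra and are not computed here):

library(arules)
data(Groceries)                                                     # example transaction database
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 5))                          # top rules ranked by lift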

Keywords: association rule, a priori algorithm, interestingness measure, test value principle
Components: A PRIORI MR
Tutorial: en_Tanagra_APrioriMR_Component.pdf
Dataset: credit_assoc.xls
Reference:
Wikipedia, "Association rule learning"

Sunday, June 14, 2009

Two-step clustering for handling large databases

The aim of clustering is to identify homogeneous subgroups of instances in a population. In this tutorial, we implement a two-step clustering algorithm which is well suited to large datasets. It combines the ability of K-Means clustering to handle a very large dataset with the ability of hierarchical clustering (HCA, Hierarchical Cluster Analysis) to give a visual presentation of the results, the "dendrogram". The dendrogram describes the clustering process, starting from unrefined clusters until the whole dataset belongs to one cluster; it is especially helpful when we want to determine the appropriate number of clusters.

The implementation of two-step clustering (also called "hybrid clustering") in Tanagra has already been described elsewhere. Following the recommendation of Lebart et al. (2000), we run the clustering algorithms on the latent variables supplied by a PCA (Principal Component Analysis) computed from the original variables. This pre-treatment cleans the dataset by removing irrelevant information such as noise. In this tutorial, we show the efficiency of the approach on a large dataset with 500,000 observations and 68 variables. We use Tanagra 1.4.27 and R 2.7.2, which are the only tools that allow the whole process to be implemented easily.
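A compact base-R sketch of the same chain (PCA, then K-Means to build many small pre-clusters, then HAC on their centers), on a toy dataset:

X    <- scale(iris[, 1:4])
pcs  <- prcomp(X)$x[, 1:2]                        # latent variables from the PCA
km   <- kmeans(pcs, centers = 20, nstart = 5)     # many small pre-clusters
tree <- hclust(dist(km$centers), method = "ward.D2")
plot(tree)                                        # dendrogram used to choose the number of clusters
groups <- cutree(tree, k = 3)[km$cluster]         # final cluster of each observation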

Keywords: clustering, hierarchical cluster analysis, HCA, k-means, principal component analysis, PCA
Components: PRINCIPAL COMPONENT ANALYSIS, K-MEANS, HAC, GROUP CHARACTERIZATION, EXPORT DATASET
Tutorial: en_Tanagra_CAH_Mixte_Gros_Volumes.pdf
Dataset: sample-census.zip
References:
L. Lebart, A. Morineau, M. Piron, « Statistique Exploratoire Multidimensionnelle », Dunod, 2000; chapter 2, sections 2.3 and 2.4.
D. Garson, "Cluster Analysis" from North Carolina State University.

Thursday, June 11, 2009

K-Means - Comparison of free tools

K-means is a clustering (unsupervised learning) algorithm. The aim is to create homogeneous subgroups of examples. The individuals in the same subgroup are similar; the individuals in different subgroups are as different as possible.

The K-Means approach is already described in several tutorials (http://data-mining-tutorials.blogspot.com/search?q=k-means). The goal here is to compare its implementation with various free tools. We study the following tools: Tanagra 1.4.28; R 2.7.2 without additional package; Knime 1.3.5; Orange 1.0b2 and RapidMiner Community Edition.

Keywords: clustering, k-means, PCA, principal component analysis, MDS, multidimensional scaling
Components: PRINCIPAL COMPONENT ANALYSIS, K-MEANS, GROUP CHARACTERIZATION, EXPORT DATASET
Tutorial: en_Tanagra_et_les_autres_KMeans.pdf
Dataset: cars_dataset.zip
Reference:
D. Garson, "Cluster Analysis"

Saturday, May 30, 2009

Understanding the "test value" criterion

The test value (VT) is a criterion used in many components of Tanagra. It is mainly used to characterize a group of observations according to a continuous or categorical variable. The groups may be defined by the categories of a discrete variable; they can also be produced by a machine learning algorithm (e.g. clustering, a node of a decision tree, etc.).

The principle is elementary: we compare the value of a descriptive statistic computed on the whole sample with its value computed on the subsample corresponding to the group. For a continuous variable, we compare the means; for a discrete one, we compare the proportions.

Despite, or because of, its simplicity, the VT is very useful. The formulation presented in this tutorial is taken from the book by Lebart et al. (2000). The VT is used intensively in some commercial software such as SPAD (http://eng.spad.eu/). It allows groups to be characterized, but it can also be used to strengthen the interpretation of the factors extracted by a factor analysis process.

In this tutorial, we emphasize the formulas used for both categorical and continuous variables, and we relate them to the results provided by the GROUP CHARACTERIZATION component of Tanagra.
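A sketch of the computation for a continuous variable, following the usual Lebart et al. formulation (group mean compared to the overall mean, with a finite-population correction); this is an illustration, not Tanagra's exact code:

test_value <- function(x, in_group) {
  N <- length(x); n <- sum(in_group)
  s2 <- var(x) * (N - 1) / N                                 # empirical variance on the whole sample
  (mean(x[in_group]) - mean(x)) / sqrt(s2 / n * (N - n) / (N - 1))
}
test_value(iris$Sepal.Length, iris$Species == "virginica")   # VT of one group for one variable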

Keywords: test value, group characterization, clustering, factorial analysis
Components: Group characterization
Tutorial: en_Tanagra_Comprendre_La_Valeur_Test.pdf
Dataset: heart_disease_male.xls
Reference:
L. Lebart, A. Morineau, M. Piron, « Statistique exploratoire multidimensionnelle », Dunod, 2000 ; pages 181 to 184.

Friday, May 29, 2009

Descriptive statistics (continued)

The aim of descriptive statistics is to describe the main features of a collection of data in quantitative terms. Viewing the whole data table is seldom useful; it is preferable to summarize the characteristics of the data with a few well-chosen numerical indicators.

In this tutorial, we distinguish two kinds of descriptive approaches: univariate tools, which summarize the characteristics of each variable individually, and bivariate tools, which characterize the association between two variables. Depending on the type of the variables (categorical or continuous), different indicators are used.
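The base-R counterparts of these univariate and bivariate summaries fit in a few lines (built-in datasets used for illustration):

summary(iris$Sepal.Length)                       # univariate, continuous variable
table(iris$Species)                              # univariate, categorical variable
cor(iris$Sepal.Length, iris$Petal.Length)        # bivariate, two continuous variables
chisq.test(table(mtcars$cyl, mtcars$am))         # bivariate, two categorical variables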

Keywords: descriptive statistics
Components: UNIVARIATE DISCRETE STAT, CONTINGENCY CHI-SQUARE, UNIVARIATE CONTINUOUS STAT, SCATTERPLOT, LINEAR CORRELATION, GROUP CHARACTERIZATION
Tutorial: en_Tanagra_Descriptive_Statistics.pdf
Dataset: enquete_satisfaction_femmes_1953.xls
References:
Tanagra Tutorials, "Descriptive statistics"

Friday, May 1, 2009

ID3 on a large dataset

In the data mining domain, the increasing size of datasets has been one of the major challenges of recent years. The ability to handle large datasets is an important criterion for distinguishing between research and commercial software.

Commercial tools often have very efficient data management systems that limit the amount of data loaded into memory at each step of the treatment. Research tools, on the contrary, keep all the data in memory, so the limit is clearly the memory capacity of the machine. This is certainly a drawback for the treatment of large files. We note, however, that very powerful computers are now available at low cost, so this limit keeps receding: with an appropriate encoding strategy, we can fit the whole dataset in memory even when handling a large data file.

In this tutorial, we show how to import a file with 581,012 observations and 55 variables, and then how to build a decision tree with the ID3 method. Unlike other decision tree algorithms such as C4.5 or CART, ID3 determines the right size of the tree with a pre-pruning rule. We will see that this makes the computation fast.

Keywords: large dataset, decision tree algorithm, ID3
Components: ID3, SPV LEARNING
Tutorial: en_Tanagra_Big_Dataset.pdf
Dataset: covtype.zip
References:
Tanagra tutorials, "Performance comparison under Linux"
Tanagra Tutorials, "Decision tree and large dataset"

Thursday, April 30, 2009

Principal Component Analysis (PCA)

PCA belongs to the family of factor analysis approaches. It is used to discover the underlying structure of a set of variables: it reduces the attribute space from a larger number of variables to a smaller number of factors (dimensions) and, as such, is a "non-dependent" procedure, i.e. no dependent variable is specified.

In this tutorial, we show how to implement this approach and how to interpret the results with Tanagra. We use the AUTOS_ACP.XLS dataset from SAPORTA's state-of-the-art book. The interest of this dataset is that we can compare our results with those described in the book (pages 177 to 181). We simply show the sequence of operations and how to read the result tables; for the detailed interpretation, it is best to refer to the book.
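For readers working in R, a minimal sketch of a standardized PCA and of the variable-factor correlations behind the correlation circle (built-in dataset; the tutorial itself uses autos_acp.xls with Tanagra):

pca <- prcomp(USArrests, scale. = TRUE)     # standardized PCA
summary(pca)                                # proportion of variance per component
cor(USArrests, pca$x[, 1:2])                # correlations used for the correlation circle
biplot(pca)                                 # joint map of individuals and variables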

Keywords: factor analysis, principal component analysis, correlation circle
Components: Principal Component Analysis, View Dataset, Scatterplot with labels, View multiple scatterplot
Tutorial: en_Tanagra_Acp.pdf
Dataset: autos_acp.xls
References:
G. Saporta, " Probabilités, Analyse de données et Statistique ", Dunod, 2006 ; pages 177 to 181.
D. Garson, "Factor Analysis".
Statsoft Textbook, "Principal components and factor analysis".

Multiple Correspondence Analysis (MCA)

Multiple correspondence analysis is a factor analysis approach for tabular datasets in which a set of examples is described by a set of categorical variables. The aim is to map the dataset into a reduced-dimension space (usually two dimensions) that highlights the associations between the examples and the variables. It is useful for understanding the underlying structure of a tabular dataset.

In this tutorial, we show how to implement this approach and how to interpret the results with Tanagra. The possibility of copying/pasting the results into a spreadsheet is certainly one of the most interesting functionalities of the software: it gives access to familiar tools (sorting, formatting, etc.) in an environment well known to data analysts. For example, being able to sort the various tables according to the contributions and the COS2 proves really practical when one wishes to interpret the dimensions.
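A small MCA sketch with MASS::mca on an all-categorical built-in data frame (the tutorial itself analyzes races_canines_acm.xls with Tanagra):

library(MASS)
data(farms)                 # 20 farms described by 4 categorical variables
fit <- mca(farms, nf = 2)   # multiple correspondence analysis, 2 factors
plot(fit)                   # joint map of observations and categories
fit$cs                      # coordinates of the categories on the factors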

Keywords: factor analysis, multiple correspondence analysis
Components: Multiple correspondance analysis, View Dataset, Scatterplot with labels, View multiple scatterplot
Tutorial: en_Tanagra_Acm.pdf
Dataset: races_canines_acm.xls
References:
M. Tenenhaus, " Méthodes statistiques en gestion ", Dunod, 1996 ; pages 212 to 222 (in French).
Statsoft Inc., "Multiple Correspondence Analysis".
D. Garson, "Statnotes - Correspondence Analysis".

Sunday, April 26, 2009

Support Vector Regression (SVR)

Support Vector Machines (SVM) are a well-known approach in the machine learning community, usually implemented for a classification problem in a supervised learning framework. But SVM can also be used for regression, where we want to predict or explain the values taken by a continuous target attribute; we then speak of Support Vector Regression (SVR).

The method is not widespread among statisticians, yet it has qualities that rank it favorably compared with existing techniques. It behaves well even when the ratio between the number of variables and the number of observations becomes very unfavorable and the predictors are highly correlated. Another advantage is the kernel principle (the famous "kernel trick"): it is possible to build a non-linear model without explicitly producing new descriptors. A deeper study of the characteristics of the method allows comparisons with penalized regression such as ridge regression.

The first aim of this tutorial is to show how to use the two new SVR components of Tanagra version 1.4.31. They are based on the famous LIBSVM library, which we already use for classification (see the C-SVC component). We compare our results with those of the R software (version 2.8.0), using the e1071 package, which is also based on LIBSVM.

The second aim is to propose a new assessment component for regression. In the supervised learning framework, it is usual to split the dataset into two parts, the first for learning and the second for evaluation, in order to obtain an unbiased estimate of performance. We can implement the same approach for regression; the procedure is even essential when we compare models of different complexities (different degrees of freedom). We will see in this tutorial that the usual indicators computed on the learning data are highly misleading in certain situations: we must use an independent test set when we want to assess a model.
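A minimal e1071 sketch of epsilon-SVR compared with linear regression on a held-out test set, in the spirit of the assessment step described above (built-in dataset, illustrative parameters):

library(e1071)
set.seed(1)
idx  <- sample(nrow(mtcars), 22)                                                    # train sample
svr  <- svm(mpg ~ ., data = mtcars[idx, ], type = "eps-regression", kernel = "radial")
ols  <- lm(mpg ~ ., data = mtcars[idx, ])
test <- mtcars[-idx, ]
c(rmse_svr = sqrt(mean((test$mpg - predict(svr, test))^2)),                         # test-set RMSE
  rmse_ols = sqrt(mean((test$mpg - predict(ols, test))^2)))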

Keywords: support vector regression, support vector machine, regression, linear regression, regression assessment, R software, package e1071
Components: MULTIPLE LINEAR REGRESSION, EPSILON SVR, NU SVR, REGRESSION ASSESSMENT
Tutorial: en_Tanagra_Support_Vector_Regression.pdf
Dataset: qsar.zip
References:
C.C. Chang, C.J. Lin, "LIBSVM - A Library for Support Vector Machines".
S. Gunn, « Support Vector Machine for Classification and Regression », Technical Report of the University of Southampton, 1998.
A. Smola, B. Scholkopf, « A tutorial on Support Vector Regression », 2003.

Thursday, April 23, 2009

Launching Tanagra from OOo Calc under Linux

The integration of Tanagra into a spreadsheet, such as Excel or Open Office Calc (OOo Calc or OOCalc), is undoubtedly an advantage. Without any special knowledge of database formats, the user can handle the dataset in a familiar environment, the spreadsheet, and send it to a specialized data mining tool when he wants to carry out more sophisticated analyses.

The add-on for OOCalc was initially created for Windows. Recently, I described the installation and use of Tanagra under Linux. The next step is of course the integration of Tanagra into OOCalc under Linux. Thierry Leiber did this work for version 1.4.31 of Tanagra by extending the existing add-on: we can now launch Tanagra from OOCalc under both Windows and Linux. The add-on was tested under the following configurations: Windows XP + OOCalc 3.0.0; Windows Vista + OOCalc 3.0.1; Ubuntu 8.10 + OOCalc 2.4; Ubuntu 8.10 + OOCalc 3.0.1.

This document extends a previous tutorial, but we now work in the Linux environment (Ubuntu 8.10). All the screenshots are in French because my OS is in French, but I think the process is the same for other language configurations of Linux.

Keywords: open office calc, add-on, principal component analysis, PCA, correlation circle, illustrative variable, linux, ubuntu 8.10 intrepid ibex
Components: PRINCIPAL COMPONENT ANALYSIS, CORRELATION SCATTERPLOT
Tutorial: en_Tanagra_OOCalc_under_Linux.pdf
Dataset: cereals.xls
References:
Tanagra, « Connection with Open Office Calc »
Tanagra, « Tanagra under Linux »

Wednesday, April 15, 2009

Tanagra - Version 1.4.31

Thierry Leiber has improved the add-on that connects Tanagra and Open Office. It is now possible, under Linux, to install the add-on for Open Office and launch Tanagra directly after selecting the data (see the tutorials on installing Tanagra under Linux and on integrating the add-on in Open Office Calc). Thierry, thank you very much for this contribution, which helps the users of Tanagra.

Following a suggestion from Mr. Laurent Bougrain, the confusion matrix has been added to the automatic saving of results in experiments. Thank you to Laurent, and to all the others who, through their constructive comments, help me improve Tanagra in the right direction.

In addition, two new components for regression using the support vector machine principle (support vector regression) have been added: Epsilon-SVR and Nu-SVR. A tutorial showing these methods and comparing our results with the R software will be available soon. Tanagra, like the R package "e1071", is based on the famous LIBSVM library.

Tutorials about these releases are coming soon.

Thursday, March 19, 2009

Cost-sensitive learning - Comparison of tools

Everyone agrees that taking misclassification costs into consideration is an important aspect of data mining practice. For instance, diagnosing a disease in a healthy person does not have the same consequences as predicting health for an ill person. Yet despite its importance, the topic is seldom addressed, both from a theoretical point of view (how to integrate costs during the evaluation of models, which is easy, and during their construction, which is a little less easy) and from a practical point of view (how to implement the approach in software).

Using misclassification costs during classifier evaluation is easy: we take the cross-product of the misclassification cost matrix and the confusion matrix, which yields an "expected misclassification cost" (or an expected gain if we multiply the result by -1). Its interpretation is not very intuitive; it is mainly used for comparing models.

Handling costs during the learning process is less usual. Several approaches are possible. In this tutorial, we show how to use some Tanagra components intended for cost-sensitive supervised learning on a realistic dataset. We also programmed the same procedures in the R software (http://www.r-project.org/) to make clearer what is actually implemented, and we compare our results with those of Weka. The algorithm underlying our analysis is a decision tree; depending on the software, we use C4.5, CART or J48.
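The rpart package cited above accepts a misclassification loss matrix directly; a minimal sketch on a two-class subset of a built-in dataset (costs are illustrative):

library(rpart)
dat  <- droplevels(iris[iris$Species != "setosa", ])        # two-class problem
loss <- matrix(c(0, 5,
                 1, 0), nrow = 2, byrow = TRUE)             # misclassifying the first class costs 5 times more
fit  <- rpart(Species ~ ., data = dat, parms = list(loss = loss))
table(observed = dat$Species, predicted = predict(fit, dat, type = "class"))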

Keywords: supervised learning, cost sensitive learning, misclassification cost matrix, decision tree algorithm, Weka 3.5.8, R 2.8.0, rpart package
Tutorial: en_Tanagra_Cost_Sensitive_Learning.pdf
Dataset: dataset-dm-cup-2007.zip
References:
J.H. Chauchat, R. Rakotomalala, M. Carloz, C. Pelletier, "Targeting Customer Groups using Gain and Cost Matrix: a Marketing Application", PKDD-2001.
"Cost-sensitive Decision Tree", Tutorials for Sipina.

Thursday, February 26, 2009

Predictive association rules

Association rule extraction algorithms were originally developed to find logical relations between variables with the same status. Predictive association rules instead seek to generate associations of items that characterize a dependent attribute: we are in a supervised learning framework.

Basically, the algorithm is not really modified: the exploration is simply limited to itemsets that include the dependent variable, which reduces the computation time. Two components of Tanagra are dedicated to this task, SPV ASSOC RULE and SPV ASSOC TREE, available in the Association tab. Compared with conventional approaches, they introduce an additional specificity: we can specify the class value ("dependent variable = value") that we wish to predict. The benefit is that the parameters of the algorithm can be tuned finely, in direct relation to the characteristics of the data. This is crucial, for example, when the prior probabilities of the dependent variable values are very different.

We had already presented the SPV ASSOC TREE component elsewhere, but in the context of multivariate characterization of groups of individuals (from a clustering algorithm, for instance), where we compared it with the GROUP CHARACTERIZATION component. In this tutorial, we compare the behavior of SPV ASSOC TREE and SPV ASSOC RULE in a prediction task. We highlight their shared properties, the problems they can handle, and their differences. SPV ASSOC RULE, which supplies original rule interestingness measures (the "test value" indicator), can also simplify the rule base.
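In R, the arules package offers a comparable restriction through its appearance argument, which keeps only rules whose consequent is a chosen item (the item label below is illustrative):

library(arules)
data(Groceries)
rules <- apriori(Groceries,
                 parameter  = list(supp = 0.005, conf = 0.3),
                 appearance = list(rhs = "whole milk", default = "lhs"))    # consequent fixed to one class value
inspect(head(sort(rules, by = "confidence"), 5))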

Keywords: predictive association rules, interestingness measure, rule base ranking, rule base simplification
Components: SPV ASSOC TREE, SPV ASSOC RULE
Tutorial: en_Tanagra_Predictive_AssocRules.pdf
Dataset: credit_assoc.xls

Sunday, February 22, 2009

Interestingness measures for association rules

This document outlines the measures proposed by the A PRIORI MR and SPV ASSOC RULE components for assessing association rules. They come from studies reported in several publications by A. Morineau and R. Rakotomalala.

A measure characterizes the relevance of a rule. It can be used to rank rules, and it should also help to distinguish those that are "significantly interesting" from those that are irrelevant. This last point is still entirely prospective: there is no really satisfactory solution at this time.

The A PRIORI MR and SPV ASSOC RULE components are experimental tools for evaluating the rules extracted by the association rule induction algorithms. They allow the rules to be evaluated with measures based on the test value principle.

Keywords: association rules, interestingness measure, test value
Components: A PRIORI MR, SPV ASSOC RULE
Tutorial: en_Tanagra_APrioriMR_Measures.pdf
References:
R. Rakotomalala, A. Morineau, 2008. “The TVpercent principle for the counterexamples statistic”, in Statistical Implicative Analysis, Studies in Computational Intelligence Series, 127, 449-462, Springer, 2008 -- http://www.springerlink.com/content/g245317206950529/
Wikipedia, "Association rule learning"

Tuesday, January 27, 2009

Performance comparison under Linux

The gain chart is an alternative to the confusion matrix for the evaluation of a classifier. Its name sometimes differs according to the tool (e.g. lift curve, lift chart, cumulative gain chart, etc.).

The main idea is to build a graph where the X coordinate is the percentage of the population and the Y coordinate is the percentage of positive values of the class attribute. The gain chart is mainly used in the marketing domain, where we want to detect potential customers, but it can be used in other situations.

The construction of the gain chart was already outlined in a previous tutorial (see http://data-mining-tutorials.blogspot.com/2008/11/lift-curve-coil-challenge-2000.html). In this tutorial, we extend the description to other data mining tools (Knime, RapidMiner, Weka and Orange). The second originality of this tutorial is that we run the experiment under Linux (French version of Ubuntu 8.10; see http://data-mining-tutorials.blogspot.com/2009/01/tanagra-under-linux.html for the installation and use of Tanagra under Linux). The third is that we handle a large dataset with 2,000,000 examples and 41 variables. It is very interesting to study the behavior of these tools in this configuration, especially since our computer is not really powerful. We note that some tools failed to complete the analysis on the full dataset.

Keywords: scoring, linear discriminant analysis, naive bayes classifier, lift curve, gain chart, cumulative gain chart, knime, rapidminer, weka, orange
Components: SAMPLING, LINEAR DISCRIMINANT ANALYSIS, SCORING, LIFT CURVE
Tutorial: en_Tanagra_Gain_Chart.pdf
Dataset: dataset_gain_chart.zip

Saturday, January 24, 2009

Sipina under Linux

In a recent tutorial, we showed that it is possible to work with Tanagra under Linux using Wine. In this document, we run Sipina (a data mining tool intended for decision tree induction) in the same framework, i.e. we install and use Sipina in a Linux environment, using the Ubuntu distribution (French version 8.10). All the functionalities of Sipina are available, especially the interactive tools which allow us to explore in depth the subpopulation at a node of the tree.

In this tutorial, we go through the following steps: (1) installing Sipina under Linux; (2) launching the software; (3) loading a dataset (text file with tab separator); (4) choosing the class attribute and the predictive variables; (5) partitioning the dataset into a train set and a test set; (6) computing the tree on the train set; (7) evaluating the tree on the test set, e.g. computing the confusion matrix, the error rate, etc.; (8) exploring the subpopulation at a node of the tree; (9) launching a new analysis on the subpopulation at a node of the tree.

We describe the various features of the software only briefly in this tutorial; they are already presented in several documents available online (http://eric.univ-lyon2.fr/~ricco/sipina.html, see the DOWNLOAD section). Our main goal here is to show the capabilities of Sipina under Linux.

We use the French Ubuntu 8.10 distribution, on which we have also installed Wine, a program which allows Windows programs to run under Linux.

Keywords: linux, ubuntu, wine, sipina, decision tree
Tutorial: en_Sipina_under_Linux.pdf
References:
Ubuntu, http://www.ubuntu.com/
Wine, https://help.ubuntu.com/community/Wine

Tuesday, January 13, 2009

Tanagra under Linux

Users sometimes ask: "Can I use Tanagra under Linux?" The answer is YES and NO.

NO, we cannot run Tanagra natively under Linux: it is a 32-bit Windows program.

But YES, we can run Tanagra under Linux using WINE, a famous Linux application which allows us to run Windows programs on Linux. We can then take full advantage of Tanagra without worrying about compatibility issues.

In this tutorial, we show how to install and run Tanagra under Ubuntu (a free of charge version of Linux) using WINE. We can fully use Tanagra in the Linux environment.

Keywords: linux, ubuntu, wine
Tutorial: en_Tanagra_under_Linux.pdf
References:
Ubuntu, http://www.ubuntu.com/
Wine, https://help.ubuntu.com/community/Wine