This blog presents the tutorials about Tanagra in an alternative layout. Each entry briefly describes the subject and is followed by a link to the tutorial (PDF) and to the dataset. The technical references (books, papers, websites, ...) are also provided. In some tutorials, we compare the results of Tanagra with those of other free software such as Knime, Orange, R, Python, Sipina or Weka.
Thursday, December 24, 2009
VARIMAX rotation in Principal Component Analysis
The goal of the rotation is to associate each variable with at most one factor, which simplifies the interpretation of the PCA results. Ideally, each variable is then associated with one and only one factor, so that the variables are split (as far as possible) into disjoint sets.
In this tutorial, we show how to perform this kind of rotation from the results of a standard PCA in Tanagra.
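For readers who want to reproduce the computation outside Tanagra, here is a minimal R sketch of a PCA followed by a VARIMAX rotation; a built-in dataset stands in for the crime data:

```r
# Normed PCA followed by a VARIMAX rotation (base R only).
X <- scale(USArrests)                 # standardized variables, as in a normed PCA
pca <- prcomp(X)

# Loadings = correlations between the variables and the first two factors.
loadings <- pca$rotation[, 1:2] %*% diag(pca$sdev[1:2])

rot <- varimax(loadings)              # VARIMAX rotation
print(rot$loadings)                   # each variable now loads mainly on one factor
```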
Keywords: PCA, principal component analysis, VARIMAX, QUARTIMAX
Components : Principal Component Analysis, Factor Rotation
Tutorial: en_Tanagra_Pca_Varimax.pdf
Dataset: crime_dataset_from_DASL.xls
References:
Tanagra, "New features for PCA in Tanagra"
Tanagra, "Principal Component Analysis (PCA)"
Wikipedia, "Varimax rotation"
H. Abdi, "Factor rotations in Factor Analyses"
Sunday, December 20, 2009
Kruskal–Wallis one-way analysis of variance
We talk about nonparametric tests when we make no assumption about the shape of the distribution of the dependent variable. They are considered "distribution free" methods, in contrast to the parametric approaches.
In this tutorial, we implement various tests for differences in location. The Kruskal-Wallis test is certainly the most widely used when we want to determine whether the scores among groups are stochastically the same. But other tests exist, and we compare the results they give. We complete the analysis by conducting multiple comparisons in order to identify the groups that differ significantly from each other.
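As a rough R equivalent (a built-in dataset replaces the wine evaluation data), the global test and the post-hoc pairwise comparisons can be sketched as follows:

```r
# Kruskal-Wallis test followed by post-hoc pairwise comparisons.
data(chickwts)

# Global test: are the weight distributions stochastically the same across groups?
kruskal.test(weight ~ feed, data = chickwts)

# Multiple comparisons: pairwise Wilcoxon tests with p-value adjustment.
pairwise.wilcox.test(chickwts$weight, chickwts$feed, p.adjust.method = "holm")
```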
Keywords: non parametric test, independent samples, Kruskal-Wallis, Van der Waerden, Fisher-Yates-Terry-Hoeffding, median test, tests for differences in location
Components: KRUSKAL-WALLIS 1-WAY ANOVA, MEDIAN TEST, VAN DER WAERDEN 1-WAY ANOVA, FYTH 1-WAY ANOVA
Tutorial: en_Tanagra_Nonparametric_Test_KW_and_related.pdf
Dataset: wine_evaluation_nonparametric.xls
References:
R. Lowry, "Concepts and Applications of Inferential Statistics", subchapter 14a, "The Kruskal-Wallis Test for 3 or More Independent Samples".
Wikipedia, "Kruskal–Wallis one-way analysis of variance".
Thursday, December 17, 2009
Tests for differences in scale
Tests of equal variability (or dispersion, or scale, or simply variance) are often presented as a preliminary step before the comparison of means, in order to verify the homoscedasticity assumption. But this is not their only purpose: comparing dispersions can be an end in itself. For example, suppose we wish to compare the performance of two heating systems. The average temperature at the center of the room is the same; however, we may want to compare how the heat is diffused in the different parts of the room.
The parametric tests rely primarily on the Gaussian distribution; the test then becomes a test for homogeneity of variance. We highlight the Levene test in this tutorial. Other tests exist (the Bartlett test for instance), and we mention them as well.
When the normality assumption is questionable, when the sample size is small, or when the variable is ordinal rather than continuous, it is more appropriate to use nonparametric tests. These are called tests for equality of scale or dispersion; indeed, the procedures are not based on estimated variances. We will use well-known techniques such as the Ansari-Bradley, Mood and Klotz tests. Being nonparametric, they have a broader scope. Some of these tests have a drawback: they are not applicable when the conditional distributions do not share the same parameter of central tendency (the median in general, but we can adjust the values by centering them on the median).
In this tutorial, we show how to implement these various tests with Tanagra.
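For orientation, here is a minimal R sketch of the same family of tests on simulated data (two groups with equal centers but different dispersions):

```r
# Parametric and nonparametric tests for differences in scale.
set.seed(1)
g1 <- rnorm(50, mean = 20, sd = 1)
g2 <- rnorm(50, mean = 20, sd = 3)
x   <- c(g1, g2)
grp <- factor(rep(c("A", "B"), each = 50))

bartlett.test(x, grp)     # parametric, based on the Gaussian assumption
ansari.test(g1, g2)       # nonparametric, Ansari-Bradley
mood.test(g1, g2)         # nonparametric, Mood

# Levene / Brown-Forsythe idea: ANOVA on absolute deviations from group medians.
z <- abs(x - ave(x, grp, FUN = median))
anova(lm(z ~ grp))
```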
Keywords: parametric test, non parametric test, independent samples, Levene test, Bartlett test, Brown-Forsythe test, Mood test, Klotz test, Ansari-Bradley test
Components: LEVENE’S TEST, ANSARI-BRADLEY SCALE TEST, MOOD SCALE TEST, KLOTZ SCALE TEST
Tutorial: en_Tanagra_Nonparametric_Test_for_Scale_Differences.pdf
Dataset: tests_for_scale_differences.xls
References:
NIST, "Quantitative techniques", section 1.3.5 - http://www.itl.nist.gov/div898/handbook/eda/section3/eda35.htm
Wednesday, December 9, 2009
Outliers and influential points in regression
In this tutorial, we implement several indicators for the analysis of outliers and influential points. To avoid confusion about the definitions of the indicators (some are calculated differently from one tool to another), we compare our results with state-of-the-art tools such as SAS and R. First, we give the results described in the SAS documentation. Then, we describe the process and the results with Tanagra and R. In conclusion, we note that these tools give the same results.
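All the indicators studied here have direct counterparts in base R; a minimal sketch on a built-in dataset:

```r
# Outlier and influence diagnostics for a linear regression.
fit <- lm(dist ~ speed, data = cars)   # built-in dataset as a stand-in

rstandard(fit)        # standardized residuals
rstudent(fit)         # studentized residuals
hatvalues(fit)        # leverage
dffits(fit)           # DFFITS
cooks.distance(fit)   # Cook's distance
covratio(fit)         # COVRATIO
dfbetas(fit)          # DFBETAS, one column per coefficient

# All at once, with influential observations flagged:
summary(influence.measures(fit))
```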
Keywords: linear regression, outliers, influential points, standardized residuals, studentized residuals, leverage, dffits, cook's distance, covratio, dfbetas, R software
Components: Multiple linear regression, Outlier detection, DfBetas
Tutorial: en_Tanagra_Outlier_Influential_Points_for_Regression.pdf
Dataset: USPopulation.xls
References:
SAS STAT User's Guide, "The REG Procedure – Predicted and Residual Values".
Monday, December 7, 2009
Tests for comparing two related samples
The aim of tests for related samples is to exclude the within-group variation from the analysis; the differences are computed within each pair of subjects. In this tutorial, we show how to implement three tests for two related samples. Two of them are nonparametric (the sign test and the Wilcoxon matched-pairs signed-rank test), the last one is the parametric t-test for related samples.
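A minimal R sketch of the three tests, on simulated paired data:

```r
# Three tests for two related samples.
set.seed(1)
before <- rnorm(20, mean = 10)
after  <- before + rnorm(20, mean = 0.5)
d <- after - before

# Sign test: a binomial test on the signs of the non-null differences.
binom.test(sum(d > 0), sum(d != 0))

# Wilcoxon matched-pairs signed-rank test.
wilcox.test(after, before, paired = TRUE)

# Parametric paired t-test.
t.test(after, before, paired = TRUE)
```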
Keywords: parametric test, non-parametric test, paired samples, sign test, wilcoxon signed rank test, paired samples t-test, normality test
Components: SIGN TEST, WILCOXON SIGNED RANK TEST, PAIRED T-TEST, FORMULA, NORMALITY TEST
Tutorial: en_Tanagra_Nonparametric_Test_for_Two_Related_Samples.pdf
Dataset: comparison_2_related_samples.xls
References:
R. Lowry, "Concepts and Applications of Inferential Statistics", subchapter 12a, "The Wilcoxon Signed-Rank Test".
Wednesday, December 2, 2009
Multivariate tests for comparing populations
A multivariate test for comparing populations tries to determine whether K (K >= 2) samples come from the same underlying population according to a set of variables of interest (X1, ..., Xp).
We talk about a parametric test when we assume that the data come from a given type of probability distribution; the inference then relies on the parameters of the distribution. For instance, if we assume that the data are drawn from a multivariate Gaussian distribution, the hypothesis testing relies on the mean vector or on the covariance matrix.
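As an illustration of the idea, here is a sketch of the two-sample Hotelling's T2 test on the mean vectors, computed by hand in R under the equal-covariance assumption, followed by the equivalent one-way MANOVA:

```r
# Two-sample Hotelling's T2 test (equal covariance matrices assumed).
set.seed(1)
X1 <- matrix(rnorm(30 * 3), ncol = 3)                # group 1: 30 obs, 3 variables
X2 <- matrix(rnorm(25 * 3, mean = 0.5), ncol = 3)    # group 2: shifted mean vector
n1 <- nrow(X1); n2 <- nrow(X2); p <- ncol(X1)

S  <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)  # pooled covariance
d  <- colMeans(X1) - colMeans(X2)
T2 <- drop((n1 * n2) / (n1 + n2) * t(d) %*% solve(S) %*% d)

Fstat <- (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * T2
pf(Fstat, p, n1 + n2 - p - 1, lower.tail = FALSE)    # p-value of the test

# The same comparison through a one-way MANOVA:
grp <- factor(rep(1:2, c(n1, n2)))
summary(manova(rbind(X1, X2) ~ grp), test = "Hotelling-Lawley")
```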
Keywords: Hotelling's T2, Wilks' Lambda, Box’s M test, Bartlett's test, mean vector, covariance matrix, MANOVA
Components: UNIVARIATE CONTINUOUS STAT, HOTELLING’S T2, HOTELLING’S T2 HETEROSCEDASTIC, BOX’S M TEST, ONE-WAY MANOVA
Tutorial: en_Tanagra_Multivariate_Parametric_Tests.pdf
Dataset: credit_approval.xls
References:
S. Rathburn, A. Wiesner, "STAT 505: Applied Multivariate Statistical Analysis", The Pennsylvania State University.
Monday, November 30, 2009
Parametric tests for comparing populations
Tests for comparing populations try to determine whether K (K >= 2) samples come from the same underlying population according to a variable of interest (X). We talk about a parametric test when we assume that the data come from a given type of probability distribution; the inference then relies on the parameters of the distribution. For instance, if we assume that the distribution of the data is Gaussian, the hypothesis testing relies on the mean or on the variance.
We handle univariate tests in this tutorial, i.e. we have only one variable of interest. When we want to analyze several variables simultaneously, we talk about multivariate tests.
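A compact R sketch of several of these tests on simulated data:

```r
# Univariate parametric comparisons for two and K independent samples.
set.seed(1)
x <- rnorm(40, mean = 10, sd = 2)
y <- rnorm(35, mean = 11, sd = 2)

var.test(x, y)                    # Fisher's F test for equal variances
t.test(x, y, var.equal = TRUE)    # t-test assuming equal variances
t.test(x, y)                      # Welch t-test (unequal variances)

val <- c(x, y, rnorm(30, mean = 12, sd = 2))
grp <- factor(rep(1:3, c(40, 35, 30)))
bartlett.test(val, grp)           # homogeneity of variances for K groups
anova(lm(val ~ grp))              # one-way ANOVA
oneway.test(val ~ grp)            # Welch ANOVA for unequal variances
```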
Keywords: t-test, F-Test, Bartlett's test, Levene's test, Brown-Forsythe's test, independent samples, dependent samples, paired samples, matched-pairs samples, anova, welch's anova, randomized complete blocks
Components: MORE UNIVARIATE CONT STAT, NORMALITY TEST, T-TEST, T-TEST UNEQUAL VARIANCE, ONE-WAY ANOVA, WELCH ANOVA, FISHER’S TEST, BARTLETT’S TEST, LEVENE’S TEST, BROWN-FORSYTHE TEST, PAIRED T-TEST, PAIRED V-TEST, ANOVA RANDOMIZED BLOCKS
Tutorial: en_Tanagra_Univariate_Parametric_Tests.pdf
Dataset: credit_approval.xls
References:
NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/ (Chapter 7, Product and Process Comparisons)
Thursday, November 26, 2009
Three curves for classifier assessment
Tutorial: en_Tanagra_Spv_Learning_Curves.pdf
Dataset: heart_disease_for_curves.zip
Sunday, November 22, 2009
Tanagra - Version 1.4.34
The DECISION LIST component has been improved, we changed the test done during the pre-pruning process. The formula is described in the tutorial above.
The SAMPLING and STRATIFIED SAMPLING components (Instance Selection tab) have been slightly modified. It is now possible to set the seed of the pseudorandom number generator ourselves.
Following a remark from Anne Viallefont, the calculation of the degrees of freedom in tests on contingency tables is now more generic. Indeed, the calculation was wrong when the dataset was filtered and some margins (row or column) contained a count equal to zero. Anne, thank you for this information. More generally, thank you to everyone who sends me comments. Programming has always been a kind of leisure for me. The real work starts when it is necessary to check the results, compare them with the available references, cross-check them with other data mining tools, free or not, understand the possible differences, etc. At this step, your help is really valuable.
Monday, November 9, 2009
Handling Missing values in SIPINA
Various techniques are available for handling missing values in SIPINA. In this tutorial, we show how to implement them, and what their consequences are in the decision tree learning context (C4.5 algorithm; Quinlan, 1993).
Keywords: missing value, missing data, listwise deletion, casewise deletion, data imputation, C4.5, decision tree
Tutorial: en_Sipina_Missing_Data.pdf
Dataset: ronflement_missing_data.zip
References:
P.D. Allison, "Missing Data", in Quantitative Applications in the Social Sciences Series n°136, Sage University Paper, 2002.
J. Bernier, D. Haziza, K. Nobrega, P. Whitridge, "Handling Missing Data – Case Study", Statistical Society of Canada.
D. Garson, "Data Imputation for Missing Values"
Wednesday, November 4, 2009
Model deployment with Sipina
Tuesday, November 3, 2009
Sipina - Supported file format
The first goal of this tutorial is to describe the various file formats that are supported in Sipina. Some of them are described in more detail in other tutorials; in these cases, we indicate the appropriate reference. The second goal is to study the behavior of these formats when we handle a large dataset with 4,817,099 instances and 42 variables.
Finally, we build a decision tree on this dataset in order to evaluate the behavior of Sipina when it processes a large data file.
Keywords: file format, data file importation, decision tree, large dataset, csv, arff, fdm, fdz, zdm
Tutorial: en_Sipina_File_Format.pdf
Dataset: weather.txt and kdd-cup-discretized-descriptors.txt.zip
Saturday, October 31, 2009
Importing Weka file (.arff) into Sipina
The text file format is very simple and very easy to manipulate. On the other hand, processing this kind of file is often slow, slower than with a binary file format. For a moderate-size file, the text format is efficient enough: the differences in processing time are not noticeable.
In this tutorial, we show how to import the ARFF file format into Sipina. We subdivide the dataset into train and test samples, then we learn and assess a decision tree.
Keywords: decision tree, c4.5, file format, data file importation, weka, arff
Tutorial: en_sipina_weka_file_format.pdf
Dataset: ionosphere.arff
References:
M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. Witten, "The Weka Data Mining Software: An Update", SIGKDD Explorations, Vol. 11, Issue 1, 2009.
Wednesday, October 28, 2009
Local sampling for decision tree learning
For all the decision tree algorithms, Sipina can use a local sampling option when it searches for the best splitting attribute on a node. The idea is the following: on a node, it draws a random sample of size n, and all the computations are then made on this sample. Of course, if the number of examples on the node is lower than n, Sipina uses all the available examples; this occurs when we have a very large tree with a high number of nodes.
We described this approach in a paper (Chauchat and Rakotomalala, IFCS-2000). In this tutorial, we show how to implement it with Sipina, and we note that using a sample on each node dramatically reduces the execution time without loss of accuracy.
We use a version of the WAVEFORM dataset with 21 continuous descriptors and 2,000,000 instances. We obtain the tree in 3 seconds on our computer.
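To fix ideas, here is a purely illustrative R sketch of the node-level sampling principle (this is not Sipina's code, and the cap n = 5000 is an arbitrary choice):

```r
# Local sampling idea: before evaluating the candidate splits on a node,
# cap the node's examples at n by drawing a random sample.
node_sample <- function(node_rows, n = 5000) {
  if (length(node_rows) > n) {
    sample(node_rows, n)   # the split evaluation will use this sample only
  } else {
    node_rows              # small node: all the available examples are used
  }
}

rows <- node_sample(1:2e6)   # a root node with 2,000,000 examples
length(rows)                 # 5000
```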
Keywords: decision tree, sampling, large dataset
Components: SAMPLING, ID3, TEST
Tutorial: en_Sipina_Sampling.pdf
Dataset: wave2M.zip
References:
J.H. Chauchat, R. Rakotomalala, "A new sampling strategy for building decision trees from large databases", Proc. of IFCS-2000, pp. 199-204, 2000.
Saturday, October 3, 2009
Tanagra - Version 1.4.33
New tools for checking the results of the logistic regression have been added:
1. The estimated covariance matrix
2. The Hosmer-Lemeshow test
3. The reliability diagram (also known as calibration plot)
4. The analysis of residuals, outliers and influential points (Pearson residuals, deviance residuals, difchisq, difdev, leverage, Cook's distance, dfbeta, dfbetas)
A tutorial describing the utilization of these tools will be available soon.
Monday, September 28, 2009
Using batch mode for Tanagra
In this tutorial, we want to compare the performances of the naïve bayes classifier with and without the feature selection process. We know that the naïve bayes classifier is highly sensitive to irrelevant features. The goal of this tutorial is to evaluate the efficiency of the FCBF feature selection method in this context.
Keywords: batch mode, supervised learning, naive bayes, feature selection, experiments
Components: NAIVE BAYES, FCBF, CROSS VALIDATION
Tutorial: english_dr_utiliser_tanagra_en_mode_batch.pdf
Dataset: tanagra_batch_execution.zip
Wednesday, July 15, 2009
Nonparametric tests for groups comparison - Independent samples - Differences in location
Nonparametric tests make no assumption about the distribution of the data; they are also called "distribution free" tests.
In this tutorial, we show how to implement nonparametric homogeneity tests for differences in location for K = 2 populations, i.e. the distributions of the populations are the same except for a shift in location (central tendency). The Kolmogorov-Smirnov test is the most general one: it detects any kind of difference between the cumulative distribution functions (CDF). Afterwards, we can implement other tests which characterize the difference more precisely. The Wilcoxon-Mann-Whitney test is certainly the most popular one, but we will see in this tutorial that other tests can also be implemented.
Some of the tests introduced here can also be used when the number of groups is greater than 2 (K > 2).
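A minimal R counterpart for the K = 2 case:

```r
# Two groups with the same shape but a shift in location.
set.seed(1)
a <- rnorm(30, mean = 0)
b <- rnorm(30, mean = 0.8)

ks.test(a, b)       # Kolmogorov-Smirnov: any difference between the CDFs
wilcox.test(a, b)   # Wilcoxon-Mann-Whitney: difference in location
```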
Keywords: nonparametric test, Kolmogorov-Smirnov test, Wilcoxon-Mann-Whitney test, Van der Waerden test, Fisher-Yates-Terry-Hoeffding test, median test, location model
Components: FYTH 1-WAY ANOVA, K-S 2-SAMPLE TEST, MANN-WHITNEY COMPARISON, MEDIAN TEST, VAN DER WAERDEN 1-WAY ANOVA
Tutorial: en_Tanagra_Nonparametric_Test_MW_and_related.pdf
Dataset: machine_packs_cartons.xls
References:
R. Rakotomalala, "Comparaison de populations. Tests non paramétriques", Université Lyon 2 (in French).
Wikipedia, "Non-parametric statistics".
Thursday, July 9, 2009
Resampling methods for error estimation
In the small-sample context, it is preferable to implement resampling approaches for error rate estimation. In this tutorial, we study the behavior of cross-validation (cv), leave-one-out (lvo) and the bootstrap (boot). All of them are based on a repeated train-test process, but in different configurations. We keep in mind that the aim is to evaluate the error rate of the classifier built on the whole sample; the intermediate classifiers computed during each learning session are not really interesting in themselves. This is the reason why they are rarely provided by data mining tools.
The main supervised learning method used is linear discriminant analysis (LDA). We will see at the end of this tutorial that the behavior observed for this learning approach is not the same if we use another one, such as a decision tree learner (C4.5).
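As a rough sketch of the procedure, here is a 10-fold cross-validation of an LDA classifier in R, on a built-in dataset:

```r
# 10-fold cross-validation estimate of the error rate of an LDA classifier.
library(MASS)   # for lda()
set.seed(1)
fold <- sample(rep(1:10, length.out = nrow(iris)))

errs <- sapply(1:10, function(k) {
  fit  <- lda(Species ~ ., data = iris[fold != k, ])   # train on 9 folds
  pred <- predict(fit, iris[fold == k, ])$class        # predict the held-out fold
  mean(pred != iris$Species[fold == k])                # fold error rate
})
mean(errs)   # cross-validation estimate of the generalization error rate
```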
Keywords: resampling, generalization error rate, cross validation, bootstrap, leave one out, linear discriminant analysis, C4.5
Components: Supervised Learning, Cross-validation, Bootstrap, Test, Leave-one-out, Linear discriminant analysis, C4.5
Tutorial: en_Tanagra_Resampling_Error_Estimation.pdf
Dataset: wave_ab_err_rate.zip
Reference:
"What are cross validation and bootstrapping?"
Sunday, July 5, 2009
Implementing SVM on large dataset
SVM is effective in domains with a very high number of predictive variables, when the ratio between the number of variables and the number of observations is unfavorable. In this tutorial, we are in a domain which is particularly favorable to SVM. We want to discriminate between two families of proteins from their description with amino acids, using sequences of 4 characters (4-grams) as descriptors. Thus, we have a large number of descriptors (31,809) in comparison to the number of examples (135 instances).
We compare Tanagra 1.4.27, Orange 1.0b2, Rapidminer Community Edition 4.2 and Weka 3.5.6.
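For reference, the corresponding computation in R with the LIBSVM-based e1071 package might look as follows (the matrix X of 4-gram indicators is faked here; building the real one from the protein sequences is the subject of the tutorial):

```r
# Linear SVM with built-in 10-fold cross-validation on a wide dataset.
library(e1071)
set.seed(1)
X <- matrix(rbinom(135 * 500, 1, 0.05), nrow = 135)   # stand-in: 135 obs, 500 4-grams
y <- factor(rep(c("family1", "family2"), length.out = 135))

model <- svm(X, y, kernel = "linear", cost = 1, cross = 10)
model$tot.accuracy   # cross-validation accuracy (%)
```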
Keywords: svm, support vector machine
Components: C-SVC, SVM, SUPERVISED LEARNING, CROSS-VALIDATION
Tutorial: en_Tanagra_Perfs_Comp_SVM.pdf
Dataset: wide_protein_classification.zip
Reference:
Wikipedia (en), "Support vector machine"
Wednesday, July 1, 2009
Self-organizing map (SOM)
In this tutorial, we show how to implement the Kohonen's SOM algorithm with Tanagra. We try to assess the properties of this approach by comparing the results with those of the PCA algorithm. Then, we compare the results to those of K-Means, which is a clustering algorithm. Finally, we implement the Two-step Clustering process by combining the SOM algorithm with the HAC process (Hierarchical Agglomerative Clustering). It is a variant of the Two-Step Clustering where we combine K-Means and HAC. We observe that the HAC primarily merges the adjacent cells.
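A minimal R sketch of the SOM-then-HAC combination, with the kohonen package as a stand-in for Tanagra's KOHONEN-SOM component:

```r
# Two-step clustering: a Kohonen map, then HAC on the cell prototypes.
library(kohonen)
X <- scale(as.matrix(iris[, 1:4]))    # stand-in for the waveform variables

sm    <- som(X, grid = somgrid(4, 4, "hexagonal"))   # 4 x 4 Kohonen map
proto <- sm$codes[[1]]                               # one prototype per cell (kohonen >= 3.0)

tree <- hclust(dist(proto), method = "ward.D2")      # HAC on the prototypes
cell_cluster <- cutree(tree, k = 3)                  # merges mostly adjacent cells
obs_cluster  <- cell_cluster[sm$unit.classif]        # cluster of each observation
table(obs_cluster)
```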
Keywords: Kohonen, self organizing map, SOM, clustering, dimensionality reduction, k-means, hierarchical agglomerative clustering, hac, two-step clustering
Components: UNIVARIATE CONTINUOUS STAT, UNIVARIATE OUTLIER DETECTION, KOHONEN-SOM, PRINCIPAL COMPONENT ANALYSIS, SCATTERPLOT, K-MEANS, CONTINGENCY CHI-SQUARE, HAC
Tutorial: en_Tanagra_Kohonen_SOM.pdf
Dataset: waveform_unsupervised.xls
Reference:
Wikipedia, "Self organizing map", http://en.wikipedia.org/wiki/Self-organizing_map
Monday, June 29, 2009
Univariate outlier detection methods
Outliers can be detected on one variable (a man 158 years old) or on a combination of variables (a 12-year-old boy who runs the 100 yards in 10 seconds). In this tutorial, we show how to use the UNIVARIATE OUTLIER DETECTION component. It is intended for univariate outlier detection, i.e. it considers the variables individually.
The approaches implemented in the component come from the NIST website (see reference). We also use an additional rule based on the x-sigma deviation from the mean of the variable.
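Both rules are easy to reproduce in R; a minimal sketch, with x = 3 for the sigma rule:

```r
# Univariate outlier detection: box-plot rule and 3-sigma rule.
x <- c(rnorm(100, mean = 25, sd = 4), 158)   # the "man 158 years old"

q   <- quantile(x, c(0.25, 0.75))
iqr <- diff(q)
box_out   <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr   # 1.5 x IQR rule
sigma_out <- abs(x - mean(x)) > 3 * sd(x)                  # 3-sigma rule

x[box_out]     # values flagged by the box-plot rule
x[sigma_out]   # values flagged by the sigma rule
```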
Keywords: outlier, influential point
Components: MORE UNIVARIATE CONT STAT, SCATTERPLOT WITH LABEL, UNIVARIATE OUTLIER DETECTION, UNIVARIATE CONT STAT
Tutorial: en_Tanagra_Outliers_Detection.pdf
Dataset: body_mass_index.xls
References:
NIST/SEMATECH, "e-Handbook of Statistical Methods", section 7.1.6, "What are outliers in the data?"
R. High, "Dealing with 'Outliers': How to Maintain Your Data's Integrity"
Copy paste feature into the diagram
When we define a data analysis process in Tanagra, it is possible to copy components (or entire branches of components) to another location in the diagram. This feature is very helpful when we have to repeat sequences of treatments in different parts of the diagram; the settings are duplicated as well.
In this tutorial, we show how to copy a component or a branch. We will see that this feature is helpful when, for instance, we compare the performance of several supervised learning algorithms on the same dataset. In this context, the processing sequence is always the same, only the method we want to evaluate changes.
Here we work within a single project; we cannot copy and paste components between two open projects. But, in another tutorial, we show how to save a part of the diagram in an external file, so that the same processing sequence can be applied to multiple datasets.
Keywords: copy paste, diagram management, comparison of classifiers, supervised learning, cross validation, dimensionality reduction
Components: Supervised learning, Binary logistic regression, C-PLS, C-SVC, Linear discriminant analysis, K-NN, Principal Component Analysis
Tutorial: en_Tanagra_Diagram_New_Features.pdf
Dataset: sonar.xls
Saturday, June 27, 2009
The A PRIORI MR component
We have already described the association rule mining tools of Tanagra in several tutorials. The A PRIORI approach is certainly the most popular. But, despite its good properties, this method has a drawback: the number of obtained rules can be very high. The ability to highlight the most interesting rules, those which are really relevant, becomes a major challenge.
In this tutorial, we show how to implement the A PRIORI MR component. It differs from the other components by offering additional tools for exploring and assessing the mined rules: original measures based on the "test value" principle evaluate the rules from a different point of view; the ability to copy the results into a spreadsheet allows a more detailed exploration of the rule base; and by subdividing the dataset into train and test sets, we obtain more reliable values of the interestingness measures of the rules.
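As a point of comparison, classical A PRIORI mining (without the additional measures of A PRIORI MR) can be run in R with the arules package; note how quickly the number of rules grows:

```r
# Classical association rule mining with the arules package.
library(arules)
data(Groceries)   # example transaction database shipped with arules

rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))
length(rules)                               # the rule base can be very large...
inspect(head(sort(rules, by = "lift"), 5))  # ...hence the need for good measures
```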
Keywords: association rule, a priori algorithm, interestingness measure, test value principle
Components: A PRIORI MR
Tutorial: en_Tanagra_APrioriMR_Component.pdf
Dataset: credit_assoc.xls
Reference:
Wikipedia, "Association rule learning"
Sunday, June 14, 2009
Two-step clustering for handling large databases
The implementation of two-step clustering (also called "hybrid clustering") with Tanagra has already been described elsewhere. Following the recommendation of Lebart et al. (2000), we perform the clustering algorithm on the latent variables supplied by a PCA (Principal Component Analysis) computed from the original variables. This pre-treatment cleans the dataset by removing irrelevant information such as noise. In this tutorial, we show the efficiency of the approach on a large dataset with 500,000 observations and 68 variables. We use Tanagra 1.4.27 and R 2.7.2, which are the only tools that allow the whole process to be implemented easily.
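The skeleton of the process in R looks like this (a small built-in dataset replaces the census extract; the numbers of pre-clusters and final clusters are arbitrary):

```r
# Two-step clustering: PCA, then K-Means, then HAC on the pre-cluster centers.
X <- scale(as.matrix(iris[, 1:4]))

Z  <- prcomp(X)$x[, 1:2]                   # latent variables from the PCA (denoising)
km <- kmeans(Z, centers = 50, nstart = 5)  # many small pre-clusters

tree <- hclust(dist(km$centers), method = "ward.D2")  # HAC on the centers only
grp_centers <- cutree(tree, k = 3)
grp <- grp_centers[km$cluster]             # final cluster of each observation
table(grp)
```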
Keywords: clustering, hierarchical cluster analysis, HCA, k-means, principal component analysis, PCA
Components: PRINCIPAL COMPONENT ANALYSIS, K-MEANS, HAC, GROUP CHARACTERIZATION, EXPORT DATASET
Tutorial: en_Tanagra_CAH_Mixte_Gros_Volumes.pdf
Dataset: sample-census.zip
References:
L. Lebart, A. Morineau, M. Piron, "Statistique Exploratoire Multidimensionnelle", Dunod, 2000; chapter 2, sections 2.3 and 2.4.
D. Garson, "Cluster Analysis", from North Carolina State University.
Thursday, June 11, 2009
K-Means - Comparison of free tools
The K-Means approach is already described in several tutorials (http://data-mining-tutorials.blogspot.com/search?q=k-means). The goal here is to compare its implementation with various free tools. We study the following tools: Tanagra 1.4.28; R 2.7.2 without additional package; Knime 1.3.5; Orange 1.0b2 and RapidMiner Community Edition.
Keywords: clustering, k-means, PCA, principal component analysis, MDS, multidimensional scaling
Components: PRINCIPAL COMPONENT ANALYSIS, K-MEANS, GROUP CHARACTERIZATION, EXPORT DATASET
Tutorial: en_Tanagra_et_les_autres_KMeans.pdf
Dataset: cars_dataset.zip
Reference:
D. Garson, "Cluster Analysis"
Saturday, May 30, 2009
Understanding the "test value" criterion
The principle is elementary: we compare the value of a descriptive statistical indicator computed on the whole sample with its value computed on the subsample related to the group. For a continuous variable, we compare the means; for a discrete one, we compare the proportions.
Despite, or because of, its simplicity, the test value (VT) is very useful. The formulation that we present in this tutorial is taken from the book by Lebart et al. (2000). The VT is intensively used in some commercial software such as SPAD (http://eng.spad.eu/). It allows groups to be characterized, but it can also be used to strengthen the interpretation of the factors extracted by a factor analysis process.
In this tutorial, we detail the formulas used for both categorical and continuous variables, and we relate them to the results provided by the GROUP CHARACTERIZATION component of Tanagra.
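For the continuous case, here is a hedged reconstruction of the computation in R (the tutorial gives the exact formulation from Lebart et al.):

```r
# Test value of a continuous variable for a group, following Lebart et al.
test_value <- function(x, in_group) {
  n  <- length(x)
  ng <- sum(in_group)
  s2 <- var(x) * (n - 1) / n    # empirical variance of the whole sample
  (mean(x[in_group]) - mean(x)) / sqrt((n - ng) / (n - 1) * s2 / ng)
}

# Example: is Sepal.Length unusually high within the virginica group?
test_value(iris$Sepal.Length, iris$Species == "virginica")
```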
Keywords: test value, group characterization, clustering, factorial analysis
Components: Group characterization
Tutorial: en_Tanagra_Comprendre_La_Valeur_Test.pdf
Dataset: heart_disease_male.xls
Reference:
L. Lebart, A. Morineau, M. Piron, "Statistique exploratoire multidimensionnelle", Dunod, 2000; pages 181 to 184.
Friday, May 29, 2009
Descriptive statistics (continued)
In this tutorial, we distinguish two kinds of descriptive approaches: the univariate tools which summarize the characteristics of a variable individually; the bivariate tools which characterize the association between two variables. According to the type of the variables (categorical or continuous), we use different indicators.
Keywords: descriptive statistics
Components: UNIVARIATE DISCRETE STAT, CONTINGENCY CHI-SQUARE, UNIVARIATE CONTINUOUS STAT, SCATTERPLOT, LINEAR CORRELATION, GROUP CHARACTERIZATION
Tutorial: en_Tanagra_Descriptive_Statistics.pdf
Dataset: enquete_satisfaction_femmes_1953.xls
References:
Tanagra Tutorials, "Descriptive statistics"
Friday, May 1, 2009
ID3 on a large dataset
Commercial tools often have very efficient data management systems which limit the amount of data loaded into memory at each step of the treatment. Research tools, on the other hand, keep all the data in memory; the limit is then clearly the memory capacity of the machine. This is certainly a drawback for the treatment of large files. We note however that very powerful computers are available at low cost nowadays, so this limit keeps being pushed back. With an appropriate encoding strategy, we can fit the whole dataset in memory, even when we handle a large data file.
In this tutorial, we show how to import a file with 581,012 observations and 55 variables, and then how to build a decision tree with the ID3 method. Unlike other decision tree algorithms such as C4.5 or CART, the right size of the tree is determined with a pre-pruning rule. We will see that the computation is fast because of this characteristic.
Keywords: large dataset, decision tree algorithm, ID3
Components: ID3, SPV LEARNING
Tutorial: en_Tanagra_Big_Dataset.pdf
Dataset: covtype.zip
References:
Tanagra tutorials, "Performance comparison under Linux"
Tanagra Tutorials, "Decision tree and large dataset"
Thursday, April 30, 2009
Principal Component Analysis (PCA)
In this tutorial, we show how to implement this approach and how to interpret the results with Tanagra. We use the AUTOS_ACP.XLS dataset from Saporta's authoritative book. The interest of this dataset is that we can compare our results with those described in the book (pages 177 to 181). We simply show the sequence of operations and how to read the result tables; for the detailed interpretation, it is best to refer to the book.
Keywords: factor analysis, principal component analysis, correlation circle
Components: Principal Component Analysis, View Dataset, Scatterplot with labels, View multiple scatterplot
Tutorial: en_Tanagra_Acp.pdf
Dataset: autos_acp.xls
References:
G. Saporta, "Probabilités, Analyse de données et Statistique", Dunod, 2006; pages 177 to 181.
D. Garson, "Factor Analysis".
Statsoft Textbook, "Principal components and factor analysis".
Multiple Correspondence Analysis (MCA)
In this tutorial, we show how to implement this approach and how to interpret the results with Tanagra. The opportunity to copy/paste the results into a spreadsheet is certainly one of the most interesting functionalities of the software. Indeed, it gives us access to tools (sorting, formatting, etc.) in an environment well known to data analysts. For example, the possibility of sorting the various tables according to the contributions and the COS2 proves really practical when we wish to interpret the dimensions.
Keywords: factor analysis, multiple correspondence analysis
Components: Multiple correspondence analysis, View Dataset, Scatterplot with labels, View multiple scatterplot
Tutorial: en_Tanagra_Acm.pdf
Dataset: races_canines_acm.xls
References:
M. Tenenhaus, "Méthodes statistiques en gestion", Dunod, 1996; pages 212 to 222 (in French).
Statsoft Inc., "Multiple Correspondence Analysis".
D. Garson, "Statnotes - Correspondence Analysis".
Sunday, April 26, 2009
Support Vector Regression (SVR)
The method is not widely known among statisticians, yet it combines qualities that rank it favorably compared with existing techniques. It behaves well even if the ratio between the number of variables and the number of observations becomes very unfavorable, or with highly correlated predictors. Another advantage is the kernel principle (the famous "kernel trick"): it is possible to build a non-linear model without explicitly having to produce new descriptors. A deeper study of the characteristics of the method allows comparisons with penalized regression such as ridge regression.
The first subject of this tutorial is to show how to use the two new SVR components of the 1.4.31 version of Tanagra. They are based on the famous LIBSVM library, the same library we use for classification (see the C-SVC component). We compare our results to those of the R software (version 2.8.0), using the e1071 package, which is also based on the LIBSVM library.
The second subject is to propose a new assessment component for regression. In the supervised learning framework, it is usual to split the dataset into two parts, the first for the learning process and the second for its evaluation, in order to obtain an unbiased estimation of the performance. We can implement the same approach for regression. The procedure is even essential when we try to compare models of various complexities (or various degrees of freedom). We will see in this tutorial that the usual indicators calculated on the learning data are highly misleading in certain situations; we must use an independent test set when we want to assess a model.
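A minimal sketch of the R side of the comparison, with the train/test assessment advocated above (a built-in dataset stands in for the QSAR data):

```r
# Epsilon-SVR with e1071 (LIBSVM), assessed on an independent test set.
library(e1071)
set.seed(1)
idx   <- sample(nrow(cars), 35)
train <- cars[idx, ]
test  <- cars[-idx, ]

model <- svm(dist ~ speed, data = train,
             type = "eps-regression", kernel = "radial")
pred  <- predict(model, test)
sqrt(mean((test$dist - pred)^2))   # test RMSE, computed on unseen data
```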
Keywords: support vector regression, support vector machine, regression, linear regression, regression assessment, R software, package e1071
Components: MULTIPLE LINEAR REGRESSION, EPSILON SVR, NU SVR, REGRESSION ASSESSMENT
Tutorial: en_Tanagra_Support_Vector_Regression.pdf
Dataset: qsar.zip
References:
C.C. Chang, C.J. Lin, "LIBSVM - A Library for Support Vector Machines".
S. Gunn, "Support Vector Machine for Classification and Regression", Technical Report of the University of Southampton, 1998.
A. Smola, B. Scholkopf, "A tutorial on Support Vector Regression", 2003.
Thursday, April 23, 2009
Launching Tanagra from OOo Calc under Linux
The add-on for OOCalc was initially created for Windows. Recently, I described the installation and the use of Tanagra under Linux; the next step is of course the integration of Tanagra into OOCalc under Linux. Mr. Thierry Leiber did this work for the 1.4.31 version of Tanagra by extending the existing add-on. We can now launch Tanagra from OOCalc under both Windows and Linux. The add-on was tested under the following configurations: Windows XP + OOCalc 3.0.0; Windows Vista + OOCalc 3.0.1; Ubuntu 8.10 + OOCalc 2.4; Ubuntu 8.10 + OOCalc 3.0.1.
This document extends a previous tutorial, but we now work in the Linux environment (Ubuntu 8.10). All the screenshots are in French because my OS is in French, but I think the process is the same for other language configurations.
Keywords: open office calc, add-on, principal component analysis, PCA, correlation circle, illustrative variable, linux, ubuntu 8.10 intrepid ibex
Components: PRINCIPAL COMPONENT ANALYSIS, CORRELATION SCATTERPLOT
Tutorial: en_Tanagra_OOCalc_under_Linux.pdf
Dataset: cereals.xls
References:
Tanagra, "Connection with Open Office Calc"
Tanagra, "Tanagra under Linux"
Wednesday, April 15, 2009
Tanagra - Version 1.4.31
Following a suggestion from Mr. Laurent Bougrain, the confusion matrix is added to the automatic saving of results in experiments. Thank you to Laurent, and to all the others who, by their constructive comments, help me improve Tanagra in the right direction.
In addition, two new components for regression using the support vector machine principle (support vector regression) have been added: Epsilon-SVR and Nu-SVR. A tutorial presenting these methods and comparing our results with the R software will be available soon. Both Tanagra and the R package e1071 are based on the famous LIBSVM library.
Tutorials about these releases are coming soon.
Thursday, March 19, 2009
Cost-sensitive learning - Comparison of tools
Using the misclassification cost during the classifier evaluation is easy. We make a cross-product between the misclassification cost matrix and the confusion matrix. We obtain an "expected misclassification cost" (or an expected gain if we multiply the result by -1). Its interpretation is not very easy. It is mainly used for the comparison of models.
Handling costs during the learning process is less usual. Several approaches are possible. In this tutorial, we show how to use some components of Tanagra intended for cost-sensitive supervised learning on a realistic dataset. We also programmed the same procedures in the R software (http://www.r-project.org/) to give better visibility into what is implemented, and we compare our results with those of Weka. The algorithm underlying our analysis is a decision tree; depending on the software, we use C4.5, CART or J48.
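A sketch of both sides in R with rpart (the cost matrix here is arbitrary; the tutorial uses the real costs of the case study):

```r
# Cost-sensitive decision tree and expected misclassification cost with rpart.
library(rpart)
data(kyphosis)
costs <- matrix(c(0, 1, 5, 0), nrow = 2)   # rows = true class, columns = predicted

# Inject the costs into the learning itself through the loss matrix.
fit  <- rpart(Kyphosis ~ ., data = kyphosis, parms = list(loss = costs))
pred <- predict(fit, kyphosis, type = "class")

# Evaluate: cross-product of the confusion matrix and the cost matrix.
conf <- table(true = kyphosis$Kyphosis, predicted = pred)
sum(conf * costs) / sum(conf)              # expected misclassification cost
```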
Keywords: supervised learning, cost sensitive learning, misclassification cost matrix, decision tree algorithm, Weka 3.5.8, R 2.8.0, rpart package
Tutorial: en_Tanagra_Cost_Sensitive_Learning.pdf
Dataset: dataset-dm-cup-2007.zip
References:
J.H. Chauchat, R. Rakotomalala, M. Carloz, C. Pelletier, "Targeting Customer Groups using Gain and Cost Matrix: a Marketing Application", PKDD-2001.
"Cost-sensitive Decision Tree", Tutorials for Sipina.
Thursday, February 26, 2009
Predictive association rules
Basically, the algorithm is not really modified: the exploration is simply limited to the itemsets that include the dependent variable, which reduces the computation time. Two components of Tanagra are dedicated to this task, SPV ASSOC RULE and SPV ASSOC TREE, both available in the Association tab. Compared to conventional approaches, the components of Tanagra introduce an additional specificity: we can specify the class value ("dependent variable = value") that we wish to predict. The interest is to finely tune the parameters of the algorithm according to the characteristics of the data. This is crucial, for example, when the prior probabilities of the dependent variable values are very different.
We had already presented the SPV ASSOC TREE component elsewhere, but in the context of the multivariate characterization of groups of individuals (obtained from a clustering algorithm for instance), where we compared it to the GROUP CHARACTERIZATION component. In this tutorial, we compare the behavior of SPV ASSOC TREE and SPV ASSOC RULE in a prediction task. We put forward their shared properties, the problems they can handle, and their differences. SPV ASSOC RULE, which supplies original rule interestingness measures (the "test value" indicator), has the ability to simplify the rule base.
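The same restriction can be mimicked in R with the arules package, by constraining the consequent of the rules to the chosen class value:

```r
# Predictive association rules: the consequent is forced to one class value.
library(arules)
data(Adult)   # ready-made transactions shipped with arules

rules <- apriori(Adult,
                 parameter  = list(supp = 0.05, conf = 0.6),
                 appearance = list(rhs = "income=small", default = "lhs"))
inspect(head(sort(rules, by = "confidence")))
```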
Keywords: predictive association rules, interestingness measure, rule base ranking, rule base simplification
Components: SPV ASSOC TREE, SPV ASSOC RULE
Tutorial: en_Tanagra_Predictive_AssocRules.pdf
Dataset: credit_assoc.xls
Sunday, February 22, 2009
Interestingness measures for association rules
A measure characterizes the relevance of a rule. It can be used to rank the rules; it should also help to discern those that are "significantly interesting" from those that are irrelevant. This last point is still entirely prospective: there is no really satisfactory solution at this time.
The A PRIORI MR and SPV ASSOC RULE components are experimental tools for the evaluation of the rules extracted by the association rule induction algorithm. They evaluate the rules using measures based on the test value principle.
Keywords: association rules, interestingness measure, test value
Components: A PRIORI MR, SPV ASSOC RULE
Tutorial: en_Tanagra_APrioriMR_Measures.pdf
References:
R. Rakotomalala, A. Morineau, "The TVpercent principle for the counterexamples statistic", in Statistical Implicative Analysis, Studies in Computational Intelligence Series, 127, 449-462, Springer, 2008 -- http://www.springerlink.com/content/g245317206950529/
Wikipedia, "Association rule learning"
Tuesday, January 27, 2009
Performance comparison under Linux
The main idea is to build a chart where the X axis shows the percentage of the population and the Y axis the percentage of the positive values of the class attribute. The gain chart is mainly used in the marketing domain, where we want to detect potential customers, but it can be used in other situations.
The construction of the gain chart has already been outlined in a previous tutorial (see http://data-mining-tutorials.blogspot.com/2008/11/lift-curve-coil-challenge-2000.html). In this tutorial, we extend the description to other data mining tools (Knime, RapidMiner, Weka and Orange). The second originality of this tutorial is that we run the experiment under Linux (the French version of Ubuntu 8.10 – see http://data-mining-tutorials.blogspot.com/2009/01/tanagra-under-linux.html for the installation and the use of Tanagra under Linux). The third originality is that we handle a large dataset with 2,000,000 examples and 41 variables. It is very interesting to study the behavior of these tools in this configuration, especially because our computer is not really powerful. We note that some tools failed to complete the analysis on the full dataset.
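The construction itself is simple; a minimal R sketch on simulated scores:

```r
# Cumulative gain chart: sort by decreasing score, then cumulate the positives.
set.seed(1)
y     <- rbinom(1000, 1, 0.1)               # class attribute (1 = positive)
score <- y + rnorm(1000)                    # any classifier's score

o     <- order(score, decreasing = TRUE)
x_pct <- (1:1000) / 1000                    # percent of the population targeted
y_pct <- cumsum(y[o]) / sum(y)              # percent of positives recovered

plot(x_pct, y_pct, type = "l", xlab = "% population", ylab = "% positives")
abline(0, 1, lty = 2)                       # baseline: random targeting
```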
Keywords: scoring, linear discriminant analysis, naive bayes classifier, lift curve, gain chart, cumulative gain chart, knime, rapidminer, weka, orange
Components: SAMPLING, LINEAR DISCRIMINANT ANALYSIS, SCORING, LIFT CURVE
Tutorial: en_Tanagra_Gain_Chart.pdf
Dataset: dataset_gain_chart.zip
Saturday, January 24, 2009
Sipina under Linux
In this tutorial, we implement the following steps: (1) installing Sipina under Linux; (2) launching the software; (3) loading a dataset (text file with tab separator); (4) choosing the class attribute and the predictive variables; (5) partitioning the dataset into a train set and a test set; (6) computing the tree on the train set; (7) evaluating the tree on the test set, e.g. computing the confusion matrix, the error rate, etc.; (8) exploring a subpopulation related to a node of the tree; (9) launching a new analysis on a subpopulation related to a node of the tree.
We quickly describe the various features of the software in this tutorial; they are already presented in several documents available online (http://eric.univ-lyon2.fr/~ricco/sipina.html, see the DOWNLOAD section). Our main goal here is to show the capabilities of Sipina under Linux.
We use the French Ubuntu 8.10 distribution, on which we have also installed Wine, a program which allows Windows programs to run under Linux.
Keywords: linux, ubuntu, wine, sipina, decision tree
Tutorial: en_Sipina_under_Linux.pdf
References:
Ubuntu, http://www.ubuntu.com/
Wine, https://help.ubuntu.com/community/Wine
Tuesday, January 13, 2009
Tanagra under Linux
NO, we cannot execute Tanagra natively under Linux: it is a 32-bit program for Windows.
But YES, we can run Tanagra under Linux using WINE, a famous Linux application which allows us to run Windows programs on Linux. We can then take full advantage of Tanagra without worrying about compatibility issues.
In this tutorial, we show how to install and run Tanagra under Ubuntu (a free Linux distribution) using WINE. We can fully use Tanagra in the Linux environment.
Keywords: linux, ubuntu, wine
Tutorial: en_Tanagra_under_Linux.pdf
References:
Ubuntu, http://www.ubuntu.com/
Wine, https://help.ubuntu.com/community/Wine