Tanagra - Data Mining and Data Science Tutorials: July 2009

Wednesday, July 15, 2009

Nonparametric tests for groups comparison - Independent samples - Differences in location

The aim of homogeneity test (or test for difference between groups) is to check if K (K >= 2) samples are drawn from the same population according to a variable of interest. In another words, we check if the probability distribution is the same in each sample.

The nonparametric tests make no assumptions about the distribution of the data. They are called also "distribution free" tests.

In this tutorial, we show how to implement nonparametric homogeneity tests for differences in location for K = 2 populations i.e. the distributions of the populations are the same excepting a shift in location (central tendency). The Kolmogorov-Smirnov test is the more general one. It checks all kind of differences between the cumulative distribution functions (CDF). Afterwards, we can implement other tests which characterize more deeply the difference. The Wilcoxon-Mann-Whitney test is certainly the most popular one. We will see in this tutorial that other tests can be also implemented.

Some the tests introduced here are usable when the number of groups is upper than 2 (K > 2).

Keywords: nonparametric test, Kolmogorov-Smirnov test, Wilcoxon-Mann-Whitney test, Van der Waerden test, Fisher-Yates-Terry-Hoeffding test, median test, location model
Components: FYTH 1-WAY ANOVA, K-S 2-SAMPLE TEST, MANN-WHITNEY COMPARISON, MEDIAN TEST, VAN DER WAERDEN 1-WAY ANOVA
Tutorial: en_Tanagra_Nonparametric_Test_MW_and_related.pdf
Dataset: machine_packs_cartons.xls
References:
R. Rakotomalala, « Comparaison de populations. Tests non paramétriques », Université Lyon 2 (in french).
Wikipedia, « Non-parametric statistics ».

Thursday, July 9, 2009

Resampling methods for error estimation

The ability to predict correctly is one of the most important criteria to evaluate classifiers in supervised learning. The preferred indicator is the error rate (1 - accuracy rate). It states the probability of misclassification of a classifier. In most cases we do not know the true error rate because we do not have the whole population and we do not know the probability distribution of the data. So we need to compute estimation from the available dataset.

In the small sample context, it is preferable to implement the resampling approaches for error rate estimation. In this tutorial, we study the behavior of the cross validation (cv), leave one out (lvo) and bootstrap (boot). All of them are based on the repeated train-test process, but in different configurations. We keep in mind that the aim is to evaluate the error rate of the classifier created on the whole sample. Thus, the intermediate classifiers computed on each learning session are not really interesting. This is the reason for which they are rarely provided by the data mining tools.

The main supervised learning method used is the linear discriminant analysis (LDA). We will see at the end of this tutorial that the behavior observed for this learning approach is not the same if we use another approach such as a decision tree learner (C4.5).

Keywords: resampling, generalization error rate, cross validation, bootstrap, leave one out, linear discriminant analysis, C4.5
Components: Supervised Learning, Cross-validation, Bootstrap, Test, Leave-one-out, Linear discriminant analysis, C4.5
Tutorial: en_Tanagra_Resampling_Error_Estimation.pdf
Dataset: wave_ab_err_rate.zip
Reference:
"What are cross validation and bootstrapping?"

Sunday, July 5, 2009

Implementing SVM on large dataset

Support vector machines (SVM) are a set of related supervised learning methods used for classification and regression. Our aim is to compare various free implementation of SVM, in terms of accuracy and computation time. Indeed, because the heuristic nature of the algorithm, we can obtain different results according to the used tools on the same dataset. In fact, in the publications describing the performance of SVM, we should not only specify the parameters of the algorithm but also indicate what is the tool used. This latter can influence the results.

SVM is effective in domains with very high number of predictive variables, when the ratio between the number of variables and the number of observations is unfavorable. We are in a domain which is particularly favorable to SVM in this tutorial. We want to discriminate two families of proteins from their description with amino acids. We use sequence of 4 characters (4-grams) as descriptors. Thus, we can have a large number of descriptors (31,809) in comparison to the number of examples (135 instances).

We compare Tanagra 1.4.27, Orange 1.0b2, Rapidminer Community Edition 4.2 and Weka 3.5.6.

Keywords: svm, support vector machine
Components: C-SVC, SVM, SUPERVISED LEARNING, CROSS-VALIDATION
Tutorial: en_Tanagra_Perfs_Comp_SVM.pdf
Dataset: wide_protein_classification.zip
Reference:
Wikipedia (en), « Support vector machine »

Wednesday, July 1, 2009

Self-organizing map (SOM)

A self-organizing map (SOM) or self-organizing feature map (SOFM) is a kind of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map. Self-organizing maps are different than other artificial neural networks in the sense that they use a neighborhood function to preserve the topological properties of the input space.

In this tutorial, we show how to implement the Kohonen's SOM algorithm with Tanagra. We try to assess the properties of this approach by comparing the results with those of the PCA algorithm. Then, we compare the results to those of K-Means, which is a clustering algorithm. Finally, we implement the Two-step Clustering process by combining the SOM algorithm with the HAC process (Hierarchical Agglomerative Clustering). It is a variant of the Two-Step Clustering where we combine K-Means and HAC. We observe that the HAC primarily merges the adjacent cells.

Keywords: Kohonen, self organizing map, SOM, clustering, dimensuionality reduction, k-means, hierarchical agglomerative clustering, hac, two-step clustering
Components: UNIVARIATE CONTINUOUS STAT, UNIVARIATE OUTLIER DETECTION, KOHONEN-SOM, PRINCIPAL COMPONENT ANALYSIS, SCATTERPLOT, K-MEANS, CONTINGENCY CHI-SQUARE, HAC
Tutorial: en_Tanagra_Kohonen_SOM.pdf
Dataset: waveform_unsupervised.xls
Reference:
Wikipedia, « Self organizing map », http://en.wikipedia.org/wiki/Self-organizing_map