Parametric hypothesis testing for comparison of two or more populations. Independent and dependent samples.

Tests for the comparison of populations try to determine whether K (K >= 2) samples come from the same underlying population with respect to a variable of interest (X). We speak of a parametric test when we assume that the data come from a given type of probability distribution; the inference then relies on the parameters of that distribution. For instance, if we assume that the distribution of the data is Gaussian, the hypothesis testing relies on the mean or on the variance.
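As a concrete illustration of a parametric test comparing two samples, here is a minimal pure-Python sketch of the t statistic with Welch's correction for unequal variances (the T-TEST UNEQUAL VARIANCE component covers this case); the function name and the sample values are purely illustrative, and this is not Tanagra's implementation.

```python
import math

def welch_t(sample_a, sample_b):
    """Two-sample t statistic with Welch's correction for unequal
    variances; returns (t, df) where df is the Welch-Satterthwaite
    approximation of the degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    ma = sum(sample_a) / na
    mb = sum(sample_b) / nb
    # Unbiased sample variances (divide by n - 1)
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    se2 = va / na + vb / nb          # squared standard error of the difference
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

group1 = [5.1, 4.9, 5.4, 5.0, 5.2]
group2 = [5.8, 6.1, 5.9, 6.0]
t, df = welch_t(group1, group2)
```

The statistic is then compared to the Student distribution with `df` degrees of freedom to obtain the p-value.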

We handle univariate tests in this tutorial, i.e. we have only one variable of interest. When we want to analyze several variables simultaneously, we speak of multivariate tests.

Keywords: t-test, F-Test, Bartlett's test, Levene's test, Brown-Forsythe's test, independent samples, dependent samples, paired samples, matched-pairs samples, anova, welch's anova, randomized complete blocks

Components: MORE UNIVARIATE CONT STAT, NORMALITY TEST, T-TEST, T-TEST UNEQUAL VARIANCE, ONE-WAY ANOVA, WELCH ANOVA, FISHER’S TEST, BARTLETT’S TEST, LEVENE’S TEST, BROWN-FORSYTHE TEST, PAIRED T-TEST, PAIRED V-TEST, ANOVA RANDOMIZED BLOCKS

Tutorial: en_Tanagra_Univariate_Parametric_Tests.pdf

Dataset: credit_approval.xls

References:

NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/ (Chapter 7, Product and Process Comparisons)

## Monday, November 30, 2009

## Thursday, November 26, 2009

### Three curves for classifier assessment

Evaluation of classifiers is an important step of the supervised learning process: we want to measure the performance of the classifier. On the one hand we have the confusion matrix and its associated indicators, very popular in academic publications. On the other hand, in real applications, users prefer curves which seem mysterious to people outside the domain (e.g. the ROC curve for epidemiologists, the gain chart or cumulative lift curve in marketing, the precision-recall curve in information retrieval, etc.).

In this tutorial, we first give the details of the calculation of these curves by building them "by hand" in a spreadsheet. Then, we use **Tanagra 1.4.33** and **R 2.9.2** to obtain them. We use these curves to compare the performances of logistic regression and a support vector machine (radial basis function kernel).

**Keywords**: roc curve, gain chart, precision recall curve, lift curve, logistic regression, support vector machine, svm, radial basis function kernel, rbf kernel, e1071 package, R software, glm
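The "by hand" construction of the ROC curve can be sketched in a few lines of Python: sort the cases by decreasing score, sweep the threshold down one case at a time, and accumulate the true-positive and false-positive rates. The scores and labels below are illustrative, and ties between scores are ignored for simplicity.

```python
def roc_points(scores, labels):
    """ROC curve points, computed the same way as in a spreadsheet:
    sort cases by decreasing score and sweep the threshold.
    Assumes distinct scores (ties are not grouped)."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]            # threshold above every score
    for _, y in pairs:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))  # (FPR, TPR)
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]   # e.g. posterior probabilities
labels = [1, 1, 0, 1, 0, 0]                # 1 = positive class
pts = roc_points(scores, labels)
```

The gain chart and the lift curve follow the same sweep, plotting the fraction of targeted cases against the recall instead of the false-positive rate.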

**Components**: DISCRETE SELECT EXAMPLES, BINARY LOGISTIC REGRESSION, SCORING, C-SVC, ROC CURVE, LIFT CURVE, PRECISION-RECALL CURVE

**Tutorial**: en_Tanagra_Spv_Learning_Curves.pdf

**Dataset**: heart_disease_for_curves.zip

Labels:
Software Comparison,
Supervised Learning

## Sunday, November 22, 2009

### Tanagra - Version 1.4.34

A component for the induction of predictive rules (RULE INDUCTION) was added under the "Supervised Learning" tab. Its use is described in a tutorial available online (it will be translated soon).

The DECISION LIST component has been improved: we changed the test performed during the pre-pruning process. The formula is described in the tutorial above.

The SAMPLING and STRATIFIED SAMPLING components (Instance Selection tab) have been slightly modified. It is now possible to set the seed of the pseudorandom number generator ourselves.
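The point of a user-chosen seed is reproducibility: the same seed always yields the same sample. A minimal Python sketch of stratified sampling with an explicit seed (the function and data below are illustrative, not Tanagra's code):

```python
import random

def stratified_sample(rows, stratum_of, fraction, seed):
    """Draw the same fraction from each stratum; a user-chosen seed
    makes the draw reproducible from one run to the next."""
    rng = random.Random(seed)        # user-controlled seed
    by_stratum = {}
    for row in rows:
        by_stratum.setdefault(stratum_of(row), []).append(row)
    sample = []
    for members in by_stratum.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

rows = [("a", i) for i in range(10)] + [("b", i) for i in range(20)]
s = stratified_sample(rows, lambda r: r[0], 0.5, seed=42)
```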

Following a remark from Anne Viallefont, the calculation of degrees of freedom in tests on contingency tables is now more generic. Indeed, the calculation was wrong when the database was filtered and some margins (row or column) contained a count equal to zero. Anne, thank you for this information. More generally, thank you to everyone who sent me comments. Programming has always been a kind of leisure for me. The real work starts when it is necessary to check the results, compare them with the available references, cross-check them with other data mining tools, free or not, understand the possible differences, etc. At this step, your help is really valuable.
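To make the fix concrete: for a chi-squared test of independence, empty rows or columns produced by filtering must not be counted when computing (r - 1)(c - 1). A small Python sketch of this idea, under the assumption that the fix simply skips zero margins:

```python
def chi2_degrees_of_freedom(table):
    """Degrees of freedom for a chi-squared test of independence,
    counting only rows and columns whose margin total is non-zero
    (empty margins can appear after the database is filtered)."""
    n_rows = sum(1 for r in table if sum(r) > 0)
    n_cols = sum(1 for j in range(len(table[0]))
                 if sum(r[j] for r in table) > 0)
    return (n_rows - 1) * (n_cols - 1)

# A 3x3 table whose last row and last column became empty after filtering:
table = [[10, 5, 0],
         [ 4, 6, 0],
         [ 0, 0, 0]]
```

The naive formula would give (3 - 1) * (3 - 1) = 4 degrees of freedom; the corrected computation gives (2 - 1) * (2 - 1) = 1.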


Labels:
Tanagra

## Monday, November 9, 2009

### Handling Missing values in SIPINA

Dealing with missing values is a difficult problem. The programming in itself is not an issue; we simply represent a missing value with a specific code. In contrast, the treatment before or during the data analysis is much more complicated.

Various techniques are available for handling missing values in SIPINA. In this tutorial, we show how to implement them, and what their consequences are in the context of decision tree learning (C4.5 algorithm; Quinlan, 1993).
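Two of the classical techniques from the keywords, listwise deletion and mean imputation, can be sketched in a few lines of Python (missing values coded as `None`; the data are illustrative, and this is not SIPINA's implementation):

```python
def listwise_deletion(rows):
    """Drop every case (row) that contains at least one missing value."""
    return [r for r in rows if None not in r]

def mean_imputation(rows, col):
    """Replace missing values in one numeric column by the mean of the
    observed values in that column."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    return [[mean if (j == col and v is None) else v
             for j, v in enumerate(r)]
            for r in rows]

data = [[1.0, 2.0],
        [None, 4.0],
        [3.0, None]]
kept = listwise_deletion(data)        # only the complete first case
filled = mean_imputation(data, 0)     # col 0 mean over observed = 2.0
```

Listwise deletion can discard a large share of the cases when missingness is spread across many variables, which is precisely why the imputation alternatives matter.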


**Keywords**: missing value, missing data, listwise deletion, casewise deletion, data imputation, C4.5, decision tree

**Tutorial**: en_Sipina_Missing_Data.pdf

**Dataset**: ronflement_missing_data.zip

**References**:
P.D. Allison, « Missing Data », in Quantitative Applications in the Social Sciences Series n°136, Sage University Paper, 2002.

J. Bernier, D. Haziza, K. Nobrega, P. Whitridge, « Handling Missing Data – Case Study », Statistical Society of Canada.

D. Garson, "Data Imputation for Missing Values"

Labels:
Decision tree,
Sipina

## Wednesday, November 4, 2009

### Model deployment with Sipina

Model deployment is the last step of the data mining process. In its simplest form, in a supervised learning task, it consists in applying a predictive model to unlabeled cases.

Applying the model to unseen cases is a very useful functionality. But it would be even more interesting if we could also state its accuracy. Indeed, a misclassification can have dramatic consequences. We must measure the risk we take when we make decisions from a predictive model. An indication of the performance of a classifier is important when deciding whether or not to deploy it.

In this tutorial, we show how to apply a classifier to an unlabeled sample with Sipina. We also show how to estimate the generalization error rate using a resampling scheme such as the bootstrap.
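The bootstrap idea can be sketched as follows: train on a resample drawn with replacement, measure the error on the out-of-bag cases left aside, and average over many resamples. The toy 1-nearest-neighbour classifier and the data below are illustrative only; they stand in for whatever model is deployed.

```python
import random

def oob_error(data, train_and_predict, n_boot=50, seed=1):
    """Bootstrap (out-of-bag) estimate of the generalization error rate:
    resample n cases with replacement, test on the cases left out."""
    rng = random.Random(seed)
    n = len(data)
    errs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        in_bag = set(idx)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:                      # rare: resample covered every case
            continue
        train = [data[i] for i in idx]
        wrong = sum(1 for i in oob
                    if train_and_predict(train, data[i][0]) != data[i][1])
        errs.append(wrong / len(oob))
    return sum(errs) / len(errs)

def one_nn(train, x):
    """Toy 1-nearest-neighbour classifier on one numeric attribute."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Two well-separated classes -> the estimated error rate should be low
data = ([(v, 0) for v in (1.0, 1.2, 1.4, 1.1)] +
        [(v, 1) for v in (3.0, 3.2, 3.4, 3.1)])
err = oob_error(data, one_nn)
```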

**Keywords**: model deployment, unseen cases, unlabeled instances, decision tree, sipina, linear discriminant analysis

**Tutorial**: en_sipina_deployment.pdf

**Dataset**: wine_deployment.xls

**References**:

Tanagra Tutorials, "Applying a classifier on a new dataset (Deployment)"

Labels:
Decision tree,
Sipina,
Supervised Learning

## Tuesday, November 3, 2009

### Sipina - Supported file format

Data access is the first step of the data mining process, and a crucial one: it is one of the main criteria used when we want to assess the quality of a tool. If we are not able to load a dataset, we cannot perform any kind of analysis and the software is unusable. If data access is not easy and requires complicated operations, we devote less time to the other steps of the data exploration.

The first goal of this tutorial is to describe the various file formats that Sipina supports. Some of the solutions are described more deeply in other tutorials; we indicate the appropriate reference in those cases. The second goal is to describe the behavior of these formats when we handle a large dataset with 4,817,099 instances and 42 variables.

Last, we learn a decision tree on this dataset in order to evaluate the behavior of Sipina when it processes a large data file.
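When a delimited text file has millions of instances, the key point is to read it as a stream rather than loading it whole into memory. A minimal Python sketch of this streaming pattern (the sample data mimics the weather dataset; this is not Sipina's importer):

```python
import csv
import io

def scan_delimited(stream, delimiter="\t"):
    """Stream a (possibly very large) delimited text file row by row,
    so memory use stays constant whatever the number of instances.
    Returns (number of variables, number of instances)."""
    reader = csv.reader(stream, delimiter=delimiter)
    header = next(reader)                 # first row holds variable names
    n_instances = sum(1 for _ in reader)  # consume the rest lazily
    return len(header), n_instances

sample = "outlook\ttemp\tplay\nsunny\t85\tno\nrainy\t70\tyes\n"
n_vars, n_rows = scan_delimited(io.StringIO(sample))
```

With a real file, `io.StringIO(sample)` would be replaced by `open(path, newline="")`; the row-by-row loop is what keeps a 4.8-million-instance file tractable.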

Keywords: file format, data file importation, decision tree, large dataset, csv, arff, fdm, fdz, zdm

Tutorial: en_Sipina_File_Format.pdf

Dataset: weather.txt and kdd-cup-discretized-descriptors.txt.zip


Labels:
Data file handling,
Decision tree,
Sipina
