Tanagra - Data Mining and Data Science Tutorials: 2014

Friday, December 12, 2014

Correlation analysis (slides)

The aim of the correlation analysis is to characterize the existence, the nature and the strength of the relationship between two quantitative variables. The visual inspection of scatter plots is a prime instrument in a first step, when we have no idea about the form of the underlying relationship between the variables. But, in second step, we need statistical tools to measure the strength of the relationship and to assess its significance.

In these slides, we present the Pearson's product-moment correlation. We show how to estimate its value using a sample. We present the inferential tools which enable to realize hypothesis testing and confidence interval estimation.

But the Pearson correlation is appropriate only to characterize linear relationship. We study the possible solutions for problematic situations with, among others, the Spearman's rank correlation coefficient (Spearman's rho).

Last, the partial correlation coefficient and the related inferential tools are described.

Keywords: correlation, partial correlation, pearson, spearman, hypothesis testing, significance, confidence interval
Components (Tanagra): LINEAR CORRELATION
Slides: Correlation analysis
References:
M. Plonsky, “Correlation”, Psychological Statistics, 2014.

Tuesday, December 2, 2014

Clustering of categorical variables (slides)

The aim of clustering of categorical variables is to group variables according to their relationship. The variables in the same cluster are highly related; variables in different clusters are weakly related. In these slides, we describe an approach based on the Cramer’s V measure of association. We observe that the approach can highlight subset of variables which is useful - for instance - in a variable selection process for a subsequent supervised learning task. But, on the other hand, we have no indication about the nature of these associations. The interpretation of the groups is not obvious.

This leads us to deepen the analysis and to take an interest in the clustering of the categories of nominal variables. An approach based on a measure of similarity between categories using the indicator variables (dummy variables) is described. Other approaches are also reviewed. The main advantage of this kind of analysis (clustering of categories) is that we can easily interpret the underlying nature of the groups.

Keywords: categorical variables, qualitative variables, categories, clustering, clustering variables, latent variable, cramer's v, dice's index, clusters, groups, bottom-up, hierarchical agglomerative clustering, hac, top down, mca, multiple correspondence analysis
Components (Tanagra): CATVARHCA
Slides: Clustering of categorical variables
References:
H. Abdallah, G. Saporta, « Classification d’un ensemble de variables qualitatives » (Clustering of a set of categorical variables), in Revue de Statistique Appliquée, Tome 46, N°4, pp. 5-26, 1998.
F. Harrell Jr, « Hmisc: Harrell Miscellaneous », version 3.14-5.

Wednesday, November 19, 2014

Discretization of continuous attributes (slides)

The discretization consists to transform a continuous attribute into a discrete (ordinal) attribute. The process determines a finite number of intervals from the available values, for which discrete numerical values are assigned. The two main issues of the process are: how to determine the number of intervals; how to determine the cut points.

In this slides, we present some discretization methods for the unsupervised and supervised contexts.

Keywords: discretization, data preprocessing, chi-merge, mdlpc, equal-frequency, equal-width, clustering, top-down, bottom-up, feature construction
Components (Tanagra): EQFREQ DISC, EQWIDTH DISC, MDLPC, BINARY BINNING, CONT TO DISC
Slides: Discretization
Tutorials:
Tanagra Tutorials, "Discretization of continuous features", may 2010.

Wednesday, September 24, 2014

Clustering variables (slides)

The aim of clustering variables is to divide a set of numeric variables into disjoint clusters (subset of variables). In these slides, we present an approach based on the concept of latent component. A subset of variables is summarized by a latent component which is the first factor from the principal component analysis. This is a kind of "centroid" variable which maximizes the sum of the squared correlation with the existing variables. Various clustering algorithms based on this idea are described: a hierarchical agglomerative algorithm; a top down approach; and an approach which is inspired by the k-means method.

Keywords: clustering, clustering variables, latent variable, latent component, clusters, groups, bottom-up, hierarchical agglomerative clustering, top down, varclus, k-means, pca, principal component analysis
Components (Tanagra): VARHCA, VARKMEANS, VARCLUS
Slides: Clustering variables
Tutorials:
Tanagra tutorials, "Variable clustering (VARCLUS)", 2008.

Tuesday, September 16, 2014

Single layer and multilayer perceptrons (slides)

Artificial neural networks are computational models inspired by an animal’s central nervous system (in particular brain) which is capable of machine learning as well as pattern recognition (Wikipedia).

In these slides, we present the single layer and multilayer perceptrons, which are devoted to supervised learning process. We describe the baseline of the approaches: the difference between the linear (single-layer) and non-linear (multilayer) classifiers; the representation power of the models; the learning algorithm (the Widrow-Hoff rule and the back propagation algorithm).

Keywords: artificial neural network, perceptron, single layer, SLP, multilayer, MLP, widrow-hoff rule, backpropagation algorithm, linear classifier, non linear classifier
Components (Tanagra): MULTILAYER PERCEPTRON
Slides: Single layer and multilayer perceptrons
Tutorials:
Tanagra tutorials, "Configuration of a multilayer perceptron", December 2017.
Tanagra tutorials, "Multilayer perceptron - Software comparison", 2008.

Saturday, September 13, 2014

Filter approaches for feature selection (slides)

In the supervised learning context, the filter approach for feature selection consists in the selection of the most appropriate variables for any subsequent machine learning algorithm used for the construction of the model.

The methods are mostly based on the correlation concept (in a large sense). They are interesting because they enable to handle quickly high-dimensional data sets. On the other hand, they are questionable because they do not take into account the characteristics of the model (e.g. linear, non-linear) that will be developed from the selected variables.

Keywords: feature selection, filter methods, embedded methods, wrapper methods
Components (Tanagra): CFS FILTERING, FCBF FILTERING, MIFS FILTERING, MODTREE FILTERING, FEATURE RANKING, FISHER FILTERING, RUNS FILTERING, STEPDISC
Slides: Filter methods
Tutorials:
Tanagra tutorials, "Filter methods for feature selection", 2010.
Tanagra tutorials, "Filter methods for feature selection (continuation)", 2010.

Sunday, August 31, 2014

Association rule learning (slides)

Association rule learning is a popular approach to extract rules from large databases. Initially intended to transactional data, especially for the market basket analysis, the method can be applied to any binary or binarized data.

In these slides, we show the outline of the approach. We present a basic algorithm to generate association rules from data. We highlight the influence of the settings (minimum support and minimum confidence) for the reduction of the search space, and thus for the reduction of the amount of calculations.

Keywords: association rule, association rules, itemset, frequent itemset, eclat algorithm, support, confidence, lift
Components (Tanagra): A PRIORI, A PRIORI MR, A PRIORI PT, FREQUENT ITEMSETS, SPV ASSOC RULE, SPV ASSOC TREE
Slides: Association rule learning
References:
Wikipedia, "Association Rule Learning".
M. Zaki, S. Parthasaraty, M. Ogihara, W. Li, “New Algorithms for Fast Discovery of Association Rules”, in Proc. of KDD’97, p. 283-296, 1997.

Tuesday, August 12, 2014

ROC curve (slides)

The ROC curve is a graphical tool for the evaluation and comparison of binary classifiers. It provides more complete evaluation than the confusion matrix and the error rate. It is valid even if we deal with a non-representative test set i.e. the observed class frequencies are not an estimate of the prior class probabilities. It is especially useful when we deal with class imbalance, and when the misclassification costs matrix is not well established.

In these slides, we show: the ideas underlying the ROC curve; the construction of the curve from a dataset; the calculation of the AUC (area under curve), a synthetic indicator derived from the ROC curve; and the use of the ROC curve for model comparison.

Keywords: receiver operating characteristic, roc curve, auc, area under curve, binary classifier, evaluation, model comparison, class probability estimate, score
Components (Tanagra): SCORING, ROC CURVE
Slides: ROC curve
References:
Wikipedia, "Receiver Operating Characteristic".
T. Fawcett, "An introduction to ROC analysis", Pattern Recognition Letters, 27, 861-874, 2009.

Monday, August 4, 2014

Customer targeting (slides)

Customer targeting is one component of the direct marketing. The aim is to identify the customers which are the most interested in a new product. We are in the data mining context because we create a classifier from a learning sample. But we do not want to classify the instances. We want to measure the probability of the individuals to buy the product i.e. their score, their propensity to purchase. In this context, we use a specific tool - the gain chart (or the cumulative lift curve) - to assess the efficiency of the analysis.

In these slides, we detail the overall process. We emphasize the reading of the gain chart, especially the transposition of the reading of the chart from a labeled sample to the customer database (for which we do not know the values of the target attribute).

Keywords: customer targeting, direct marketing, scoring, score, propensity to purchase
Components (Tanagra): SCORING, LIFT CURVE
Slides: Customer targeting
References:
Microsoft, “Lift chart (Analysis Services – Data Mining)”, SQL Server 2014.
H. Hamilton, “Cumulative Gains and Lift Charts”, in CS 831 – Knowledge Discovery in Databases, 2012.

Saturday, August 2, 2014

Descriptive discriminant analysis (slides)

The descriptive discriminant analysis (DDA) or canonical discriminant analysis is a statistical approach which performs a multivariate characterization of differences between groups. It is related to other factorial approaches such as principal component analysis or canonical correlation analysis.

In these slides, we show the main issues of the approach, and the reading of the results. We show also how the discriminant analysis is related to the predictive discriminant analysis (linear discriminant analysis) which, yet, relies on restrictive statistical assumptions.

Keywords: discriminant analysis, descriptive discriminant analysis, canonical discriminant analysis, predictive discriminant analysis, correlation ratio, R, lda package MASS, sas, proc candisc
Components (Tanagra): CANONICAL DISCRIMANT ANALYSIS
Slides: DDA
Dataset: wine_quality.xls
References:
SAS, "CANDISC procedure".

Friday, July 11, 2014

Clustering tree (slides)

The clustering tree algorithm is both a clustering approach and a multi-objective supervised learning method.

In the cluster analysis framework, the aim is to group objects in clusters, where the objects in the same cluster are similar in a certain sense. The clustering tree algorithm enables to perform this kind of task. We obtain a decision tree as a clustering structure. Thus, the deployment of the classification rule in the information system is really easy.

But we can also consider the clustering tree as an extension of the classification/regression tree because we can distinguish two set of variables: the explained (active) variables which are used to determine the similarities between the objects; the predictive (illustrative) variables which allows to describe the groups.

In this slides, we show the main features of this approach.

Keywords: cluster analysis, clustering, clustering tree, groups characterization
Slides: Clustering tree
References :
M. Chavent (1998), « A monothetic clustering method », Pattern Recognition Letters, 19, 989—996.
H. Blockeel, L. De Raedt, J. Ramon (1998), « Top-Down Induction of Clustering Trees », ICML, 55—63.

Monday, May 19, 2014

Sipina - Version 3.12

The transfer between the Excel spreadsheet and Sipina was improved on the databases of moderate size (on large databases, several hundreds of thousands of rows, it is better to perform a direct importation of data file in text format TXT). The management of the decimal point has been improved. Now, the automatic processing is much faster than before.

The precision of the numerical cut points displayed in a decision tree becomes customizable. The users dispose of a new item into the menu "Tree Management".

Sipina website: Sipina
Download: Setup file

Saturday, May 3, 2014

Binary classification via regression (slides)

In these slides, we study the analogy between the linear discriminant analysis and the regression on an indicator variable when we deal with a binary classification problem.

The tests for the global significance of the model and the individual significance of the coefficients are equivalent. The coefficients are proportional, including the intercepts when we treat the balanced case. In the other case (unbalanced classes), an additional correction of the regression intercept is needed to obtain the linear discriminant analysis intercept.

For the multiclass classification, the equivalence between the regression and the linear discriminant analysis is no longer valid.

Keywords: supervised learning, linear discriminant analysis, multiple linear regression, R2, wilks lambda
Slides: Classification via regression
References :
R.O. Duda, P.E. Hart, D. Stork, « Pattern Classification », 2nd Edition, Wiley, 2000.
C.J. Huberty, S. Olejnik, « Applied MANOVA and Discriminant Analysis »,Wiley, 2006.

Friday, March 28, 2014

Naive Bayes classifier (slides)

Here are the slides I use for my course about “Naive Bayes Classifier”. The main originality of this presentation is that I show it is possible to extract an explicit model based on the calculations of the conditional distributions. This makes highly easier the deployment of the model in real case studies. This aspect is commonly unknown. The feature selection problem in the context of Naive Bayes Classifier learning is also highlighted.

Keywords: machine learning, supervised methods, naive bayes, independence assumption, independent feature model, feature selection, cfs
Slides: Naive Bayes classifier
References:
T. Mitchell, "Generative and discriminative classifiers: naive bayes and logistic regression", in "Machine Learning", McGraw Hill, 2010; Draft of January 2010.
Wikipedia, "Naive Bayes classifier".

Friday, March 14, 2014

Linear discriminant analysis (slides)

Here are the slides I use for my course about “Linear Discriminant Analysis” (LDA). The two main assumptions which enable to obtain a linear classifier are highlighted. The LDA is very interesting because we can interpret the classifier in different ways: it is a parametric method based on the MAP (maximum a posteriori) decision rule; it is a classifier based on a distance to the conditional centroids; it is a linear separator which defines various regions in the representation space.

Statistical tools for the overall model evaluation and the checking of the relevance of the predictive variables are presented.

Keywords: machine learning, supervised methods, discriminant analysis, predictive discriminant analysis, linear discriminant analysis, linear classification functions, wilks lambda, stepdisc, feature selection
Slides: linear discriminant analysis
References:
J. Gareth, D. Witten, T. Hastie, R. Tibshirani, "An introduction to statistical learning with applications in R", Springer, 2013.
R. Duda, P. Hart, G. Stork, "Pattern Classification", Wiley, 2000.

Friday, March 7, 2014

Regression Trees

Here are the slides I use for my course about “Regression Trees”. Because this course comes after the one about “Decision Trees”, only the special features for the handling of a continuous target attribute are highlighted. The described algorithms correspond roughly to the AID and the CART approaches.

Keywords: machine learning, supervised methods, regression tree, aid, cart, continuous class attribute
Slides: regression trees
References:
L. Breiman, J. Friedman, R. Olshen and C. Stone, “Classification and Regression Trees”, Wadsworth Int. Group, 1984.
J. Morgan, J.A. Sonquist, "Problems in the Analysis of Survey Data and a Proposal", JASA, 58:415-435, 1963.

Saturday, March 1, 2014

Decision tree learning algorithms

Here are the slides I use for my course about the existing decision tree learning algorithms. Only the most popular ones are described: C4.5, CART and CHAID (a variant). The differences between these approaches are highlighted according: the splitting measure; the merging strategy during the splitting process; the approach for determining the right sized tree.

Keywords: machine learning, supervised methods, decision tree learning, classification tree, chaid, cart, c4.5
Slides: C4.5, CART and CHAID
References:
L. Breiman, J. Friedman, R. Olshen and C. Stone, “Classification and Regression Trees”, Wadsworth Int. Group, 1984.
G. Kass, “An exploratory technique for Investigating Large Quantities of Categorical Data”, Applied Statistics, 29(2), 1980, pp. 119-127.
R. Quinlan, “C4.5: Programs for machine learning”, Morgan Kaufman, 1993.

Thursday, February 27, 2014

Introduction to Decision Trees

Here are the lecture notes I use for my course “Introduction to Decision Trees”. The basic concepts of the decision tree algorithm are described. The underlying method is rather similar to the CHAID approach.

Keywords: machine learning, supervised methods, decision tree learning, classification tree
Slides: Introduction to Decision Trees
References:
T. Mitchell, "Decision Tree Learning", in "Machine Learning", McGraw Hill, 1997; Chapter 3, pp. 52-80.
L. Rokach, O. Maimon, "Decision Trees ", in "The Data Mining and Knowledge Discovery Handbook", Springer, 2005; Chapter 9, pp. 165-192.

Saturday, February 22, 2014

Introduction to Supervised Learning

Here are the lecture notes I use for my course “Introduction to Supervised Learning”. The presentation is very simplified. But, all the important elements are described: the goal of the supervised learning process, the Bayes rule, the evaluation of the models using the confusion matrix.

Keywords: machine learning, supervised methods, model, classifier, target attribute, class attribute, input attributes, descriptors, bayes rule, confusion matrix, error rate, sensitivity, precision, specificity
Slides: Introduction to Supervised Learning
References:
O. Maimon, L. Rokach, "Introduction to Supervised Methods", in "The Data Mining and Knowledge Discovery Handbook", Springer, 2005; Chapter 8, pp. 149-164.
T. Hastie, R. Tibshirani, J. Friedman, "The elements of Statistical Learning", Springer, 2009.

Wednesday, February 5, 2014

Cluster analysis for mixed data

The aim of clustering is to gather together the instances of a dataset in a set of groups. The instances in the same cluster are similar according a similarity (or dissimilarity) measure. The instances in distinct groups are different. The influence of the used measure, which is often a distance measure, is essential in this process. They are well known when we work on attributes with the same type. The Euclidian distance is often used when we deal with numeric variables; the chi-square distance is more appropriate when we deal with categorical variables. The problem is a lot of more complicated when we deal with a set of mixed data i.e. with both numeric and categorical values. It is admittedly possible to define a measure which handles simultaneously the two kinds of variables, but we have trouble with the weighting problem. We must define a weighting system which balances the influence of the attributes, indeed the results must not depend of the kind of the variables. This is not easy .

Previously we have studied the behavior of the factor analysis for mixed data (AFDM in French). This is a generalization of the principal component analysis which can handle both numeric and categorical variables . We can calculate, from a set of mixed variables, components which summarize the information available in the dataset. These components are a new set of numeric attributes. We can use them to perform the clustering analysis based on standard approaches for numeric values.

In this paper, we present a tandem analysis approach for the clustering of mixed data. First, we perform a factor analysis from the original set of variables, both numeric and categorical. Second, we launch the clustering algorithm on the most relevant factor scores. The main advantage is that we can use any type of clustering algorithm for numeric variables in the second phase. We expect also that by selecting a few number of components, we use the relevant information from the dataset, the results are more reliable .

We use Tanagra 1.4.49 and R (ade4 package) in this case study.

Keywords: AFDM, FAMD, factor analysis for mixed data, clustering, cluster analysis, hac, hierarchical agglomerative clustering, R software, hclust, ade4 package, dudi.mix, cutree, groups description
Components: AFDM, HAC, GROUP CHARACTERIZATION, SCATTERPLOT
Tutorial: en_Tanagra_Clustering_Mixed_Data.pdf
Dataset: bank_clustering.zip
References:
Tanagra, "Factor Analysis for Mixed Data".
Jerome Pages, « Analyse Factorielle de Données Mixtes », Revue de Statistique Appliquee, tome 52, n°4, 2004 ; pages 93-111.

Wednesday, January 15, 2014

Scilab and R - Performance comparison

We have studied the Scilab tool in a data mining scheme in a previous tutorial . We noted that Scilab is well adapted for data mining. It is a credible alternative to R. But, we observed also that the available toolboxes for statistical processing and data mining are not very numerous compared to those of R. In this second tutorial, we evaluate the behavior of Scilab when we deal with a dataset with 500,000 instances and 22 attributes. We compare its performances with those of R. Two criteria are used: the memory occupation measured in the Windows task manager; the execution time at each step of the process.

It is not possible to obtain an exhaustive point of view. To delimit the scope of our study, we have specified a standard supervised learning scenario: loading a data file, building the predictive model with linear discriminant analysis approach, calculating the confusion matrix and resubstitution error rate. Of course, this study is incomplete. But it seems that Scilab is less efficient in the data management step. It is however quite efficient in the modeling step. This last assessment depends on the toolbox used.

Keywords: scilab, toolbox, nan, linear discriminant analysis, R software, sipina, tanagra
Tutorial: en_Tanagra_Scilab_R_Comparison.pdf
Dataset: waveform_scilab_r.zip
References:
Scilab - https://www.scilab.org/en
Michaël Baudin, "Introduction to Scilab (in French)", Developpez.com.

Tuesday, January 7, 2014

Data Mining with Scilab

I know the name "Scilab" for a long time (http://www.scilab.org/en). For me, it is a tool for numerical analysis. It seemed not interesting in the context of the statistical data processing and data mining. Recently a mathematician colleague spoke to me about this tool. He was surprised about the low visibility of Scilab within the data mining community, knowing that it proposes functionalities which are quite similar to those of R software. I confess that I did not know Scilab from this perspective. I decided to study Scilab by setting a basic goal: is it possible to perform simply a predictive analysis process with Scilab? Namely: loading a data file (learning sample), building a predictive model, obtaining a description of its characteristics, loading a test sample, applying the model on this second set of data, building the confusion matrix and calculating the test error rate.

We will see in this tutorial that the whole task has been completed successfully easily. Scilab is perfectly prepared to fulfill statistical treatments. But two small drawbacks appear during the catch in hand of Scilab: the library of statistical functions exists but it is not as comprehensive as that of R; their documentation is not very extensive at this time. However, I am very satisfied of this first experience. I discovered an excellent free tool, flexible and efficient, very easy to take in hand, which turns out a credible alternative to R in the field of data mining.

Keywords: scilab, toolbox, nan, libsvm, linear discriminant analysis, R software, predictive analytics
Tutorial : en_Tanagra_Scilab_Data_Mining.pdf
Dataset : data_mining_scilab.zip
References :
Scilab - https://www.scilab.org/fr
ATOMS : Homepage - http://atoms.scilab.org/

Thursday, January 2, 2014

Tanagra, tenth anniversary

First of all, let me introduce you to all my wishes of happiness, health and success for the year 2014 which begins.

For Tanagra, 2014 is of quite particular importance. 10 Years ago almost to the day, the first version of the software has been put on line. Designed originally as a tool for the students and researchers in the data mining domain, the project has changed a bit of nature in recent years. Today, Tanagra is an academic project which provides a point of access to the statistical and the data mining techniques. It is addressed to students, but also to the researchers of other areas (psychology, sociology, archeology, etc.). It allows, I hope, make it more attractive, more clear, the implementation of these techniques on real case studies.

This mutation was accompanied by a refocusing of my activity. The Tanagra software is still evolving (we are at version 1.4.50), new methods are added, existing components are regularly improved, but at the same time I put emphasis on the documentation in the form of books, training materials and tutorials. The underlying idea is very simple: understanding the ins and outs of the methods is the best way to learn how to use software which proposes them.

Over the past 5 years (2009/01/01 to 2013/12/31), my site gets 677 visits per day. The 10 countries that come most often are: France, Morocco, Algeria, Tunisia, United States, India, Canada, Belgium, United Kingdom and Brazil. The page of the materials for my data mining courses is the most visited (http://eric.univ-lyon2.fr/~ricco/cours/supports_data_mining.html; 99 visits per day, 6 minutes 35 seconds average time spent on the page). At the same time, I note with great satisfaction that the English pages are overall as much visited as those in French. I think that the effort to write documentation in English is fruitful.

I hope that this work will be useful for a long time, and that 2014 will be the opportunity of exchanges always so rewarding for everybody.

Ricco.