Thursday, February 27, 2014

Introduction to Decision Trees

Here are the lecture notes I use for my course “Introduction to Decision Trees”. The slides describe the basic concepts of the decision tree algorithm; the method presented is closest to the CHAID approach.
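Mitchell's chapter (referenced below) builds tree induction around an impurity-based splitting criterion. As a minimal illustration of the idea (a generic sketch using entropy and information gain, not the chi-square criterion used by CHAID), the quality of a candidate split can be scored as follows; the toy attribute and class values are invented for the example:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the instances on one attribute."""
    base = entropy(labels)
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attribute], []).append(y)
    return base - sum(len(part) / n * entropy(part) for part in split.values())

# Toy data: each row is a dict of attribute values (hypothetical example).
rows = [
    {"outlook": "sunny"}, {"outlook": "sunny"},
    {"outlook": "rain"}, {"outlook": "rain"},
]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, "outlook"))  # 1.0: the split separates the classes perfectly
```

The tree-growing process simply repeats this scoring over all candidate attributes and keeps the best split at each node.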

Keywords: machine learning, supervised methods, decision tree learning, classification tree
Slides: Introduction to Decision Trees
T. Mitchell, "Decision Tree Learning", in "Machine Learning", McGraw Hill, 1997; Chapter 3, pp. 52-80.
L. Rokach, O. Maimon, "Decision Trees", in "The Data Mining and Knowledge Discovery Handbook", Springer, 2005; Chapter 9, pp. 165-192.

Saturday, February 22, 2014

Introduction to Supervised Learning

Here are the lecture notes I use for my course “Introduction to Supervised Learning”. The presentation is deliberately simplified, but all the essential elements are covered: the goal of the supervised learning process, the Bayes decision rule, and the evaluation of models using the confusion matrix.
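To make the confusion-matrix measures listed in the keywords concrete, here is a small Python sketch (an illustration, not code from the slides) that derives error rate, sensitivity, precision and specificity for a two-class problem; the labels are invented for the example:

```python
def confusion_metrics(y_true, y_pred, positive):
    """Derive the usual performance measures from a 2-class confusion matrix."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return {
        "error_rate": (fp + fn) / len(y_true),
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),   # true negative rate
    }

# Hypothetical predictions: 2 true positives, 1 false negative, 1 false positive.
y_true = ["+", "+", "+", "-", "-", "-", "-", "-"]
y_pred = ["+", "+", "-", "-", "-", "-", "+", "-"]
print(confusion_metrics(y_true, y_pred, positive="+"))
```

Each measure answers a different question, which is why the slides present them together rather than relying on the error rate alone.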

Keywords: machine learning, supervised methods, model, classifier, target attribute, class attribute, input attributes, descriptors, bayes rule, confusion matrix, error rate, sensitivity, precision, specificity
Slides: Introduction to Supervised Learning
O. Maimon, L. Rokach, "Introduction to Supervised Methods", in "The Data Mining and Knowledge Discovery Handbook", Springer, 2005; Chapter 8, pp. 149-164.
T. Hastie, R. Tibshirani, J. Friedman, "The elements of Statistical Learning", Springer, 2009.

Wednesday, February 5, 2014

Cluster analysis for mixed data

The aim of clustering is to gather the instances of a dataset into groups: instances in the same cluster are similar according to a similarity (or dissimilarity) measure, while instances in distinct groups are different. The choice of measure, often a distance, is therefore essential. Suitable measures are well known when all the attributes have the same type: the Euclidean distance is commonly used for numeric variables, while the chi-square distance is more appropriate for categorical variables. The problem becomes much more complicated with mixed data, i.e. with both numeric and categorical values. It is admittedly possible to define a measure that handles the two kinds of variables simultaneously, but we then face a weighting problem: we must define a weighting scheme that balances the influence of the attributes, since the results must not depend on the type of the variables. This is not easy.
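To illustrate the weighting problem, here is a rough Python sketch of a Gower-style combined dissimilarity (one common workaround, not the method of this tutorial); the attribute names and the `alpha` weight are arbitrary choices made for the example, and the result shifts directly with `alpha`:

```python
def mixed_dissimilarity(a, b, numeric_ranges, alpha=0.5):
    """Gower-style dissimilarity between two mixed instances.
    `a` and `b` are dicts; numeric attributes are range-normalised,
    categorical ones use simple matching (0 if equal, 1 otherwise).
    `alpha` is an arbitrary weight balancing the numeric part against
    the categorical part: this is exactly the weighting problem."""
    num = [abs(a[k] - b[k]) / numeric_ranges[k] for k in numeric_ranges]
    cat = [0.0 if a[k] == b[k] else 1.0 for k in a if k not in numeric_ranges]
    num_part = sum(num) / len(num) if num else 0.0
    cat_part = sum(cat) / len(cat) if cat else 0.0
    return alpha * num_part + (1 - alpha) * cat_part

# Hypothetical instances with two numeric and one categorical attribute.
x = {"age": 25, "income": 30000, "city": "Lyon"}
y = {"age": 45, "income": 50000, "city": "Paris"}
ranges = {"age": 40, "income": 100000}
print(mixed_dissimilarity(x, y, ranges, alpha=0.5))  # 0.675
```

Changing `alpha` reorders the distances between instances, so the clusters found downstream depend on a parameter with no principled value; this is the difficulty that the factor-analysis approach below sidesteps.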

We previously studied the behavior of factor analysis for mixed data (AFDM, from its French acronym). This is a generalization of principal component analysis that can handle both numeric and categorical variables. From a set of mixed variables, we can compute components which summarize the information available in the dataset. These components form a new set of numeric attributes, which we can then use to perform the cluster analysis with standard approaches for numeric values.

In this tutorial, we present a tandem analysis approach for the clustering of mixed data. First, we perform a factor analysis on the original set of variables, both numeric and categorical. Second, we launch the clustering algorithm on the most relevant factor scores. The main advantage is that any clustering algorithm for numeric variables can be used in the second phase. We also expect that, by selecting a small number of components, we keep only the relevant information from the dataset, so the results are more reliable.
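The two-step pipeline can be sketched in plain Python (an illustration of the principle only; the tutorial itself relies on Tanagra and the R functions dudi.mix, hclust and cutree): a leading factor score is extracted by power iteration from numerically encoded data, then a simple 2-means is applied to the scores. The toy matrix below is invented, with the categorical attribute already dummy-coded and the numeric columns centred:

```python
def first_component_scores(X):
    """Project the rows of X (already centred and numerically encoded)
    onto the leading principal axis, found by power iteration on X'X."""
    n, p = len(X), len(X[0])
    C = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
         for a in range(p)]
    v = [1.0] * p
    for _ in range(200):
        w = [sum(C[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return [sum(X[i][a] * v[a] for a in range(p)) for i in range(n)]

def two_means_1d(scores, iters=20):
    """Plain 2-means clustering on the one-dimensional factor scores."""
    c1, c2 = min(scores), max(scores)
    for _ in range(iters):
        g1 = [s for s in scores if abs(s - c1) <= abs(s - c2)]
        g2 = [s for s in scores if abs(s - c1) > abs(s - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return [0 if abs(s - c1) <= abs(s - c2) else 1 for s in scores]

# Toy encoded mixed data: two centred numeric columns, one 0/1 dummy column.
X = [[-1.0, -1.2, 0.0], [-0.8, -1.0, 0.0], [1.0, 1.1, 1.0], [0.9, 1.2, 1.0]]
print(two_means_1d(first_component_scores(X)))  # two groups of two rows
```

In practice the second phase would use hierarchical agglomerative clustering on several retained components, as done in the tutorial, but the division of labor is the same: the factor step absorbs the mixed-type problem, the clustering step works on plain numeric scores.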

We use Tanagra 1.4.49 and R (ade4 package) in this case study.

Keywords: AFDM, FAMD, factor analysis for mixed data, clustering, cluster analysis, hac, hierarchical agglomerative clustering, R software, hclust, ade4 package, dudi.mix, cutree, groups description
Tutorial: en_Tanagra_Clustering_Mixed_Data.pdf
Tanagra, "Factor Analysis for Mixed Data".
Jérôme Pagès, « Analyse Factorielle de Données Mixtes », Revue de Statistique Appliquée, tome 52, n°4, 2004 ; pp. 93-111.