Wednesday, February 5, 2014

Cluster analysis for mixed data

The aim of clustering is to gather together the instances of a dataset in a set of groups. The instances in the same cluster are similar according a similarity (or dissimilarity) measure. The instances in distinct groups are different. The influence of the used measure, which is often a distance measure, is essential in this process. They are well known when we work on attributes with the same type. The Euclidian distance is often used when we deal with numeric variables; the chi-square distance is more appropriate when we deal with categorical variables. The problem is a lot of more complicated when we deal with a set of mixed data i.e. with both numeric and categorical values. It is admittedly possible to define a measure which handles simultaneously the two kinds of variables, but we have trouble with the weighting problem. We must define a weighting system which balances the influence of the attributes, indeed the results must not depend of the kind of the variables. This is not easy .

Previously we have studied the behavior of the factor analysis for mixed data (AFDM in French). This is a generalization of the principal component analysis which can handle both numeric and categorical variables . We can calculate, from a set of mixed variables, components which summarize the information available in the dataset. These components are a new set of numeric attributes. We can use them to perform the clustering analysis based on standard approaches for numeric values.

In this paper, we present a tandem analysis approach for the clustering of mixed data. First, we perform a factor analysis from the original set of variables, both numeric and categorical. Second, we launch the clustering algorithm on the most relevant factor scores. The main advantage is that we can use any type of clustering algorithm for numeric variables in the second phase. We expect also that by selecting a few number of components, we use the relevant information from the dataset, the results are more reliable .

We use Tanagra 1.4.49 and R (ade4 package) in this case study.

Keywords: AFDM, FAMD, factor analysis for mixed data, clustering, cluster analysis, hac, hierarchical agglomerative clustering, R software, hclust, ade4 package, dudi.mix, cutree, groups description
Components: AFDM, HAC, GROUP CHARACTERIZATION, SCATTERPLOT 
Tutorial: en_Tanagra_Clustering_Mixed_Data.pdf
Dataset: bank_clustering.zip
References:
Tanagra, "Factor Analysis for Mixed Data".
Jerome Pages, « Analyse Factorielle de Données Mixtes », Revue de Statistique Appliquee, tome 52, n°4, 2004 ; pages 93-111.