Tuesday, December 2, 2014

Clustering of categorical variables (slides)

The aim of clustering of categorical variables is to group variables according to their relationship. The variables in the same cluster are highly related; variables in different clusters are weakly related. In these slides, we describe an approach based on the Cramer’s V measure of association. We observe that the approach can highlight subset of variables which is useful - for instance - in a variable selection process for a subsequent supervised learning task. But, on the other hand, we have no indication about the nature of these associations. The interpretation of the groups is not obvious.

This leads us to deepen the analysis and to take an interest in the clustering of the categories of nominal variables. An approach based on a measure of similarity between categories using the indicator variables (dummy variables) is described. Other approaches are also reviewed. The main advantage of this kind of analysis (clustering of categories) is that we can easily interpret the underlying nature of the groups.

Keywords: categorical variables, qualitative variables, categories, clustering, clustering variables, latent variable, cramer's v, dice's index, clusters, groups, bottom-up, hierarchical agglomerative clustering, hac, top down, mca, multiple correspondence analysis
Components (Tanagra): CATVARHCA
Slides: Clustering of categorical variables
References:
H. Abdallah, G. Saporta, « Classification d’un ensemble de variables qualitatives » (Clustering of a set of categorical variables), in Revue de Statistique Appliquée, Tome 46, N°4, pp. 5-26, 1998.
F. Harrell Jr, « Hmisc: Harrell Miscellaneous », version 3.14-5.