Tanagra - Data Mining and Data Science Tutorials: December 2014

Friday, December 12, 2014

Correlation analysis (slides)

The aim of correlation analysis is to characterize the existence, the nature and the strength of the relationship between two quantitative variables. Visual inspection of scatter plots is the primary tool in a first step, when we have no idea about the form of the underlying relationship between the variables. In a second step, however, we need statistical tools to measure the strength of the relationship and to assess its significance.

In these slides, we present Pearson's product-moment correlation. We show how to estimate its value from a sample. We then present the inferential tools used for hypothesis testing and confidence interval estimation.
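As a rough illustration of the estimation and inference steps mentioned above, here is a minimal sketch in plain Python. It assumes the usual textbook formulas (the sample estimate of r, the t statistic with n - 2 degrees of freedom for testing a zero correlation, and a confidence interval based on Fisher's z transformation); the slides may present these differently. The function names and the toy data are illustrative only.

```python
import math
from statistics import NormalDist, mean

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """Test statistic for H0: rho = 0; compare to a Student t with n - 2 df."""
    return r * math.sqrt((n - 2) / (1 - r * r))

def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for rho via Fisher's z transformation."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    q = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - q * se), math.tanh(z + q * se)

# Toy data, invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = pearson_r(x, y)
print(r, t_statistic(r, len(x)), fisher_ci(r, len(x)))
```

With only five observations the interval is very wide, which is a reminder that a large estimated r on a small sample is not necessarily significant.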

However, the Pearson correlation is appropriate only for characterizing linear relationships. We study possible solutions for problematic situations, including Spearman's rank correlation coefficient (Spearman's rho).
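To make the contrast concrete, the sketch below computes Spearman's rho as the Pearson correlation of the ranks, using average ranks for ties. This is the standard definition, not necessarily the exact presentation of the slides, and the helper names are hypothetical. On a monotonic but non-linear relationship (here y = x cubed), rho reaches 1 while the Pearson coefficient stays below 1.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Monotonic, non-linear toy relationship.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(pearson_r(x, y), spearman_rho(x, y))
```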

Finally, the partial correlation coefficient and the related inferential tools are described.
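For readers unfamiliar with the partial correlation, here is a minimal sketch of the first-order case (one control variable z), written from the usual textbook formula in terms of the three pairwise correlations. The test statistic shown assumes n - 3 degrees of freedom, the standard choice with a single control variable. Function names and numeric values are illustrative assumptions.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def partial_t_statistic(r_partial, n):
    """Statistic for H0: partial rho = 0; compare to a Student t with n - 3 df."""
    return r_partial * math.sqrt((n - 3) / (1 - r_partial ** 2))

# Invented values: x and y look correlated (0.6), but both depend strongly on z.
r = partial_corr(r_xy=0.6, r_xz=0.8, r_yz=0.75)
print(r, partial_t_statistic(r, n=30))
```

In this invented example, the apparent link between x and y vanishes once z is held fixed, which is the kind of situation the partial correlation is meant to reveal.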

Keywords: correlation, partial correlation, pearson, spearman, hypothesis testing, significance, confidence interval
Components (Tanagra): LINEAR CORRELATION
Slides: Correlation analysis
References:
M. Plonsky, “Correlation”, Psychological Statistics, 2014.

Tuesday, December 2, 2014

Clustering of categorical variables (slides)

The aim of clustering categorical variables is to group variables according to the strength of their relationships. Variables in the same cluster are highly related; variables in different clusters are weakly related. In these slides, we describe an approach based on Cramer’s V measure of association. We observe that the approach can highlight subsets of variables, which is useful, for instance, in a variable selection process for a subsequent supervised learning task. On the other hand, it gives no indication about the nature of these associations, so the interpretation of the groups is not obvious.
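As an illustration of the association measure named above, the following sketch computes Cramer's V from the chi-squared statistic of the contingency table of two nominal variables, then derives a dissimilarity 1 - V for every pair of variables. Feeding such a dissimilarity to a hierarchical agglomerative clustering is one plausible reading of the approach; it is an assumption here, not a description of the CATVARHCA component. The variable names and data are invented.

```python
import math
from collections import Counter
from itertools import combinations

def cramers_v(a, b):
    """Cramer's V between two nominal variables given as equal-length lists."""
    n = len(a)
    joint = Counter(zip(a, b))
    row, col = Counter(a), Counter(b)
    chi2 = 0.0
    for ra, na in row.items():
        for cb, nb in col.items():
            expected = na * nb / n
            observed = joint.get((ra, cb), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0

# Invented toy dataset with three nominal variables.
data = {
    "color": ["red", "red", "blue", "blue", "red", "blue"],
    "shade": ["warm", "warm", "cold", "cold", "warm", "cold"],
    "size": ["S", "L", "S", "L", "L", "S"],
}

# Pairwise dissimilarities (1 - V), usable as input to a clustering of variables.
dissimilarity = {
    (u, v): 1 - cramers_v(data[u], data[v]) for u, v in combinations(data, 2)
}
print(dissimilarity)
```

A small dissimilarity tells us that two variables are strongly associated, but not which categories drive the association.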

This leads us to deepen the analysis and to consider the clustering of the categories of nominal variables. We describe an approach based on a measure of similarity between categories, computed from their indicator variables (dummy variables). Other approaches are also reviewed. The main advantage of clustering categories is that we can easily interpret the underlying nature of the groups.
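The idea of comparing categories through their indicator variables can be sketched as follows. The example uses the common Dice coefficient, 2|A∩B| / (|A| + |B|), as the similarity; "dice's index" appears in the keywords, but the exact variant used in the slides may differ, so treat this as an assumption. The data and helper names are hypothetical.

```python
from itertools import combinations

def indicators(name, values):
    """One 0/1 indicator column per category of a nominal variable."""
    return {
        f"{name}={c}": [1 if v == c else 0 for v in values]
        for c in sorted(set(values))
    }

def dice(u, v):
    """Dice coefficient between two 0/1 indicator columns."""
    both = sum(a & b for a, b in zip(u, v))
    total = sum(u) + sum(v)
    return 2 * both / total if total else 0.0

# Invented toy data: two nominal variables observed on four individuals.
color = ["red", "red", "blue", "blue"]
size = ["S", "S", "L", "S"]

columns = {**indicators("color", color), **indicators("size", size)}
similarity = {
    (p, q): dice(columns[p], columns[q]) for p, q in combinations(columns, 2)
}
print(similarity)
```

Two categories of the same variable never co-occur, so their similarity is 0; categories of different variables that tend to appear on the same individuals get a high similarity and would end up in the same group.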

Keywords: categorical variables, qualitative variables, categories, clustering, clustering variables, latent variable, cramer's v, dice's index, clusters, groups, bottom-up, hierarchical agglomerative clustering, hac, top down, mca, multiple correspondence analysis
Components (Tanagra): CATVARHCA
Slides: Clustering of categorical variables
References:
H. Abdallah, G. Saporta, “Classification d’un ensemble de variables qualitatives” (Clustering of a set of categorical variables), Revue de Statistique Appliquée, Tome 46, N°4, pp. 5-26, 1998.
F. Harrell Jr, “Hmisc: Harrell Miscellaneous”, version 3.14-5.