Tanagra - Data Mining and Data Science Tutorials: March 2013

Sunday, March 31, 2013

Factor Analysis for Mixed Data

Usually, as a factor analysis approach, we use the principal component analysis (PCA) when the active variables are quantitative; the multiple correspondence analysis (MCA) when they are all categorical. But what to do when we have a mix of these two types of variables?

A possible strategy is to discretize the quantitative variables and use the MCA. But this procedure is not recommended if we have a small dataset (a few number of instances), or if the number of qualitative variables is low in comparison with the number of quantitative ones. In addition, the discretization implies a loss of information. The choice of the number of intervals and the calculation of the cut points are not obvious.
Another possible strategy is to replace each qualitative variable by a set of dummy variables (a 0/1 indicator for each category of the variable to recode). Then we use the PCA. This strategy has a drawback. Indeed, because the dispersions of the variables (the quantitative variables and the indicator variables) are not comparable, we will obtain biased results.

The Jérôme Pages' "Multiple Factor Analysis for Mixed Data" (2004) [AFDM in French] relies on this second idea. But it introduces an additional refinement. It uses dummy variables, but instead of the 0/1, it uses the 0/x values, where 'x' is computed from the frequency of the concerned category of the qualitative variable. We can therefore use a standard program for PCA to lead the analysis (Pages, 2004; page 102). The calculation process is thus well controlled. But the interpretation of the results requires a little extra effort since it will be different depending on whether we study the role of a quantitative or qualitative variable.

In this tutorial, we show how to perform an AFDM with Tanagra 1.4.46 and R 1.15.1 (FactoMinerR package). We emphasize the reading of the results. We must study simultaneously the influence of quantitative and qualitative variables for the interpretation of the factors.

Keywords: PCA, principal component analysis, MCA, multiple correspondence analysis, AFDM, correlation, correlation ratio, FactoMineR package, R osftware
Components: AFDM, SCATTERPLOT WITH LABEL, CORRELATION SCATTERPLOT, VIEW MULTIPLE SCATTERPLOT
Tutorial: en_Tanagra_AFDM.pdf
Dataset: AUTOS2005AFDM.txt
References :
Jerome Pages, « Analyse Factorielle de Données Mixtes », Revue de Statistique Appliquee, tome 52, n°4, 2004 ; pages 93-111.

Saturday, March 2, 2013

Correspondence Analysis - Tools comparison

The correspondence analysis (or factorial correspondence analysis) is an exploratory technique which enables to detect the salient associations in a two-way contingency table. It proposes an attractive graphical display where the rows and the columns of the table are depicted as points. Thus, we can visually identify the similarities and the differences between the rows profiles (between the columns profiles). We can also detect the associations between rows and columns.

The correspondence analysis (CA) can be viewed as an approach to decompose the chi-squared statistic associated with a two-way contingency table into orthogonal factors. In fact, because CA is a descriptive technique, it can be applied to tables even if the chi-square test of independence is not appropriate. The only restriction is that the table must contain positive or zero values, the calculating the sum of the rows and the columns is possible, the rows and columns profiles can be interpreted.

The correspondence analysis can be viewed as a factorial technique. Factors are latent variables defined from linear combinations of the rows profiles (or columns profiles). We can use the factors scores coefficients to calculate the coordinate of supplementary rows or columns.

In this tutorial, we show how to implement the CA on a realistic dataset with various tools: Tanagra 1.4.48, which incorporates new features for a better reading of the results; R software, using the "ca" and "ade4" packages; OpenStat; and SAS (PROC CORRESP). We will see - as always - that all these software produce exactly the same numerical results (fortunately!). The differences are found mainly in terms of the organization of the outputs.

Keywords: correspondence analysis, symmetric graph, R software, package ca, package ade4, openstat, sas
Components: CORRESPONDENCE ANALYSIS
Tutorial: en_Tanagra_Correspondence_Analysis.pdf
Dataset: statements_foods.zip
References :
M. Bendixen, « A practical guide to the use of the correspondence analysis in marketing research », Marketing Research On-Line, 1 (1), pp. 16-38, 1996.
Tanagra Tutorial, "Correspondence Analysis".