Tanagra - Data Mining and Data Science Tutorials: December 2012

Sunday, December 30, 2012

Discriminant Correspondence Analysis

The aim of the canonical discriminant analysis is to explain the belonging to pre-defined groups of instances of a dataset. The groups are specified by a dependent categorical variable (class attribute, response variable); the explanatory variables (descriptors, predictors, independent variables) are all continuous. So, we obtain a small number of latent variables which enable to distinguish as far as possible the groups. These new features, called factors, are linear combinations of the initial descriptors. The process is a valuable dimensionality reduction technique. But its main drawback is that it cannot be directly applied when the descriptors are discrete. Even if the calculations are possible if we recode the variables using dummy variables for instance, the interpretation of the results - which is one of the main goals of the canonical discriminant analysis - is not really obvious.

In this tutorial, we present a variant of the discriminant analysis which is applicable to discrete descriptors due to Hervé Abdi (2007) . The approach is based on a transformation of the raw dataset in a kind of contingency table. The rows of the table correspond to the values of the target attribute; the columns are the indicators associated to the predictors’ values. Thus, the author suggests to use a correspondence analysis, on the one hand, in order to distinguish the groups, and on the other hand, to detect the relevant relationships between the values of the target attribute and those of the explanatory variables. The author called its approach "discriminant correspondence analysis" because it uses a correspondence analysis framework to solve a discriminant analysis problem.

In what follows, we detail the use of the discriminant correspondence analysis with Tanagra 1.4.48. We use the example described in the Hervé Abdi's paper. The goal is to explain the origin of 12 wines (3 possible regions) using 5 descriptors related to characteristics assessed by professional tasters. In a second part (section 3), we reproduce all the calculations with a program written for R.

Keywords: canonical discriminant analysis, descriptive discriminant analysis, correspondence analysis, R software, xlsx package, ca package
Components: DISCRIMINANT CORRESPONDENCE ANALYSIS
Tutorial : Tutorial DCA
Dataset: french_wine_dca.zip
References:
H. Abdi, « Discriminant correspondence analysis », In N.J. Salkind (Ed.): Encyclopedia of Measurement and Statistics. Thousand Oaks (CA): Sage. pp. 270-275, 2007.

Saturday, December 1, 2012

Tanagra - Version 1.4.48

New components have been added.

K-Means Strengthening. This component was suggested to me by Mrs. Claire Gauzente. The idea is to strengthen an existing partition (e.g. from a HAC) by using K-Means algorithm. A comparison of groups before and after optimization is proposed, indicating the efficiency of the optimization. The approach can be plugged to all clustering algorithm into Tanagra. Thanks to Claire for this valuable idea.

Discriminant Correspondence Analysis. This is an extension of the canonical discriminant analysis to discrete attributes (Hervé Abdi, 2007). The approach is based on a clever transformation of the dataset. The initial dataset is transformed into a crosstab. The values of the target attribute are in row, all the values of the input attributes are in column. The algorithm performs a correspondence analysis to this new data table to identify the associations between the values of the target and the input variables. Thus, we dispose of the tools of the correspondence analysis for a comprehensive reading of the results (factor scores, contributions, quality of representation).

Other components have been improved.

HAC. After the choice of the number of groups in the dendrogram in the Hierarchical Agglomerative Clustering, a last pass on the data is performed, it assigns each individual of the learning sample into the group with the nearest centroid. Thus, there may be discrepancy between the number of instances displayed on the tree nodes and the number of individuals in the groups. Tanagra displays the two partitions. Only the last one is used when Tanagra applies the clustering model on new instances, when it computes conditional statistics, etc.

Correspondence Analysis. Tanagra now provides the coefficients of the factor score functions for supplementary columns and rows in the factorial correspondence analysis. Thus, it will be possible to easily calculate the factor scores of new points described by their row or column profile. Finally, the results tables can be sorted according to contributions to the factors of the modalities.

Multiple correspondence analysis. Several improvements have been made to the multiple correspondence analysis: the component knows how to take into account supplementary continuous and discrete variables; the variables can be sorted according to their contribution to the factors; all indicators for the interpretation can be brought together in a single large table for a synthetic visualization of the results, this feature is especially interesting if we have a small number of factors; the coefficients for the factor score functions are provided, we can easily calculate the factorial coordinates of the supplementary individuals apart from Tanagra.

Some tutorials will come soon to describe the use of these components on realistic case studies.

Download page : setup