Tanagra - Data Mining and Data Science Tutorials: Tanagra

Saturday, December 1, 2012

Tanagra - Version 1.4.48

New components have been added.

K-Means Strengthening. This component was suggested to me by Mrs. Claire Gauzente. The idea is to strengthen an existing partition (e.g. from a HAC) by using K-Means algorithm. A comparison of groups before and after optimization is proposed, indicating the efficiency of the optimization. The approach can be plugged to all clustering algorithm into Tanagra. Thanks to Claire for this valuable idea.

Discriminant Correspondence Analysis. This is an extension of the canonical discriminant analysis to discrete attributes (Hervé Abdi, 2007). The approach is based on a clever transformation of the dataset. The initial dataset is transformed into a crosstab. The values of the target attribute are in row, all the values of the input attributes are in column. The algorithm performs a correspondence analysis to this new data table to identify the associations between the values of the target and the input variables. Thus, we dispose of the tools of the correspondence analysis for a comprehensive reading of the results (factor scores, contributions, quality of representation).

Other components have been improved.

HAC. After the choice of the number of groups in the dendrogram in the Hierarchical Agglomerative Clustering, a last pass on the data is performed, it assigns each individual of the learning sample into the group with the nearest centroid. Thus, there may be discrepancy between the number of instances displayed on the tree nodes and the number of individuals in the groups. Tanagra displays the two partitions. Only the last one is used when Tanagra applies the clustering model on new instances, when it computes conditional statistics, etc.

Correspondence Analysis. Tanagra now provides the coefficients of the factor score functions for supplementary columns and rows in the factorial correspondence analysis. Thus, it will be possible to easily calculate the factor scores of new points described by their row or column profile. Finally, the results tables can be sorted according to contributions to the factors of the modalities.

Multiple correspondence analysis. Several improvements have been made to the multiple correspondence analysis: the component knows how to take into account supplementary continuous and discrete variables; the variables can be sorted according to their contribution to the factors; all indicators for the interpretation can be brought together in a single large table for a synthetic visualization of the results, this feature is especially interesting if we have a small number of factors; the coefficients for the factor score functions are provided, we can easily calculate the factorial coordinates of the supplementary individuals apart from Tanagra.

Some tutorials will come soon to describe the use of these components on realistic case studies.

Download page : setup