Tanagra - Data Mining and Data Science Tutorials: July 2011

Sunday, July 24, 2011

PLS Discriminant Analysis - A comparative study

PLS regression is a technique designed to predict the values taken by a group of Y variables (target variables, dependent variables) from a set of X variables (descriptors, independent variables). Initially defined for the prediction of continuous target variables, PLS regression can be adapted in several ways to the prediction of a single discrete variable, i.e. to the supervised learning framework. In this context the approach is called "PLS Discriminant Analysis" (PLS-DA). It carries over the well-known strengths of PLS regression: the ability to handle a representation space of very high dimensionality, with a large number of noisy and/or redundant descriptors.

This tutorial continues a previous paper that presented some variants of PLS-DA. We describe the behavior of one of them (PLS-LDA, PLS Linear Discriminant Analysis) on a learning set where the number of descriptors is moderately high (278 descriptors) relative to the number of instances (232 instances). Even though the number of descriptors is not really very high, our experiment shows a valuable characteristic of the PLS approach: we can control the variance of the classifier by adjusting the number of latent variables.

To assess this idea, we compare the behavior of PLS-LDA with state-of-the-art supervised learning methods such as K-nearest neighbors, SVM (Support Vector Machine, from the LIBSVM library), Breiman's Random Forest approach, and Fisher's Linear Discriminant Analysis.
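To make the role of the number of latent variables concrete, here is a minimal sketch of one possible PLS-LDA scheme for a binary target: PLS1 scores are extracted with NIPALS, then a two-class linear discriminant analysis is fitted on those scores. It is not Tanagra's implementation. The data are synthetic (only the 232 x 278 dimensions are borrowed from the text), and the function names, the train/test split, and the tested numbers of components are assumptions chosen for illustration.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1 (single target) with NIPALS; return what is needed to project new data."""
    x_mean = X.mean(axis=0)
    Xk = X - x_mean
    yk = y - y.mean()
    W, P = [], []
    for _ in range(n_components):
        w = Xk.T @ yk                  # direction of maximal covariance with the target
        w /= np.linalg.norm(w)
        t = Xk @ w                     # latent variable (score vector)
        p = Xk.T @ t / (t @ t)         # X loadings
        c = (yk @ t) / (t @ t)         # y loading
        Xk = Xk - np.outer(t, p)       # deflate X
        yk = yk - c * t                # deflate y
        W.append(w)
        P.append(p)
    W, P = np.array(W).T, np.array(P).T
    rotation = W @ np.linalg.inv(P.T @ W)  # maps centered X directly to the scores
    return x_mean, rotation

def pls_scores(X, x_mean, rotation):
    """Project observations onto the latent variables."""
    return (X - x_mean) @ rotation

def lda_fit(T, y):
    """Two-class LDA with pooled covariance, fitted on the PLS scores T."""
    T0, T1 = T[y == 0], T[y == 1]
    m0, m1 = T0.mean(axis=0), T1.mean(axis=0)
    pooled = ((T0 - m0).T @ (T0 - m0) + (T1 - m1).T @ (T1 - m1)) / (len(T) - 2)
    w = np.linalg.solve(pooled, m1 - m0)
    b = -0.5 * w @ (m0 + m1) + np.log(len(T1) / len(T0))
    return w, b

def lda_predict(T, w, b):
    """Predict class 1 when the discriminant score is positive."""
    return (T @ w + b > 0).astype(int)

# Synthetic data (assumption): same size as in the text, NOT the arrhythmia dataset.
rng = np.random.default_rng(0)
n, p = 232, 278
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[:, :10] += 1.5 * y[:, None]          # only the first 10 descriptors carry class signal

train, test = slice(0, 150), slice(150, n)
results = {}
for k in (1, 2, 5, 10):                # number of latent variables
    x_mean, rot = pls1_fit(X[train], y[train], k)
    w, b = lda_fit(pls_scores(X[train], x_mean, rot), y[train])
    results[k] = {
        name: float(np.mean(lda_predict(pls_scores(X[s], x_mean, rot), w, b) == y[s]))
        for name, s in (("train", train), ("test", test))
    }
    print(k, results[k])
```

Comparing the train and test accuracies across values of `k` gives a rough view of how adding latent variables trades bias for variance; on real data the number of components would normally be chosen by cross-validation.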

Keywords: pls regression, linear discriminant analysis, supervised learning, support vector machine, SVM, random forest, nearest neighbor
Components: K-NN, PLS-LDA, BAGGING, RND TREE, C-SVC, TEST, DISCRETE SELECT EXAMPLES, REMOVE CONSTANT
Tutorial: en_Tanagra_PLS_DA_Comparaison.pdf
Dataset: arrhytmia.bdm
References:
S. Chevallier, D. Bertrand, A. Kohler, P. Courcoux, « Application of PLS-DA in multivariate image analysis », in J. Chemometrics, 20 : 221-229, 2006.
Garson, « Partial Least Squares Regression (PLS) », http://www2.chass.ncsu.edu/garson/PA765/pls.htm

Sunday, July 17, 2011

Tanagra add-on for OpenOffice Calc 3.3

Tanagra add-on for OpenOffice 3.3 and LibreOffice 3.4.

The connection with spreadsheet applications is certainly a factor in Tanagra's success. It is easy to manipulate a dataset in OpenOffice Calc (up to version 3.2) and send it to Tanagra for further analysis using the TanagraLibrary.zip extension.

Recently, users have reported to me that this mechanism does not work with recent versions of OpenOffice (version 3.3) and LibreOffice (version 3.4). I realized that, rather than patching it, it was more appropriate to develop a new module that meets the extension-management standard of these tools. The new library "TanagraModule.oxt" is now incorporated into the distribution.

This tutorial describes how to install and use this add-on under OpenOffice Calc 3.3. Adapting it to LibreOffice 3.4 is very easy.

Keywords : data importation, spreadsheet application, openoffice, libreoffice, add-in, add-on, excel
Component : View Dataset
Tutorial : en_Tanagra_Addon_OpenOffice_LibreOffice.pdf
Dataset : breast.ods
References:
Tanagra tutorial, "OOo Calc file handling using an add-in"
Tanagra tutorial, "Launching Tanagra from OOo Calc under Linux"

Tuesday, July 5, 2011

Tanagra - Version 1.4.40

This new version brings a few improvements.

A new add-on has been created for the connection between Tanagra and recent versions of the OpenOffice Calc spreadsheet. The old one did not work with recent versions (OpenOffice 3.3 and LibreOffice 3.4). During installation, a separate library ("TanagraModule.oxt") is added so as not to interfere with the old one, which still works with earlier versions of OpenOffice (3.2 and earlier). A tutorial describing its installation and use will be put online soon. I take this opportunity to stress again how convenient a privileged connection between a spreadsheet and a specialized data mining tool is. The annual poll organized by the kdnuggets.com website shows the interest in this connection (2011, 2010, 2009,...). Note that a similar add-on exists for the R software (R4Calc). This change was suggested by Jérémy Roos (OpenOffice) and Franck Thomas (LibreOffice).

Non-standardized PCA (principal component analysis) is now available. It is obtained by unchecking the data standardization option in the Principal Component Analysis component. Change suggested by Elvire Antanjan.
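The difference between standardized and non-standardized principal component analysis can be illustrated with a short sketch. It is not Tanagra's code: the function `pca_eigenvalues`, its `standardize` flag, and the two-variable synthetic dataset are assumptions made for the example. With standardization the analysis diagonalizes the correlation matrix; without it, the covariance matrix, so variables measured on a large scale dominate the first axis.

```python
import numpy as np

def pca_eigenvalues(X, standardize=True):
    """Eigenvalues, in decreasing order, of the correlation matrix
    (standardize=True) or of the covariance matrix (standardize=False)."""
    Z = X - X.mean(axis=0)             # centering is done in both cases
    if standardize:
        Z = Z / X.std(axis=0)          # unit variance for every variable
    matrix = Z.T @ Z / len(X)
    return np.linalg.eigvalsh(matrix)[::-1]

# Two independent synthetic variables on very different scales (assumption).
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 100, 200)])

std_eig = pca_eigenvalues(X, standardize=True)
raw_eig = pca_eigenvalues(X, standardize=False)
print("standardized, share of 1st axis:    ", std_eig[0] / std_eig.sum())
print("non-standardized, share of 1st axis:", raw_eig[0] / raw_eig.sum())
```

In the non-standardized case almost all of the variance falls on the first axis, simply because the second variable has a much larger scale. The option is therefore mainly useful when the variables share the same unit.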

Simultaneous regression has been introduced. It is very similar to the method implemented in LazStats, which is unfortunately no longer freely available. The approach is described in a free online booklet, "Practice of linear regression analysis" (in French), section 3.6.

Color coding according to the p-value has been introduced in the Linear Correlation component. Change suggested by Samuel KL.

Once again, thank you very much to all those who help me improve this work with their comments and suggestions.

Download page: setup