Tanagra - Data Mining and Data Science Tutorials: April 2009

Thursday, April 30, 2009

Principal Component Analysis (PCA)

PCA belongs to the family of factor analysis approaches. It is used to discover the underlying structure of a set of variables. It reduces the attribute space from a larger number of variables to a smaller number of factors (dimensions); as such, it is a "non-dependent" procedure, i.e. it does not assume that a dependent variable is specified.
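To make the reduction concrete, here is a minimal sketch of a PCA on standardized variables, written with NumPy rather than Tanagra. The data values and the function name `correlation_pca` are made up for illustration; this is not the AUTOS_ACP.XLS dataset and not necessarily how Tanagra computes its tables.

```python
import numpy as np

# Hypothetical toy data: 6 observations (rows) x 3 numeric variables (columns).
X = np.array([
    [2.0, 60.0, 1.1],
    [3.0, 75.0, 1.4],
    [4.0, 90.0, 1.3],
    [5.0, 110.0, 1.9],
    [6.0, 115.0, 2.2],
    [7.0, 140.0, 2.4],
])

def correlation_pca(X):
    """PCA on standardized variables, i.e. on the correlation matrix."""
    n, p = X.shape
    Z = (X - X.mean(axis=0)) / X.std(axis=0)      # center and scale (ddof=0)
    R = (Z.T @ Z) / n                             # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)          # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]             # sort factors by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                          # coordinates of the observations
    # Correlation between each variable and each factor (rows: variables).
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return eigvals, scores, loadings

eigvals, scores, loadings = correlation_pca(X)
explained = eigvals / eigvals.sum()               # share of variance per factor
print("eigenvalues:", np.round(eigvals, 3))
print("explained variance:", np.round(explained, 3))
```

The first two columns of `loadings` are the coordinates one would plot on a correlation circle, and the first two columns of `scores` give the positions of the observations on the first factorial plane.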

In this tutorial, we show how to implement this approach and how to interpret the results with Tanagra. We use the AUTOS_ACP.XLS dataset from Saporta's reference book. The advantage of this dataset is that we can compare our results with those described in the book (pages 177 to 181). Here we simply show the sequence of operations and how to read the result tables. For the detailed interpretation, it is best to refer to the book.

Keywords: factor analysis, principal component analysis, correlation circle
Components: Principal Component Analysis, View Dataset, Scatterplot with labels, View multiple scatterplot
Tutorial: en_Tanagra_Acp.pdf
Dataset: autos_acp.xls
References:
G. Saporta, "Probabilités, Analyse de données et Statistique", Dunod, 2006; pages 177 to 181.
D. Garson, "Factor Analysis".
Statsoft Textbook, "Principal components and factor analysis".

Multiple Correspondence Analysis (MCA)

Multiple correspondence analysis is a factor analysis approach. It deals with a tabular dataset in which a set of examples is described by a set of categorical variables. The aim is to map the dataset into a space of reduced dimension (usually two), which allows us to highlight the associations between the examples and the variables. It is useful for understanding the underlying structure of a tabular dataset.
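As an illustration of the mapping, here is a minimal sketch of MCA computed as a correspondence analysis of the indicator (0/1) matrix, using NumPy. The toy data and the function names are hypothetical; this is not the races_canines_acm.xls dataset, and Tanagra's own implementation may differ (for instance in how it scales coordinates).

```python
import numpy as np

# Hypothetical toy data: 6 examples described by 2 categorical variables.
data = [
    ("small", "low"),
    ("small", "low"),
    ("medium", "low"),
    ("medium", "high"),
    ("large", "high"),
    ("large", "high"),
]

def indicator_matrix(rows):
    """Build one 0/1 column per (variable index, category) pair."""
    n_vars = len(rows[0])
    columns = sorted({(j, row[j]) for row in rows for j in range(n_vars)})
    Z = np.array([[1.0 if row[j] == cat else 0.0 for (j, cat) in columns]
                  for row in rows])
    return Z, columns

def mca(rows):
    """Correspondence analysis of the indicator matrix."""
    Z, columns = indicator_matrix(rows)
    P = Z / Z.sum()                      # relative frequencies
    r = P.sum(axis=1)                    # row masses (examples)
    c = P.sum(axis=0)                    # column masses (categories)
    # Standardized residuals from the independence model r * c.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sing > 1e-12                  # drop numerically null dimensions
    U, sing, Vt = U[:, keep], sing[keep], Vt[keep]
    row_coords = (U / np.sqrt(r)[:, None]) * sing      # examples
    col_coords = (Vt.T / np.sqrt(c)[:, None]) * sing   # categories
    return sing ** 2, row_coords, col_coords, columns

inertia, row_coords, col_coords, columns = mca(data)
print("eigenvalues:", np.round(inertia, 3))
for (j, cat), xy in zip(columns, col_coords[:, :2]):
    print(f"variable {j}, category {cat}: {np.round(xy, 3)}")
```

For an indicator matrix, the total inertia equals the number of categories divided by the number of variables, minus one (here 5/2 - 1 = 1.5), which gives a quick sanity check on the eigenvalues.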

In this tutorial, we show how to implement this approach and how to interpret the results with Tanagra. The ability to copy and paste the results into a spreadsheet is certainly one of the most interesting features of the software. It gives us access to tools (sorting, formatting, etc.) in an environment that is familiar to data processing practitioners. For example, being able to sort the various tables according to the contributions and the COS2 proves really practical when one wishes to interpret the dimensions.

Keywords: factor analysis, multiple correspondence analysis
Components: Multiple correspondance analysis, View Dataset, Scatterplot with labels, View multiple scatterplot
Tutorial: en_Tanagra_Acm.pdf
Dataset: races_canines_acm.xls
References:
M. Tenenhaus, "Méthodes statistiques en gestion", Dunod, 1996; pages 212 to 222 (in French).
Statsoft Inc., "Multiple Correspondence Analysis".
D. Garson, "Statnotes - Correspondence Analysis".

Sunday, April 26, 2009

Support Vector Regression (SVR)

Support Vector Machines (SVM) are a well-known approach in the machine learning community. They are usually applied to classification problems in a supervised learning framework. But SVM can also be used for regression, where we want to predict or explain the values taken by a continuous target attribute. In this context, we speak of Support Vector Regression.

The method is not widely used among statisticians. Yet it combines qualities that rank it favorably compared with existing techniques. It behaves well even when the ratio between the number of variables and the number of observations becomes very unfavorable, or when the predictors are highly correlated. Another advantage is the kernel principle (the famous "kernel trick"): it is possible to construct a non-linear model without explicitly having to produce new descriptors. A deeper study of the characteristics of the method allows us to compare it with penalized regression such as ridge regression.
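The kernel principle can be checked on a tiny example. The sketch below, in plain Python, uses a homogeneous polynomial kernel of degree 2 on two-dimensional inputs (a choice made here for illustration, not one taken from the tutorial) and verifies that evaluating the kernel gives the same value as an inner product between explicitly constructed descriptors.

```python
import math

def dot(u, v):
    """Ordinary inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2

def poly2_features(x):
    """Explicit descriptors for a 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)
implicit = poly2_kernel(x, z)                          # no new descriptors built
explicit = dot(poly2_features(x), poly2_features(z))   # descriptors built by hand
print(implicit, explicit)
```

Both values agree up to floating-point rounding. A kernel method only ever needs the left-hand computation, which is why the expanded descriptors never have to be produced.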

The first goal of this tutorial is to show how to use the two new SVR components of Tanagra 1.4.31. They are based on the famous LIBSVM library, which we already use for classification (see the C-SVC component). We compare our results with those of the R software (version 2.8.0), using the e1071 package, which is also based on the LIBSVM library.

The second goal is to present a new assessment component for regression. It is usual in the supervised learning framework to split the dataset into two parts, the first for the learning process and the second for its evaluation, in order to obtain an unbiased estimate of the performance. We can apply the same approach to regression. The procedure is even essential when we try to compare models of various complexities (or various degrees of freedom). We will see in this tutorial that the usual indicators calculated on the learning data are highly misleading in certain situations. We must use an independent test set when we want to assess a model.
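The effect described above can be reproduced with a small plain-Python sketch. Everything here is an assumption made for illustration: the data are simulated, and a 1-nearest-neighbour regressor stands in for "a very flexible model"; it is not one of the Tanagra components, and RMSE is only one possible indicator.

```python
import math
import random

random.seed(0)

def make_data(n):
    """Simulated data (hypothetical): y = 2x + 1 + Gaussian noise."""
    xs = [random.uniform(0.0, 10.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + random.gauss(0.0, 1.0)) for x in xs]

# Learning set and independent test set.
train, test = make_data(30), make_data(30)

def fit_linear(data):
    """Least squares with a single predictor; returns a prediction function."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxy = sum((x - mx) * (y - my) for x, y in data)
    sxx = sum((x - mx) ** 2 for x, _ in data)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_1nn(data):
    """1-nearest-neighbour regressor: it memorizes the learning set."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

def rmse(predict, data):
    """Root mean squared error of a prediction function on a dataset."""
    return math.sqrt(sum((y - predict(x)) ** 2 for x, y in data) / len(data))

results = {}
for name, fit in (("linear", fit_linear), ("1-NN", fit_1nn)):
    model = fit(train)
    results[name] = (rmse(model, train), rmse(model, test))
    print(f"{name:>6}: train RMSE = {results[name][0]:.3f}, "
          f"test RMSE = {results[name][1]:.3f}")
```

On the learning set the 1-NN model has an error of exactly zero, so an indicator computed there would rank it above the linear model, whatever its real predictive quality. Only the error on the test set gives a fair basis for comparison.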

Keywords: support vector regression, support vector machine, regression, linear regression, regression assessment, R software, package e1071
Components: MULTIPLE LINEAR REGRESSION, EPSILON SVR, NU SVR, REGRESSION ASSESSMENT
Tutorial: en_Tanagra_Support_Vector_Regression.pdf
Dataset: qsar.zip
References:
C.C. Chang, C.J. Lin, "LIBSVM - A Library for Support Vector Machines".
S. Gunn, "Support Vector Machine for Classification and Regression", Technical Report of the University of Southampton, 1998.
A. Smola, B. Scholkopf, "A tutorial on Support Vector Regression", 2003.

Thursday, April 23, 2009

Launching Tanagra from OOo Calc under Linux

The integration of Tanagra into a spreadsheet, such as Excel or Open Office Calc (OOo Calc or OOCalc), is undoubtedly an advantage. Without special knowledge about the database format, users can handle the dataset in a familiar environment, the spreadsheet, and send it to specialized Data Mining tools when they want to carry out more sophisticated analyses.

The add-on for OOCalc was initially created for Windows. I recently described the installation and use of Tanagra under Linux. The next step is of course the integration of Tanagra into OOCalc under Linux. Mr. Thierry Leiber did this work for version 1.4.31 of Tanagra by extending the existing add-on. We can now launch Tanagra from OOCalc under both Windows and Linux. The add-on was tested under the following configurations: Windows XP + OOCalc 3.0.0; Windows Vista + OOCalc 3.0.1; Ubuntu 8.10 + OOCalc 2.4; Ubuntu 8.10 + OOCalc 3.0.1.

This document extends a previous tutorial, but we now work in a Linux environment (Ubuntu 8.10). All the screenshots are in French because my OS is in French, but I think the process is the same for Linux with other language settings.

Keywords: open office calc, add-on, principal component analysis, PCA, correlation circle, illustrative variable, linux, ubuntu 8.10 intrepid ibex
Components: PRINCIPAL COMPONENT ANALYSIS, CORRELATION SCATTERPLOT
Tutorial: en_Tanagra_OOCalc_under_Linux.pdf
Dataset: cereals.xls
References:
Tanagra, "Connection with Open Office Calc"
Tanagra, "Tanagra under Linux"

Wednesday, April 15, 2009

Tanagra - Version 1.4.31

Thierry Leiber has improved the add-on that connects Tanagra and Open Office. It is now possible, under Linux, to install the add-on for Open Office and launch Tanagra directly after selecting the data (see the tutorials on installing Tanagra under Linux and on integrating the add-on into Open Office Calc). Thierry, thank you very much for this contribution, which helps the users of Tanagra.

Following a suggestion from Mr. Laurent Bougrain, the confusion matrix has been added to the automatic saving of results in experiments. Thank you to Laurent, and to all the others whose constructive comments help me improve Tanagra in the right direction.

In addition, two new components for regression using the support vector machine principle (support vector regression) were added: Epsilon-SVR and Nu-SVR. A tutorial that presents these methods and compares our results with those of the R software will be available soon. Tanagra, like the R package "e1071", is based on the famous LIBSVM library.

Tutorials about these releases are coming soon.