Sunday, August 14, 2011

PLS Regression - Software comparison

Comparing the behavior of tools is always a good way to improve them.

To check and validate the implementation of methods. Validating the implemented algorithms is essential for data mining tools. Even if two programmers use the same references (books, articles), their programming choices can change the behavior of a method (for instance, depending on how the convergence conditions are interpreted). Analyzing the source code is one possible solution. But while it is often available for free software, this is not the case for commercial tools. Thus, the only way to check them is to compare the results that the tools provide on a benchmark dataset. If there are divergences, we must explain them by analyzing the formulas used.

To improve the presentation of results. Reports should follow certain standards, a consensus established by reference books and/or the leading tools in the field. Some ratios should be presented in a certain way. Users need reference points.

Our implementation of the PLS approach is based on Tenenhaus's book (1998), which itself refers to the SIMCA-P tool. Using a limited version of this software (version 11), we have checked the results provided by Tanagra on various datasets. We show here the results of the study on the CARS dataset. We extend the comparison to other data mining tools.
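
The kind of check described above, comparing the numbers reported by two tools on the same dataset, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name compare_outputs, the tolerances, and the coefficient values are all hypothetical, and the numbers are made up rather than taken from the CARS study or from any of the tools mentioned here.

```python
import numpy as np

def compare_outputs(name_a, coef_a, name_b, coef_b, rtol=1e-4, atol=1e-6):
    """Report whether two tools agree on a vector of estimates.

    Hypothetical helper: the tolerances are arbitrary defaults chosen
    for illustration, not values recommended by any of the tools.
    """
    a = np.asarray(coef_a, dtype=float)
    b = np.asarray(coef_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("outputs have different shapes")
    # Largest absolute gap between the two sets of estimates.
    max_abs_diff = float(np.max(np.abs(a - b)))
    # Element-wise agreement within the chosen tolerances.
    agree = bool(np.allclose(a, b, rtol=rtol, atol=atol))
    return {"tools": (name_a, name_b), "max_abs_diff": max_abs_diff, "agree": agree}

# Made-up coefficients, NOT results from the CARS dataset.
tool_a = [0.4521, -0.1203, 0.0877]
tool_b = [0.4521, -0.1203, 0.0877]
tool_c = [0.4498, -0.1197, 0.0873]  # pretend this tool made a different choice

report_ab = compare_outputs("tool A", tool_a, "tool B", tool_b)
report_ac = compare_outputs("tool A", tool_a, "tool C", tool_c)
print(report_ab)
print(report_ac)
```

When a pair of tools disagrees, as the hypothetical tools A and C do here, the size of the gap only tells us where to look; the explanation still has to come from the formulas each tool uses.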

Keywords: pls regression, software comparison, simca-p, spad, sas, r software, pls package
Tutorial: en_Tanagra_PLSR_Software_Comparison.pdf
Dataset: cars_pls_regression.xls
References:
M. Tenenhaus, « La régression PLS – Théorie et pratique », Technip, 1998.
D. Garson, « Partial Least Squares Regression », from Statnotes: Topics in Multivariate Analysis.
UMETRICS. SIMCA-P for Multivariate Data Analysis.