Sunday, April 26, 2009

Support Vector Regression (SVR)

Support Vector Machines (SVM) is a well-know approach in the machine learning community. It is usually implemented for a classification problem in a supervised learning framework. But SVM can be used also in a regression process, where we want to predict or explain the values taken by a continuous predicted attribute. We say Support Vector Regression in this context.

The method is not widely diffused among statisticians. Yet it combines the qualities that rank it favorably compared with existing techniques. It has a well behavior even if the ratio between the number of variables and the number of observations becomes very unfavorable, with highly correlated predictors. Another advantage is the principle of kernel (the famous "kernel trick"). It is possible to construct a non-linear model without explicitly having to produce new descriptors. A deeply study of the characteristics of the method allows to make comparison with penalized regression such as ridge regression.

The first subject of this tutorial is to show how to use two new SVR components of the 1.4.31 version of Tanagra. They are based on the famous LIBSVM library. We use the same library for the classification (see C-SVC component). We compare our results to those of the R software (version 2.8.0). We utilize the e1071 package for R. It is also based on the LIBSVM library.

The second subject is to propose a new assessment component for the regression. It is usual in the supervised learning framework to split the dataset into two parts, the first for the learning process, the second for its evaluation, in order to obtain an unbiased estimation of the performances. We can implement the same approach for the regression. The procedure is even essential when we try to compare models with various complexities (or various degrees of freedom). We will see in this tutorial that the usual indicators calculated on the learning data are highly misleading in certain situations. We must use an independent test set when we want assess a model.

Keywords: support vector regression, support vector machine, regression, linear regression, regression assessment, R software, package e1071
Components: MULTIPLE LINEAR REGRESSION, EPSILON SVR, NU SVR, REGRESSION ASSESSMENT
Tutorial: en_Tanagra_Support_Vector_Regression.pdf
Dataset: qsar.zip
References :
C.C. Chang, C.J. Lin, "LIBSVM - A Library for Support Vector Machines".
S. Gunn, « Support Vector Machine for Classification and Regression », Technical Report of the University of Southampton, 1998.
A. Smola, B. Scholkopf, « A tutorial on Support Vector Regression », 2003.