Monday, September 19, 2011

Regression model deployment

Model deployment is one of the main objectives of the data mining process. We want to apply a model learned on a training set to unseen cases, i.e. any individual coming from the population. In the classification framework, the aim is to assign to each instance its class value from its description [e.g. Apply a classifier on a new dataset (Deployment)]. In the clustering framework, we try to detect the group that is most similar to the instance according to its characteristics (e.g. K-Means - Classification of a new instance).

Here we are concerned with the regression framework. The aim is to predict the values of the dependent variable for unseen (or unlabeled) instances from the observed values of the independent variables. The process is straightforward for a linear regression model: we apply the estimated parameters to the unseen instances. It becomes more difficult for more complex models, such as support vector regression with nonlinear kernels, or models built from a combination of techniques (e.g. regression on the factors of a principal component analysis). In this context, it is essential that the data mining tool itself handles the deployment process.
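For the linear case, deployment amounts to combining the estimated coefficients with the descriptors of each new instance. The following is a minimal sketch in plain Python, independent of Tanagra; the function names `fit_linear_regression` and `predict`, as well as the toy data, are assumptions made for illustration only.

```python
def fit_linear_regression(X, y):
    """Estimate OLS coefficients by solving the normal equations (X'X) b = X'y."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend the intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting (assumes X'X is invertible)
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for k in range(p):
            if k != col:
                factor = A[k][col] / A[col][col]
                A[k] = [a - factor * b for a, b in zip(A[k], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]  # [intercept, b1, ..., bp]


def predict(coefs, X_new):
    """Deployment step: apply the estimated coefficients to unseen instances."""
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], row)) for row in X_new]


# Toy labeled sample (hypothetical), generated exactly as y = 1 + 2*x1 + 3*x2
X_train = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y_train = [9, 8, 19, 18, 26]

# Unlabeled instances: only the independent variables are known
X_unseen = [[6, 2], [0, 0]]

coefs = fit_linear_regression(X_train, y_train)
y_pred = predict(coefs, X_unseen)
print(coefs, y_pred)
```

With a combined model (for instance regression on principal component factors), the same deployment step would first have to replay every transformation learned on the training set, which is why it is more convenient to let the tool handle it.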

With Tanagra, it is easy to deploy regression models, even when they result from a combination of techniques. We simply have to prepare the data file in a particular way. In this tutorial, we describe how to organize the data file in order to deploy various models in a unified framework: a linear regression model, a PLS regression model, a support vector regression model with an RBF (radial basis function) kernel, a regression tree model, and a regression model built on the factors of a principal component analysis. Then, we export the results (the predicted values of the dependent variable) to a new data file. Finally, we check whether the predicted values are similar across the various models.
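The final check, comparing the predictions of the various models, can be done with pairwise linear correlations once the predicted values have been exported. The sketch below is written in plain Python rather than with Tanagra; the `pearson` helper, the column names and the predicted values are all hypothetical and serve only to illustrate the idea.

```python
import math


def pearson(u, v):
    """Pearson linear correlation between two equally long, non-constant series."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


# Hypothetical predicted values for the same five unseen instances,
# one column per model (these numbers are NOT results from the tutorial)
predictions = {
    "linear_regression": [24.0, 21.5, 30.1, 18.2, 15.9],
    "pls_regression": [23.8, 21.9, 29.7, 18.6, 16.3],
    "regression_tree": [25.0, 21.0, 31.5, 17.0, 17.0],
}

# Pairwise correlation matrix, stored as a dict keyed by pairs of model names
names = list(predictions)
corr = {
    (a, b): pearson(predictions[a], predictions[b])
    for a in names
    for b in names
}

for a in names:
    print(a, [round(corr[(a, b)], 3) for b in names])
```

Correlations close to 1 suggest that the models rank and scale the unseen instances in a similar way; they do not say which model is the most accurate, since the true values of the dependent variable are unknown for these instances.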

Keywords: model deployment, linear regression, pls regression, support vector regression, SVR, regression tree, cart, principal component analysis, pca, regression of factor scores
Components: MULTIPLE LINEAR REGRESSION, PLS REGRESSION, PLS SELECTION, C-RT REGRESSION TREE, EPSILON SVR, PRINCIPAL COMPONENT ANALYSIS, RECOVER EXAMPLES, EXPORT DATASET, LINEAR CORRELATION
Tutorial: en_Tanagra_Multiple_Regression_Deployment.pdf
Dataset: housing.xls
References:
R. Rakotomalala, Régression linéaire multiple - Diaporama (in French)