Monday, August 22, 2011

Predictive model deployment with R (filehash)

Model deployment is the last task of the data mining steps. It corresponds to several aspects e.g. generating a report about the data exploration process, highlighting the useful results; applying models within an organization's decision making process; etc .

In this tutorial, we look at the context of predictive data mining. We are concerned about the construction of the model from a labeled dataset; the storage of the model; the distribution of the model, without the dataset used for its construction; the application of the model on new instances in order to assign them a class label from their description (the values of the descriptors).

We describe the filehash package for R which allows to deploy a model easily. The main advantage of this solution is that R can be launched under various operating systems. Thus, we can create a model with R under Windows; and apply the model in another environment, for instance with R under Linux. The solution can be easily generalized on a large scale because it is possible to launch R in batch mode. The update of the system will concern only the model file in the future.

We will write three R programs to distinguish the steps of the deployment process. The first one constructs a model from the dataset and stores it into a binary file (filehash format). The second one loads the model in another R session and uses it to label new instances from a second data file. The predictions are stored in a data file (CSV file format). Last, the third program loads the predictions and another data file containing the observed labels for these instances, and calculates the confusion matrix and the generalization error rate.

We use various predictive models in order to check the flexibility of the solutions. We tried the following ones: decision tree (rpart); logistic regression (glm); linear discriminant analysis (lda); linear discriminant analysis from factors of principal component analysis (lda + pca). This last one allowed to check if the system remains operational when we manipulate a combination of models.

Keywords: R software, filehash package, deployment, predictive model, rpart, lda, pca, glm, decision tree, linear discriminant analysis, logistic regression, principal component analysis, linear discriminant analysis on latent variables
Tutorial: en_Tanagra_Deploying_Predictive_Models_with_R.pdf
Dataset: pima-model-deployment.zip
References:
R package, "Filehash : Simple key-value database"
Kdnuggets, "Data mining deployment Poll"