Thursday, May 3, 2012

Using PDI-CE for model deployment (PMML)

Model deployment is a crucial step of the data mining process. In supervised learning, it typically consists of applying the predictive model to new unlabeled cases. We have already described this task for various tools (e.g. Tanagra, Sipina, Spad, R). What these descriptions have in common is that the same tool is used for both model construction and model deployment.

In this tutorial, we describe a process in which model construction and model deployment are carried out with different tools. This is only possible if (1) the model is described in a standard format, and (2) the tool used for deployment can handle both the database of unlabeled instances and the model. Here, we use the PMML standard to share the model, and PDI-CE (Pentaho Data Integration Community Edition) to apply the model to the unseen cases.
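
To make condition (2) concrete, here is a minimal Python sketch that checks whether a data file provides every input field a PMML model expects. It is only an illustration, not what PDI-CE does internally: the PMML fragment, the field names (`income`, `owner`, `class`) and the helper functions are hypothetical, and the fragment omits parts that a complete PMML file would contain (such as the header and the data dictionary).

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical PMML fragment: only the MiningSchema matters for this check.
PMML_TEXT = """<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <TreeModel functionName="classification">
    <MiningSchema>
      <MiningField name="income"/>
      <MiningField name="owner" usageType="active"/>
      <MiningField name="class" usageType="predicted"/>
    </MiningSchema>
  </TreeModel>
</PMML>"""


def local(tag):
    """Drop the XML namespace so the code works across PMML versions."""
    return tag.rsplit("}", 1)[-1]


def required_fields(pmml_text):
    """Names of the input fields; usageType defaults to 'active' when absent."""
    doc = ET.fromstring(pmml_text)
    return [
        field.get("name")
        for field in doc.iter()
        if local(field.tag) == "MiningField"
        and field.get("usageType", "active") == "active"
    ]


def missing_fields(pmml_text, csv_text):
    """Input fields of the model that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in required_fields(pmml_text) if name not in header]


if __name__ == "__main__":
    print(missing_fields(PMML_TEXT, "income,owner\n60,false\n"))  # []
    print(missing_fields(PMML_TEXT, "id,income\n1,60\n"))         # ['owner']
```

The target field (`class` here) is deliberately not required, since the instances to score are unlabeled.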

We create a decision tree with various tools such as SIPINA, KNIME or RAPIDMINER; we export the model in the PMML format; then we use PDI-CE to apply the model to a data file containing unlabeled instances. We see that the PMML standard dramatically enhances the capabilities of both the data mining tool and the ETL tool.
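
The idea of applying an exported tree to unlabeled instances can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the mechanism used by PDI-CE or by any of the tools above: the tiny PMML document, its fields (`income`, `owner`) and the function names are hypothetical; only `True` and `SimplePredicate` tests with basic comparison operators are handled; and when no child node matches, the sketch returns the score of the last matching node (similar in spirit to PMML's `returnLastPrediction` strategy, which is not the standard's default).

```python
import operator
import xml.etree.ElementTree as ET

# Hypothetical decision tree, reduced to the elements this sketch reads.
PMML_TEXT = """<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <TreeModel functionName="classification">
    <Node score="no">
      <True/>
      <Node score="yes">
        <SimplePredicate field="income" operator="greaterThan" value="50"/>
      </Node>
      <Node score="no">
        <SimplePredicate field="income" operator="lessOrEqual" value="50"/>
        <Node score="yes">
          <SimplePredicate field="owner" operator="equal" value="true"/>
        </Node>
        <Node score="no">
          <SimplePredicate field="owner" operator="equal" value="false"/>
        </Node>
      </Node>
    </Node>
  </TreeModel>
</PMML>"""

NUMERIC_OPS = {
    "lessThan": operator.lt,
    "lessOrEqual": operator.le,
    "greaterThan": operator.gt,
    "greaterOrEqual": operator.ge,
}


def local(tag):
    """Drop the XML namespace so the code works across PMML versions."""
    return tag.rsplit("}", 1)[-1]


def child_nodes(node):
    return [child for child in node if local(child.tag) == "Node"]


def simple_predicate(pred, record):
    """Evaluate one SimplePredicate against a record (a plain dict)."""
    raw, op, value = record[pred.get("field")], pred.get("operator"), pred.get("value")
    if op == "equal":
        return str(raw) == value
    if op == "notEqual":
        return str(raw) != value
    return NUMERIC_OPS[op](float(raw), float(value))


def matches(node, record):
    """True if the node's predicate holds; unsupported predicates never match."""
    for elem in node:
        kind = local(elem.tag)
        if kind == "True":
            return True
        if kind == "SimplePredicate":
            return simple_predicate(elem, record)
    return False


def score(node, record):
    """Descend into the first matching child; otherwise keep this node's score."""
    for child in child_nodes(node):
        if matches(child, record):
            return score(child, record)
    return node.get("score")


def load_root(pmml_text):
    """Return the root Node of the first TreeModel in the document."""
    doc = ET.fromstring(pmml_text)
    tree = next(e for e in doc.iter() if local(e.tag) == "TreeModel")
    return child_nodes(tree)[0]


if __name__ == "__main__":
    root = load_root(PMML_TEXT)
    unlabeled = [
        {"income": 60, "owner": "false"},
        {"income": 40, "owner": "true"},
        {"income": 40, "owner": "false"},
    ]
    print([score(root, rec) for rec in unlabeled])  # ['yes', 'yes', 'no']
```

Because the scoring code reads only the PMML document, it does not matter which tool produced the tree, as long as the exported file follows the standard.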

In addition, we describe other deployment solutions in this tutorial. We will see that Knime has its own PMML reader. It can apply a model to unlabeled datasets, whatever the tool used to build the model, provided that the PMML standard is respected. In this sense, Knime can be substituted for PDI-CE. Another possible solution is Weka, which is included in the Pentaho Community Edition suite: it can export the model in a proprietary format that PDI-CE can handle.

Keywords: model deployment, predictive model, pmml, decision tree, rapidminer 5.0.10, weka 3.7.2, knime 2.1.1, sipina 3.4
Tutorial: en_Tanagra_PDI_Model_Deployment.pdf
Data Mining Group, "PMML standard"
Pentaho, "Pentaho Kettle Project"
Pentaho, "Using the Weka Scoring Plugin"