Thursday, March 19, 2009

Cost-sensitive learning - Comparison of tools

Everyone agrees that taking into consideration the misclassification costs is an important aspect of the practice of Data Mining. For instance, diagnosing disease for a healthy person does not produce the same consequences as to predict health for an ill person. Yet despite its importance, the topic is seldom addressed, both from a theoretical point of view i.e. how to integrate cost during the evaluation of models (easy) and their construction (a little less easy); and from the practical point of view i.e. how to implement the approach in software.

Using the misclassification cost during the classifier evaluation is easy. We make a cross-product between the misclassification cost matrix and the confusion matrix. We obtain an "expected misclassification cost" (or an expected gain if we multiply the result by -1). Its interpretation is not very easy. It is mainly used for the comparison of models.

Handling costs during the learning process is less usual. Several approaches are possible. In this tutorial, we show how to use some components of Tanagra intended to cost-sensitive supervised learning on a real (realistic) dataset. We also programmed the same procedures in the R software (http://www.r-project.org/) to give a better visibility on what is implemented. We compare our results with those of Weka. The algorithm underlying our analysis is a decision tree. According to the software, we use C4.5, CART or J48.

Keywords: supervised learning, cost sensitive learning, misclassification cost matrix, decision tree algorithm, Weka 3.5.8, R 2.8.0, rpart package
Tutorial: en_Tanagra_Cost_Sensitive_Learning.pdf
Dataset: dataset-dm-cup-2007.zip
References:
J.H. Chauchat, R. Rakotomalala, M. Carloz, C. Pelletier, "Targeting Customer Groups using Gain and Cost Matrix: a Marketing Application", PKDD-2001.
"Cost-sensitive Decision Tree", Tutorials for Sipina.