Tuesday, January 27, 2009

Performance comparison under Linux

The gain chart is an alternative to the confusion matrix for evaluating a classifier. Its name varies from one tool to another (e.g. lift curve, lift chart, cumulative gain chart, etc.).

The main idea is to build a graph where the X coordinate is the percentage of the population and the Y coordinate is the percentage of the individuals with the positive value of the class attribute. The gain chart is used mainly in marketing, where we want to detect potential customers, but it can be used in other situations.
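
To make the two axes concrete, here is a minimal Python sketch of how the points of such a chart can be computed. It assumes each individual already has a score from a classifier and that individuals are ranked by decreasing score, which is the usual construction. The function name `cumulative_gain`, its parameters and the toy data are hypothetical and chosen for illustration only; this is not the code used by Tanagra or by any of the other tools.

```python
def cumulative_gain(scores, labels, positive=1):
    """Return the (x, y) points of a cumulative gain chart.

    x = fraction of the population targeted (ranked by decreasing score)
    y = fraction of all positive individuals found in that fraction
    Hypothetical helper written for illustration only.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    total_pos = sum(1 for y in labels if y == positive)
    if total_pos == 0:
        raise ValueError("no positive individual in the data")

    # Rank individuals from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    n = len(scores)
    xs, ys = [0.0], [0.0]  # the chart starts at the origin
    found = 0
    for rank, i in enumerate(order, start=1):
        if labels[i] == positive:
            found += 1
        xs.append(rank / n)           # share of the population
        ys.append(found / total_pos)  # share of the positives
    return xs, ys


# Toy data (made up): four individuals, two of them positive.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
toy_x, toy_y = cumulative_gain(toy_scores, toy_labels)
```

Plotting `toy_y` against `toy_x` gives the gain chart. The diagonal from (0, 0) to (1, 1) corresponds to a random targeting, so a useful scoring model gives a curve that lies above it.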

The construction of the gain chart was already outlined in a previous tutorial (see http://data-mining-tutorials.blogspot.com/2008/11/lift-curve-coil-challenge-2000.html). In this tutorial, we extend the description to other data mining tools (Knime, RapidMiner, Weka and Orange). The second novelty of this tutorial is that we run the experiment under Linux (French version of Ubuntu 8.10; see http://data-mining-tutorials.blogspot.com/2009/01/tanagra-under-linux.html for the installation and use of Tanagra under Linux). The third novelty is that we handle a large dataset with 2,000,000 examples and 41 variables. It is interesting to study how these tools behave in this configuration, especially because our computer is not very powerful. We note that some tools failed to complete the analysis on the full dataset.

Keywords: scoring, linear discriminant analysis, naive bayes classifier, lift curve, gain chart, cumulative gain chart, knime, rapidminer, weka, orange
Tutorial: en_Tanagra_Gain_Chart.pdf
Dataset: dataset_gain_chart.zip