Friday, October 31, 2008

Computing ROC Curve

TANAGRA, ORANGE and WEKA are free data mining software packages. Each represents a sequence of treatments as a stream diagram or a knowledge flow. The packages differ in small ways. Nevertheless, we show that, in spite of these differences, they often handle the same problems and present their results in very similar ways. In this tutorial, we build a ROC curve from a logistic regression.

Regardless of the software we use, commercial software included, we have to go through the following steps when we want to build a ROC curve.
• Import the dataset into the software;
• Compute descriptive statistics;
• Select target and input attributes;
• Select the “positive” value of the target attribute;
• Split the dataset into a learning set (e.g. 70%) and a test set (30%);
• Choose the learning algorithm. Be careful: the packages can have different implementations and may present slightly different results;
• Build the prediction model on the learning set and visualize the results;
• Build the ROC curve on the test set.

Depending on the software, the order of operations can differ, but it is clear that we must, explicitly or not, go through each of these steps.
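To make the split and the final step concrete outside of any of the three tools, here is a minimal sketch in plain Python. It is not how TANAGRA, ORANGE or WEKA implement the computation; the function names (split_learning_test, roc_points, auc) are hypothetical, and the labels and scores are made-up values standing in for the output of a classifier such as a logistic regression applied to the test set.

```python
import random

def split_learning_test(rows, learning_fraction=0.7, seed=0):
    """Shuffle the rows and split them into a learning set and a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(learning_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def roc_points(labels, scores, positive):
    """Return the ROC curve as (false positive rate, true positive rate) pairs."""
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("the test set must contain both classes")
    # Rank the test instances from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == positive:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last instance of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up test-set labels and scores, for illustration only.
demo_labels = ["yes", "no", "yes", "yes", "no", "no"]
demo_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
demo_curve = roc_points(demo_labels, demo_scores, positive="yes")
demo_auc = auc(demo_curve)
```

The "positive" value is passed explicitly because the curve depends on which class value is treated as positive, which is why selecting it appears as a separate step in the list above. Tied scores are grouped so that the curve does not depend on the arbitrary order of instances with equal scores.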

Keywords: supervised learning, roc curve, classifier assessment, orange, weka, training set, learning set, test set
Components: MORE UNIVARIATE CONT STAT, SAMPLING, SUPERVISED LEARNING, LOG-REG TRIRLS, SCORING, ROC CURVE
Tutorial: en_Tanagra_Orange_Weka_Roc_curve.pdf
Dataset: ds1_10.zip
References:
R. Rakotomalala – « Courbe ROC » (fr)
T. Fawcett – « ROC Graphs: Notes and Practical Considerations for Researchers »