Thursday, November 26, 2009

Three curves for classifier assessment

Evaluating classifiers is an important step in the supervised learning process: we want to measure how well the classifier performs. On the one hand, there are the confusion matrix and its associated indicators, which are very popular in academic publications. On the other hand, in real applications, users prefer curves that can seem mysterious to people outside the domain (e.g. the ROC curve for epidemiologists, the gain chart or cumulative lift curve in marketing, the precision-recall curve in information retrieval, etc.).

In this tutorial, we first detail how these curves are calculated by building them "by hand" in a spreadsheet. Then we use Tanagra 1.4.33 and R 2.9.2 to obtain them. We use the curves to compare the performance of logistic regression and a support vector machine (radial basis function kernel).
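To give an idea of what the "by hand" calculation involves, here is a minimal sketch in plain Python. It is not the spreadsheet, Tanagra, or R procedure from the tutorial. The function names (`curve_points`, `auc`) and the six-row toy data are made up for illustration and are not taken from the heart disease dataset. The sketch assumes binary labels (1 = positive), at least one example of each class, and no tied scores.

```python
def curve_points(labels, scores):
    """Build ROC, gain chart and precision-recall points from labels and scores.

    Assumptions of this sketch: labels are 0/1 with 1 = positive,
    both classes are present, and no two scores are equal.
    """
    # Rank the examples by decreasing score, as one would sort rows in a spreadsheet.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos

    tp = fp = 0
    roc = [(0.0, 0.0)]   # (false positive rate, true positive rate)
    gain = [(0.0, 0.0)]  # (fraction of examples targeted, true positive rate)
    pr = []              # (recall, precision)

    # Move the cutoff down the ranked list one example at a time.
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        tpr = tp / n_pos  # recall, also called sensitivity
        roc.append((fp / n_neg, tpr))
        gain.append((rank / n, tpr))
        pr.append((tpr, tp / rank))
    return roc, gain, pr


def auc(roc):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(roc, roc[1:]))


# Hypothetical toy data: six examples, three positives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
roc, gain, pr = curve_points(labels, scores)
area = auc(roc)
```

The three curves share the same ranked list and differ only in what is placed on each axis, which is why a single pass over the sorted scores is enough. To compare two classifiers, one would run the same function on each model's scores for the same test examples and overlay the resulting points.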

Keywords: roc curve, gain chart, precision recall curve, lift curve, logistic regression, support vector machine, svm, radial basis function kernel, rbf kernel, e1071 package, R software, glm
Components: DISCRETE SELECT EXAMPLES, BINARY LOGISTIC REGRESSION, SCORING, C-SVC, ROC CURVE, LIFT CURVE, PRECISION-RECALL CURVE
Tutorial: en_Tanagra_Spv_Learning_Curves.pdf
Dataset: heart_disease_for_curves.zip