Thursday, July 9, 2009

Resampling methods for error estimation

The ability to predict correctly is one of the most important criteria to evaluate classifiers in supervised learning. The preferred indicator is the error rate (1 - accuracy rate). It states the probability of misclassification of a classifier. In most cases we do not know the true error rate because we do not have the whole population and we do not know the probability distribution of the data. So we need to compute estimation from the available dataset.

In the small sample context, it is preferable to implement the resampling approaches for error rate estimation. In this tutorial, we study the behavior of the cross validation (cv), leave one out (lvo) and bootstrap (boot). All of them are based on the repeated train-test process, but in different configurations. We keep in mind that the aim is to evaluate the error rate of the classifier created on the whole sample. Thus, the intermediate classifiers computed on each learning session are not really interesting. This is the reason for which they are rarely provided by the data mining tools.

The main supervised learning method used is the linear discriminant analysis (LDA). We will see at the end of this tutorial that the behavior observed for this learning approach is not the same if we use another approach such as a decision tree learner (C4.5).

Keywords: resampling, generalization error rate, cross validation, bootstrap, leave one out, linear discriminant analysis, C4.5
Components: Supervised Learning, Cross-validation, Bootstrap, Test, Leave-one-out, Linear discriminant analysis, C4.5
Tutorial: en_Tanagra_Resampling_Error_Estimation.pdf
"What are cross validation and bootstrapping?"