Tuesday, June 2, 2015

Cross-validation, leave-one-out, bootstrap (slides)

In supervised learning, it is commonly accepted that one should not use the same sample to build a predictive model and to estimate its error rate. The error measured under these conditions, called the resubstitution error rate, is (very often) too optimistic, leading one to believe that the model will perform excellently in prediction.
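To make this optimism concrete, here is a minimal sketch using scikit-learn and a simulated dataset (neither appears in the original post, which relies on the Tanagra components): a fully grown decision tree scored on the very data it was fit on reports an error rate near zero, whatever its true generalization performance.

```python
# Minimal sketch (illustrative, not the Tanagra workflow): the resubstitution
# error rate of a flexible learner, measured on its own training data.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
resub_error = 1.0 - clf.score(X, y)   # error on the training sample itself
print(f"resubstitution error rate: {resub_error:.3f}")  # typically ~0.000
```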

A typical approach is to divide the data into two parts (holdout approach): a first sample, called the training sample, is used to build the model; a second sample, called the test sample, is used to measure its performance. The measured error rate honestly reflects the behavior of the model in generalization. Unfortunately, this approach is problematic on small datasets. By reducing the amount of data presented to the learning algorithm, we cannot correctly learn the underlying relation between the descriptors and the class attribute; at the same time, since the part devoted to testing remains limited, the measured error has a high variance.
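A minimal holdout sketch under the same assumptions (scikit-learn, simulated data): the data is split once into a training and a test sample, the model is fit on the first part only, and the error is measured on the held-out part.

```python
# Holdout sketch (illustrative): one split into training and test samples.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 2/3 of the observations for training, 1/3 held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
holdout_error = 1.0 - clf.score(X_test, y_test)  # error on unseen data
print(f"holdout error rate: {holdout_error:.3f}")
```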

In this document, I present resampling techniques (cross-validation, leave-one-out and bootstrap) for estimating the error rate of a model built from all of the available data. A study on simulated data (the "waves" dataset; Breiman et al., 1984) is used to analyze the behavior of these approaches with various learning algorithms (decision trees, linear discriminant analysis, and a neural network [perceptron]).
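The sketch below illustrates the three resampling estimators under stated assumptions: scikit-learn is used, a generic make_classification dataset stands in for the "waves" data (which scikit-learn does not ship), linear discriminant analysis is the learner, and the bootstrap shown is the plain out-of-bag estimate rather than necessarily the exact variant studied in the slides.

```python
# Sketch of cross-validation, leave-one-out and bootstrap error estimation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.utils import resample

X, y = make_classification(n_samples=300, n_features=21, n_informative=10,
                           n_classes=3, random_state=0)
clf = LinearDiscriminantAnalysis()

# 10-fold cross-validation: average error over the 10 held-out folds.
cv_acc = cross_val_score(clf, X, y,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0))
print(f"10-fold CV error:    {1.0 - cv_acc.mean():.3f}")

# Leave-one-out: n folds of size 1 (costly when n is large).
loo_acc = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(f"leave-one-out error: {1.0 - loo_acc.mean():.3f}")

# Out-of-bag bootstrap: fit on a sample drawn with replacement, score on
# the observations left out of that sample, average over B replicates.
rng = np.random.RandomState(0)
n, errors = len(y), []
for _ in range(30):  # B = 30 bootstrap replicates
    idx = resample(np.arange(n), replace=True, random_state=rng)
    oob = np.setdiff1d(np.arange(n), idx)  # observations not drawn
    fitted = LinearDiscriminantAnalysis().fit(X[idx], y[idx])
    errors.append(1.0 - fitted.score(X[oob], y[oob]))
print(f"bootstrap error:     {np.mean(errors):.3f}")
```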

Keywords: resampling, cross-validation, leave-one-out, bootstrap, error rate estimation, holdout, resubstitution, train, test, learning sample
Components (Tanagra): CROSS-VALIDATION, BOOTSTRAP, TEST, LEAVE-ONE-OUT
Slides: Error rate estimation
References:
A. Molinaro, R. Simon, R. Pfeiffer, "Prediction error estimation: a comparison of resampling methods", Bioinformatics, 21(15), pages 3301-3307, 2005.
Tanagra tutorial, "Resampling methods for error estimation", July 2009.