Monday, July 12, 2010

Supervised learning from imbalanced dataset

In real problems, the classes are not equally represented in dataset. The instances corresponding to positive class, the one that we want to detect often, are few. For instance, in a fraud detection problem, there are a very few cases of fraud comparing to the large number of honest connections; in a medical problem, the ill persons are fortunately rare; etc. In these situations, using the standard learning process and assessing the classifier with the confusion matrix and the misclassification rate are not appropriate. We observe that the default classifier consisting to assign all the instances to the majority class is the one which minimizes the error rate.

For the dataset that we analyze in this tutorial, 1.77% of all the examples belong to the positive class. If we assign all the instances to the negative class - this is the default classifier - the misclassification rate is 1.77%. It is difficult to find a classifier which is able to do better. Even if we know that we have not a good classifier, especially because it does not supply a degree of membership to the classes (Note: in fact, it assigns the same degree of membership to all the instances).

A strategy enables to improve the behavior of the learning algorithms facing to the imbalance problem is to artificially balance the dataset. We can do this by eliminating some instances of the over-sized class (downsizing) or by duplicating some instances of the small class (over sampling). But few persons analyze the consequence of this solution on the performance of the classifier.

In this tutorial, we highlight the consequences of the downsizing on the behavior of the logistic regression.

Keywords: imbalanced dataset, logistic regression, over sampling, under sampling
Tutorial : en_Tanagra_Imbalanced_Dataset.pdf
Dataset : imbalanced_dataset.xls
References :
D. Hosmer, S. Lemeshow, « Applied Logistic Regression », John Wiley &Sons, Inc, Second Edition, 2000.