Tanagra - Data Mining and Data Science Tutorials: July 2010

Saturday, July 24, 2010

Naive Bayes classifier for discrete predictors

The naive Bayes approach is a supervised learning method based on a simplistic hypothesis: given the class, the presence (or absence) of a particular feature is assumed to be unrelated to the presence (or absence) of any other feature. Despite this, it proves robust and efficient, and its performance is comparable to that of other supervised learning techniques.

Tanagra (version 1.4.36 and later) introduces a new presentation of the results of the learning process. The classifier is easier to understand, and its deployment is also made easier.

In the first part of this tutorial, we present some theoretical aspects of the naive Bayes classifier. Then, we apply the approach to a dataset with Tanagra. We compare the results obtained (the parameters of the model) with those of other linear approaches such as logistic regression, linear discriminant analysis and the linear SVM. We note that the results are highly consistent. This largely explains the good performance of the method in comparison to others.
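
To illustrate why naive Bayes with discrete predictors can be read as a linear classifier, here is a minimal sketch in plain Python. It is not the Tanagra implementation: the function names, the Laplace smoothing and the toy data are all assumptions made for illustration. It fits the conditional probabilities, then rewrites the log-odds of two classes as an intercept plus one weight per (predictor, value) indicator, which is the form that can be compared with the coefficients of the other linear methods.

```python
import math
from collections import Counter

def fit_naive_bayes(rows, labels, smoothing=1.0):
    """Estimate log priors and Laplace-smoothed log conditional probabilities."""
    classes = sorted(set(labels))
    n_features = len(rows[0])
    # Distinct values observed for each discrete predictor.
    values = [sorted({row[j] for row in rows}) for j in range(n_features)]
    class_counts = Counter(labels)
    cell_counts = Counter((y, j, v) for row, y in zip(rows, labels)
                          for j, v in enumerate(row))
    log_prior = {c: math.log(class_counts[c] / len(labels)) for c in classes}
    log_cond = {}
    for c in classes:
        for j in range(n_features):
            denom = class_counts[c] + smoothing * len(values[j])
            for v in values[j]:
                log_cond[(c, j, v)] = math.log(
                    (cell_counts[(c, j, v)] + smoothing) / denom)
    return classes, log_prior, log_cond

def class_scores(model, row):
    """Log prior plus the sum of log conditional probabilities, per class."""
    classes, log_prior, log_cond = model
    return {c: log_prior[c] + sum(log_cond[(c, j, v)] for j, v in enumerate(row))
            for c in classes}

def linear_form(model, positive, negative):
    """Intercept and one weight per (predictor, value) indicator for the log-odds."""
    _, log_prior, log_cond = model
    intercept = log_prior[positive] - log_prior[negative]
    weights = {(j, v): log_cond[(positive, j, v)] - log_cond[(negative, j, v)]
               for (c, j, v) in log_cond if c == positive}
    return intercept, weights

# Hypothetical toy data: two discrete predictors and a binary target.
rows = [("yes", "high"), ("yes", "low"), ("no", "high"),
        ("no", "low"), ("yes", "high"), ("no", "low")]
labels = ["presence", "presence", "absence", "absence", "presence", "absence"]

model = fit_naive_bayes(rows, labels)
intercept, weights = linear_form(model, "presence", "absence")

row = ("yes", "low")
scores = class_scores(model, row)
log_odds = scores["presence"] - scores["absence"]
# Same quantity, computed as a linear function of the 0/1 indicators.
linear_score = intercept + sum(weights[(j, v)] for j, v in enumerate(row))
predicted = "presence" if linear_score > 0 else "absence"
```

The sketch only handles predictor values seen during training; an unseen value would raise a KeyError.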

In the second part, we use various tools on the same dataset (Weka 3.6.0, R 2.9.2, Knime 2.1.1, Orange 2.0b and RapidMiner 4.6.0). Above all, we try to understand the results obtained.

Keywords: naive bayes, linear classifier, linear discriminant analysis, logistic regression, linear support vector machine, svm
Components: NAIVE BAYES, LINEAR DISCRIMINANT ANALYSIS, BINARY LOGISTIC REGRESSION, SVM, 0_1_BINARIZE
Tutorial: en_Tanagra_Naive_Bayes_Classifier_Explained.pdf
Dataset: heart_for_naive_bayes.zip
References :
Wikipedia, "Naive bayes classifier".
T. Mitchell, "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression", in Machine Learning, Chapter 1, 2005.

Wednesday, July 21, 2010

Interactive decision tree learning with Spad

In this tutorial, we look at SPAD, a French software package specialized in exploratory data analysis which has evolved considerably in recent years. We perform a sequence of analyses on a dataset spread over 3 worksheets of an Excel data file: (1) we create a classification tree from the learning sample in the first worksheet, we analyze some nodes of the tree in depth to highlight the characteristics of the instances they cover, and we interactively (manually) modify the properties of some splitting operations; (2) we apply the classifier to the unseen cases of the second worksheet; (3) we compare the predictions of the model with the actual values of the target attribute contained in the third worksheet.
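
Step (3), comparing the predictions with the actual values, amounts to a cross-tabulation. The sketch below shows the idea in plain Python; it is independent of SPAD, and the label values and the two lists are hypothetical placeholders for the contents of the second and third worksheets.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) pair of labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))

def error_rate(actual, predicted):
    """Proportion of instances whose prediction differs from the actual value."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

# Hypothetical values: predictions of the tree on the unseen cases,
# and the true target values kept aside in a separate worksheet.
predicted = ["positive", "negative", "negative", "positive", "negative"]
actual = ["positive", "negative", "positive", "positive", "negative"]

matrix = confusion_matrix(actual, predicted)
rate = error_rate(actual, predicted)

for (a, p), count in sorted(matrix.items()):
    print(f"actual={a:<9} predicted={p:<9} count={count}")
print(f"error rate = {rate:.2f}")
```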

Of course, we can perform this process using free tools such as SIPINA (for the interactive construction of the tree) or R (for programming the sequence of operations, in particular applying the model to an unlabeled dataset). But with Spad or other commercial tools (e.g. SPSS Modeler, SAS Enterprise Miner, STATISTICA Data Miner…), we can very easily specify the whole sequence, even if we are not especially familiar with data mining tools.

Keywords: decision tree, classification tree, interactive decision tree, spad, sipina, r software
Tutorial: en_Tanagra_Arbres_IDT_Spad.pdf
Dataset: pima-arbre-spad.zip
References :
SPAD, http://www.spad.eu/
SIPINA, http://eric.univ-lyon2.fr/~ricco/sipina.html
R Project, http://www.r-project.org/

Monday, July 12, 2010

Supervised learning from an imbalanced dataset

In real problems, the classes are often not equally represented in the dataset. The instances of the positive class, the one that we want to detect, are often few. For instance, in a fraud detection problem, there are very few cases of fraud compared with the large number of honest connections; in a medical problem, ill people are fortunately rare; etc. In these situations, using the standard learning process and assessing the classifier with the confusion matrix and the misclassification rate is not appropriate. We observe that the default classifier, which assigns all the instances to the majority class, is the one that minimizes the error rate.

For the dataset that we analyze in this tutorial, 1.77% of all the examples belong to the positive class. If we assign all the instances to the negative class (this is the default classifier), the misclassification rate is 1.77%. It is difficult to find a classifier that does better, even though we know that the default classifier is not a good one, especially because it does not supply a usable degree of membership to the classes (note: in fact, it assigns the same degree of membership to all the instances).
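
The arithmetic behind this observation can be checked with a few lines of Python. The total of 10,000 instances below is an assumption chosen only to make the 1.77% proportion concrete; the function name is also hypothetical.

```python
def default_classifier_report(n_total, n_positive):
    """Indicators for a classifier that assigns every instance to the negative class."""
    n_negative = n_total - n_positive
    true_positives = 0             # no instance is ever predicted positive
    false_negatives = n_positive   # every positive instance is missed
    true_negatives = n_negative    # every negative instance is correct
    return {
        "error_rate": false_negatives / n_total,
        "sensitivity": true_positives / n_positive,
        "specificity": true_negatives / n_negative,
    }

# Assumed sample size; only the 1.77% proportion comes from the text.
report = default_classifier_report(n_total=10_000, n_positive=177)
print(report)
```

The error rate equals the proportion of positives, while the sensitivity is zero: the classifier looks accurate but detects none of the cases of interest.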

One strategy for improving the behavior of learning algorithms facing the imbalance problem is to artificially balance the dataset. We can do this by eliminating some instances of the over-represented class (downsizing, also called under sampling) or by duplicating some instances of the small class (over sampling). But few people analyze the consequences of this solution on the performance of the classifier.

In this tutorial, we highlight the consequences of downsizing on the behavior of logistic regression.
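
As a rough illustration of downsizing, the sketch below keeps all the positive instances and a random fraction of the negative ones. It is not taken from the tutorial: the function names, the keep rate and the data are assumptions. The second function relies on a standard result for logistic regression when the sampling depends only on the class: the slopes are, asymptotically, unaffected, and only the intercept is shifted by the log of the ratio of the sampling rates.

```python
import math
import random

def downsize(rows, labels, positive, keep_rate, seed=0):
    """Keep every positive instance and each negative one with probability keep_rate."""
    rng = random.Random(seed)
    kept = [(x, y) for x, y in zip(rows, labels)
            if y == positive or rng.random() < keep_rate]
    return [x for x, _ in kept], [y for _, y in kept]

def corrected_intercept(sample_intercept, keep_rate):
    """Map an intercept fitted on the downsized sample back to the full population.

    Positives are sampled at rate 1 and negatives at rate keep_rate, so the
    fitted intercept is too large by log(1 / keep_rate).
    """
    return sample_intercept + math.log(keep_rate)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: 10,000 instances, 1.77% of them positive (label 1).
labels = [1] * 177 + [0] * 9823
rows = list(range(len(labels)))

sample_rows, sample_labels = downsize(rows, labels, positive=1, keep_rate=0.02)
n_pos = sum(1 for y in sample_labels if y == 1)
n_neg = len(sample_labels) - n_pos
print(f"downsized sample: {n_pos} positives, {n_neg} negatives")

# Hypothetical intercept fitted on the downsized sample.
sample_intercept = 0.1
p_sample = sigmoid(sample_intercept)
p_corrected = sigmoid(corrected_intercept(sample_intercept, keep_rate=0.02))
print(f"score before correction: {p_sample:.3f}, after: {p_corrected:.3f}")
```

Because the correction only shifts every score by the same constant on the logit scale, the ranking of the instances is unchanged; only the estimated probabilities move.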

Keywords: imbalanced dataset, logistic regression, over sampling, under sampling
Components: BINARY LOGISTIC REGRESSION, DISCRETE SELECT EXAMPLES, SCORING, RECOVER EXAMPLES, ROC CURVE, TEST
Tutorial: en_Tanagra_Imbalanced_Dataset.pdf
Dataset: imbalanced_dataset.xls
References :
D. Hosmer, S. Lemeshow, "Applied Logistic Regression", John Wiley & Sons, Inc., Second Edition, 2000.