Thursday, November 11, 2010

Naive bayes classifier for continuous predictors

The naive bayes classifier is a very popular approach even though it rests on an (apparently) unrealistic assumption: the predictors are mutually independent conditionally on the values of the target attribute. The main reason for this popularity is that the method has proved to be as accurate as other well-known approaches, such as linear discriminant analysis or logistic regression, on the majority of real datasets.

But an obstacle to the use of the naive bayes classifier remains when we deal with a real problem: it seems that we cannot provide an explicit model for its deployment. The representation proposed by the PMML standard, for instance, is particularly unattractive. Interpreting the model, especially detecting the influence of each descriptor on the prediction of the classes, seems impossible.

This assertion is not entirely true. We showed in a previous tutorial that we can extract an explicit model from the naive bayes classifier in the case of discrete predictors (see references): we obtain a linear combination of the binarized predictors. In this document, we show that the same mechanism can be implemented for continuous descriptors. We use the standard Gaussian assumption for the conditional distribution of the descriptors. Depending on whether we make the heteroscedastic or the homoscedastic assumption, we obtain a quadratic model or a linear model. The latter is especially interesting because it can be compared directly with the other linear classifiers (through the sign and the values of the coefficients of the linear combination).
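To make the homoscedastic case concrete, here is a minimal sketch in Python (it is not Tanagra's code). Under the Gaussian assumption with one variance s_j^2 per predictor shared by all classes, the terms that do not depend on the class cancel out, and the score of class k reduces to d(k, x) = ln P(k) + sum_j [ (m_kj / s_j^2) * x_j - m_kj^2 / (2 * s_j^2) ], where m_kj is the mean of predictor j in class k. The function names, the toy data, and the choice of n - K as the denominator of the pooled variance are assumptions made for the illustration; the tutorial itself should be consulted for the exact estimators used.

```python
import math
from collections import defaultdict


def fit_linear_gaussian_nb(rows, labels):
    """Return {class: (intercept, [coef_1, ..., coef_p])}.

    Hypothetical helper: Gaussian naive bayes with a pooled variance
    per predictor, which yields one linear score function per class.
    """
    n, p = len(rows), len(rows[0])
    groups = defaultdict(list)
    for x, c in zip(rows, labels):
        groups[c].append(x)

    # Class-conditional mean of each predictor.
    means = {c: [sum(col) / len(xs) for col in zip(*xs)]
             for c, xs in groups.items()}

    # Pooled within-class variance of each predictor (homoscedasticity).
    # Assumption: denominator n - K; a constant predictor would give a
    # zero variance and is not handled in this sketch.
    pooled = [0.0] * p
    for c, xs in groups.items():
        for x in xs:
            for j in range(p):
                pooled[j] += (x[j] - means[c][j]) ** 2
    pooled = [s / (n - len(groups)) for s in pooled]

    model = {}
    for c, xs in groups.items():
        coefs = [means[c][j] / pooled[j] for j in range(p)]
        intercept = math.log(len(xs) / n) - 0.5 * sum(
            means[c][j] ** 2 / pooled[j] for j in range(p))
        model[c] = (intercept, coefs)
    return model


def predict(model, x):
    """Assign x to the class with the highest linear score."""
    def score(c):
        intercept, coefs = model[c]
        return intercept + sum(w * v for w, v in zip(coefs, x))
    return max(model, key=score)


# Toy data invented for the illustration (two predictors, two classes).
rows = [[1.0, 2.0], [2.0, 1.0], [1.5, 1.5],
        [4.0, 5.0], [5.0, 4.0], [4.5, 4.5]]
labels = ["neg", "neg", "neg", "pos", "pos", "pos"]

model = fit_linear_gaussian_nb(rows, labels)
for c, (intercept, coefs) in model.items():
    print(c, round(intercept, 3), [round(w, 3) for w in coefs])
print(predict(model, [1.0, 1.0]), predict(model, [5.0, 5.0]))
```

With two classes, subtracting one score function from the other gives a single linear equation whose coefficients can be read like those of a logistic regression or a linear discriminant analysis. Keeping one variance per class instead of the pooled one leaves an x_j^2 term in the score, which is why the heteroscedastic assumption leads to a quadratic model.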

This tutorial is organized as follows. In the next section, we describe the approach. In section 3, we show how to implement the method with Tanagra 1.4.37 (and later), and we compare the results to those of the other linear methods. In section 4, we compare the results provided by various data mining tools. We note that none of them provides an explicit model that would be easy to deploy: they give only the estimated parameters of the conditional Gaussian distributions (mean and standard deviation). Last, in section 5, we show the advantage of the naive bayes classifier over the other linear methods when we handle a large dataset (the "mutant" dataset, with 16,592 instances and 5,408 predictors). Both the computation time and the memory usage are clearly in its favor.
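A back-of-the-envelope calculation suggests why memory usage can differ so much with 5,408 predictors. The sketch below compares the number of values stored by a naive bayes model with pooled variances to the size of a full covariance matrix, such as the one a linear discriminant analysis has to estimate. The number of classes (2) and the 8-byte storage per value are assumptions made for the illustration, and the figures are simple counts, not measurements from the tutorial.

```python
n_predictors = 5408    # predictors in the "mutant" dataset
n_classes = 2          # assumption for the illustration
bytes_per_value = 8    # assumption: double-precision storage

# Naive bayes: one mean per class and predictor, one pooled variance per predictor.
nb_values = n_classes * n_predictors + n_predictors

# A full covariance matrix has one entry per pair of predictors.
cov_values = n_predictors * n_predictors

print("naive bayes values:", nb_values,
      "(about %.2f MiB)" % (nb_values * bytes_per_value / 2**20))
print("covariance matrix values:", cov_values,
      "(about %.0f MiB)" % (cov_values * bytes_per_value / 2**20))
```

The naive bayes statistics can also be accumulated in a single pass over the instances, whereas a method that relies on a full matrix usually has to build it and then invert or factorize it.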

Keywords: naive bayes classifier, rapidminer 5.0.10, weka 3.7.2, knime 2.2.2, R software, package e1071, linear discriminant analysis, pls discriminant analysis, linear svm, logistic regression
Tutorial: en_Tanagra_Naive_Bayes_Continuous_Predictors.pdf
Dataset: breast ; low birth weight
References:
Wikipedia, "Naive bayes classifier"
Tanagra, "Naive bayes classifier for discrete predictors"