Tuesday, March 30, 2010

"Wrapper" for feature selection

Feature selection is a crucial aspect of the supervised learning process. We must determine which variables are relevant for predicting the target variable. Indeed, a simpler model is easier to understand and interpret; deployment is facilitated, since less information needs to be collected for prediction; finally, a simpler model is often more robust in generalization, i.e. when we want to classify an unseen instance from the population.

Three kinds of approaches are often highlighted in the literature. Among them, the WRAPPER approach explicitly uses a performance criterion during the search for the best subset of descriptors. Most often this is the error rate, but in reality any criterion can be used: the misclassification cost if we use a cost matrix, the area under the curve (AUC) when we assess the classifier using ROC curves, etc. In this approach, the learning method is treated as a black box. We try various subsets of predictors and choose the one that optimizes the criterion.
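The black-box search described above can be sketched in a few lines. The following Python snippet is a minimal illustration of a greedy forward wrapper search; the evaluation function and the feature relevance numbers are invented stand-ins for a cross-validated performance criterion, not the actual code used in the tutorial.

```python
def forward_wrapper(features, evaluate):
    """Greedy forward search: at each step, add the feature that most
    improves the black-box performance criterion; stop as soon as no
    candidate brings an improvement."""
    selected, remaining = [], list(features)
    best_score = evaluate(selected)
    while remaining:
        # score every one-feature extension of the current subset
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, feat = max(scored)
        if score <= best_score:  # no improvement -> stop the search
            break
        best_score = score
        selected.append(feat)
        remaining.remove(feat)
    return selected, best_score

# toy criterion (hypothetical): each feature has a fixed 'relevance',
# and larger subsets pay a small penalty, mimicking estimated accuracy
relevance = {"odor": 0.30, "spore_color": 0.15, "habitat": 0.05, "noise": 0.0}

def toy_accuracy(subset):
    return 0.5 + sum(relevance[f] for f in subset) - 0.02 * len(subset)

subset, score = forward_wrapper(relevance, toy_accuracy)
print(subset)  # features listed in the order they were added
```

The learning method never appears explicitly: only `evaluate` sees it, which is exactly what makes the wrapper strategy generic with respect to the classifier and the criterion.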

In this tutorial, we implement the WRAPPER approach with SIPINA and R 2.9.2. For the latter, we give the source code for a forward search strategy. Readers can easily adapt the program to other datasets. Moreover, a careful reading of the R source code gives a better understanding of the calculations performed internally by SIPINA.

The WRAPPER strategy is a priori the best, since it explicitly optimizes the performance criterion. We check this by comparing its results with those provided by the FILTER approach (the FCBF method) available in TANAGRA. The conclusions are not as clear-cut as one might think.

Keywords: feature selection, supervised learning, naive bayes classifier, wrapper, fcbf, sipina, R software, RWeka package
Tutorial: en_Tanagra_Sipina_Wrapper.pdf
Dataset: mushroom_wrapper.zip
References:
JMLR Special Issue on Variable and Feature Selection - 2003
R. Kohavi, G. John, "Wrappers for feature subset selection", Artificial Intelligence, 1997.

Tuesday, March 23, 2010

Tanagra - Version 1.4.36

ReliefF is a component for automatic variable selection in a supervised learning task. It handles both continuous and discrete descriptors and can be inserted before any supervised learning method.
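To give an idea of what this component computes, here is a compact Python sketch of the original Relief weighting scheme on which ReliefF is built (ReliefF generalizes it to k neighbors, several classes and missing values). The dataset and the normalization by feature range are illustrative assumptions, not TANAGRA's actual implementation.

```python
def relief_weights(X, y, n_features):
    """Basic Relief: for each instance, find its nearest neighbor of the
    same class (hit) and of the other class (miss); a feature gains
    weight when it differs on the miss and loses weight when it differs
    on the hit."""
    # per-feature ranges, to normalize continuous differences to [0, 1]
    spans = []
    for f in range(n_features):
        vals = [x[f] for x in X]
        spans.append((max(vals) - min(vals)) or 1.0)

    def diff(f, a, b):
        return abs(a[f] - b[f]) / spans[f]

    def dist(a, b):
        return sum(diff(f, a, b) for f in range(n_features))

    w = [0.0] * n_features
    for i, xi in enumerate(X):
        hit = min((x for j, x in enumerate(X) if j != i and y[j] == y[i]),
                  key=lambda x: dist(xi, x))
        miss = min((x for j, x in enumerate(X) if y[j] != y[i]),
                   key=lambda x: dist(xi, x))
        for f in range(n_features):
            w[f] += (diff(f, xi, miss) - diff(f, xi, hit)) / len(X)
    return w

# tiny example: feature 0 separates the classes, feature 1 is noise
X = [(0.0, 0.3), (0.1, 0.9), (1.0, 0.5), (0.9, 0.1)]
y = [0, 0, 1, 1]
w = relief_weights(X, y, 2)
print(w)  # the weight of feature 0 should clearly exceed that of feature 1
```

For a discrete descriptor, `diff` would simply return 0 or 1 depending on whether the two values match, which is why the method handles both attribute types uniformly.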

The Naive Bayes component was modified. It now describes the prediction model in an explicit form (as a linear combination), which is easy to understand and to deploy.
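The rewriting behind this is standard: for two classes, the naive Bayes decision reduces to the sign of an intercept plus one additive log-odds weight per (attribute, value) pair. The following Python sketch checks the equivalence on invented toy probabilities; the numbers and attribute names are purely illustrative.

```python
import math

# toy conditional probabilities p[attribute][value][class] (invented
# numbers, only to demonstrate the linear rewriting)
prior = {"+": 0.5, "-": 0.5}
p = {
    "odor":  {"foul": {"+": 0.8, "-": 0.2}, "none": {"+": 0.2, "-": 0.8}},
    "rings": {"one":  {"+": 0.6, "-": 0.4}, "two":  {"+": 0.4, "-": 0.6}},
}

# classic form: compare log P(y) + sum_j log P(x_j | y) across classes
def nb_score(instance, cls):
    return math.log(prior[cls]) + sum(
        math.log(p[a][v][cls]) for a, v in instance.items())

# linear form: an intercept plus one log-odds weight per (attribute,
# value); the sign of the total decides between the two classes
intercept = math.log(prior["+"] / prior["-"])
weight = {(a, v): math.log(pv["+"] / pv["-"])
          for a, vals in p.items() for v, pv in vals.items()}

def nb_linear(instance):
    return intercept + sum(weight[(a, v)] for a, v in instance.items())

x = {"odor": "foul", "rings": "two"}
# the linear score equals the difference of the two class scores
delta = nb_score(x, "+") - nb_score(x, "-")
print(nb_linear(x), delta)
```

Deployment then only requires the table of weights, which is what makes this explicit form easy to export outside the tool.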