Thursday, October 14, 2010

Filter methods for feature selection

The nature of the predictor selection process has changed considerably. Earlier work in machine learning focused on searching for the best subset of features for a classifier, in a context where the number of candidate features was fairly small and computing time was not a major constraint. Today, it is common to deal with datasets comprising thousands of descriptors. The feature selection problem still consists in finding the most relevant subset of predictors, but with a strong additional constraint: the computing time must remain reasonable.

In this tutorial, we are interested in correlation-based filter approaches for discrete predictors. The goal is to highlight the most relevant subset of predictors: those that are highly correlated with the target attribute and, at the same time, weakly correlated with one another, i.e. not redundant. To evaluate the behavior of the various methods, we use an artificial dataset to which we add irrelevant and redundant candidate variables. We then perform feature selection with each of the approaches analyzed, and we compare the generalization error rate of the naive Bayes classifier learned from the various subsets of selected variables. We first carry out the experiment with Tanagra; then we show how to perform the same analysis with other tools (Weka 3.6.0, Orange 2.0b, RapidMiner 4.6.0, R 2.9.2 - package FSelector).
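As an illustration of this kind of analysis, here is a minimal sketch in R using the FSelector package (CFS selection) and the e1071 package (naive Bayes). The file name "vote_filter_approach.txt" and the class column "Class" are assumptions made for the example, not the exact names used in the tutorial.

# minimal sketch, assuming the dataset is a tab-delimited text file
# with a discrete target column named "Class"
library(FSelector)   # correlation-based feature selection (cfs)
library(e1071)       # naive Bayes classifier

vote <- read.table("vote_filter_approach.txt", header = TRUE, sep = "\t")

# CFS keeps predictors strongly associated with the class
# and weakly associated with one another (non-redundant)
selected <- cfs(Class ~ ., vote)
print(selected)

# learn a naive Bayes classifier on the selected predictors only
f <- as.simple.formula(selected, "Class")
model <- naiveBayes(f, data = vote)
pred <- predict(model, vote)
print(table(pred, vote$Class))

In the tutorial itself, the error rate is estimated by bootstrap rather than by resubstitution as in this sketch; the sketch only shows how the selection step and the learning step fit together.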

Keywords: filter, feature selection, correlation based measure, discrete predictors, naive bayes classifier, bootstrap
Components: FEATURE RANKING, CFS FILTERING, MIFS FILTERING, FCBF FILTERING, MODTREE FILTERING, NAIVE BAYES, BOOTSTRAP
Tutorial: en_Tanagra_Filter_Method_Discrete_Predictors.pdf
Dataset: vote_filter_approach.zip
References:
Tanagra, "Feature Selection"