Thursday, November 6, 2008

Feature selection for logistic regression

In some circumstances, the goal of supervised learning is not to classify examples but to rank them so that the most interesting individuals come first. For instance, in a direct marketing campaign, we want to detect the customers who are the most likely to respond to the solicitation. In this context, the confusion matrix is not really suitable for evaluating the predictive model. A more appropriate tool relates the proportion of respondents reached to the proportion of individuals contacted: this is the “lift curve” (or “gain chart”).
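To make the idea concrete, here is a minimal sketch of how the points of a gain chart can be computed from model scores. It is not taken from the tutorial: the function name `gain_chart` and the toy scores and responses are hypothetical, chosen only for illustration.

```python
def gain_chart(scores, responses):
    """Return (fraction of individuals contacted, fraction of respondents reached) points.

    scores    -- predicted propensity to respond, one per individual
    responses -- observed outcome, 1 for a respondent and 0 otherwise
    """
    # Rank individuals from the highest score to the lowest.
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0], reverse=True)
    total_respondents = sum(responses)
    n = len(ranked)

    points = [(0.0, 0.0)]
    reached = 0
    for i, (_, response) in enumerate(ranked, start=1):
        reached += response
        points.append((i / n, reached / total_respondents))
    return points


# Toy data (made up for illustration): 8 customers, 4 of whom responded.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
responses = [1, 1, 0, 1, 0, 0, 1, 0]
curve = gain_chart(scores, responses)
```

On this toy data, contacting the top half of the ranked customers reaches 75% of the respondents, whereas a random selection of the same size would reach about 50% on average. The gap between the curve and the diagonal is what the gain chart lets us read.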

In this tutorial, we use binary logistic regression to construct the gain chart. We also show that variable selection is really useful when dealing with a large number of predictive variables.
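The principle of forward selection can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure implemented by the Forward-logit component: the model is fitted by plain gradient ascent, a variable is added when it improves the log-likelihood by at least 1.92 (half of 3.84, the 5% critical value of a chi-square with one degree of freedom, used here as a likelihood-ratio threshold), and all function names and the synthetic data are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logit(X, y, cols, steps=200, lr=0.5):
    """Fit an intercept plus the columns in `cols` by gradient ascent.

    Returns (weights, log-likelihood). A basic optimizer used for illustration only.
    """
    w = [0.0] * (len(cols) + 1)
    n = len(y)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, target in zip(X, y):
            feats = [1.0] + [row[c] for c in cols]
            err = target - sigmoid(sum(wi * fi for wi, fi in zip(w, feats)))
            for j, fj in enumerate(feats):
                grad[j] += err * fj
        # Step along the gradient of the mean log-likelihood.
        w = [wi + lr * g / n for wi, g in zip(w, grad)]

    ll = 0.0
    for row, target in zip(X, y):
        feats = [1.0] + [row[c] for c in cols]
        p = sigmoid(sum(wi * fi for wi, fi in zip(w, feats)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        ll += target * math.log(p) + (1 - target) * math.log(1.0 - p)
    return w, ll


def forward_select(X, y, min_gain=1.92):
    """Greedily add the column that most improves the log-likelihood.

    Stops when the best candidate improves it by less than `min_gain`.
    """
    selected = []
    _, best_ll = fit_logit(X, y, selected)  # intercept-only model
    remaining = list(range(len(X[0])))
    while remaining:
        candidates = [(fit_logit(X, y, selected + [c])[1], c) for c in remaining]
        ll, col = max(candidates)
        if ll - best_ll < min_gain:
            break
        selected.append(col)
        remaining.remove(col)
        best_ll = ll
    return selected


# Synthetic data (assumption): only column 1 influences the response.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(150)]
y = [1 if rng.random() < sigmoid(2.0 * row[1]) else 0 for row in X]
selected = forward_select(X, y)
```

In this sketch the informative column is picked first because it yields by far the largest gain in log-likelihood. Backward selection follows the opposite path: it starts from the full model and removes the least useful variable at each step.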

We use a real (or realistic) dataset from a website (see the reference below). It contains 2158 examples and 200 predictive attributes. The target variable indicates whether or not a consumer responded to a direct mail campaign for a specific product.

Keywords: scoring, marketing campaign, logistic regression, feature selection, backward, forward, gain chart, lift curve
Components: Supervised learning, Binary logistic regression, Select examples, Scoring, Lift curve, Forward-logit, Backward-logit
Tutorial: en_Tanagra_Variable_Selection_Binary_Logistic_Regression.pdf
Dataset: dataset_scoring_bank.xls
Reference: Statistical Society of Canada, "Data Mining - Case Studies - 2000"