Sunday, April 3, 2016

Categorical predictors in logistic regression

The aim of logistic regression is to build a model that predicts a binary target attribute from a set of explanatory variables (predictors, independent variables), which may be numeric or categorical. Numeric predictors are used as they are. Categorical predictors must be recoded first. Dummy coding is undeniably the most popular approach in this context.
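To make the recoding concrete, here is a minimal sketch of dummy coding in plain Python. The function name `dummy_code`, the variable `chest_pain` and its level names are hypothetical and chosen only for illustration; they are not taken from the tutorial's dataset. A predictor with k levels is turned into k-1 indicator columns, with the omitted level acting as the reference.

```python
def dummy_code(values, reference=None):
    """Recode a categorical variable with k levels into k-1 indicator columns.

    The reference level gets no column of its own: an observation at the
    reference level is coded as all zeros.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # assumption: default to the first level alphabetically
    if reference not in levels:
        raise ValueError(f"unknown reference level: {reference!r}")
    kept = [level for level in levels if level != reference]
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return kept, rows


# Hypothetical toy values, for illustration only
chest_pain = ["asympt", "typical", "atypical", "asympt", "non_anginal"]
names, rows = dummy_code(chest_pain, reference="asympt")

print(names)
for value, row in zip(chest_pain, rows):
    print(value, row)
```

In a logistic regression fitted on these columns, each coefficient is read as the effect of its level relative to the reference level.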

The situation becomes more complicated when we perform feature selection. The idea is to identify the predictors that contribute significantly to the explanation of the target attribute. A numeric variable raises no difficulty: it is either excluded from the model or kept in it. But how should we proceed with a categorical explanatory variable? Should we treat the dichotomous variables associated with a categorical predictor as a block that is included in or excluded from the model as a whole? Or should we treat each dichotomous variable independently? And in the latter case, how do we interpret the coefficients of the selected dichotomous variables?
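The two options can be pictured as two different sets of candidates offered to a selection procedure. The sketch below is only an illustration of that distinction, not the algorithm of any of the tools discussed here; the function `candidate_units` and all column names are hypothetical.

```python
def candidate_units(numeric, dummy_groups, grouped=True):
    """List the units a selection procedure may add to or drop from the model.

    numeric      : names of numeric predictors (each one is always its own unit)
    dummy_groups : dict mapping a categorical predictor to its dummy columns
    grouped      : True  -> all dummies of a predictor enter or leave together
                   False -> each dummy is tested on its own
    """
    units = [[name] for name in numeric]
    for dummies in dummy_groups.values():
        if grouped:
            units.append(list(dummies))
        else:
            units.extend([dummy] for dummy in dummies)
    return units


# Hypothetical column names, for illustration only
numeric = ["age"]
dummy_groups = {"chest_pain": ["cp_atypical", "cp_non_anginal", "cp_typical"]}

as_blocks = candidate_units(numeric, dummy_groups, grouped=True)
one_by_one = candidate_units(numeric, dummy_groups, grouped=False)

print(as_blocks)
print(one_by_one)
```

With the block strategy, a categorical predictor is judged on all of its indicators at once. With the one-by-one strategy, a dummy left out of the model has an implicit coefficient of zero, so its level is effectively merged with the reference level, which changes how the remaining coefficients are read.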

In this tutorial, we study the approaches proposed by various tools: R 3.1.2, SAS 9.3, Tanagra 1.4.50 and SPAD 8.0. We will see that the feature selection algorithms rely on different criteria depending on the software. We will also see that they handle categorical predictor variables in different ways.

Keywords: logistic regression, dummy coding, categorical predictor variables, feature selection
Tutorial: Feature selection - Categorical predictors - Logistic Regression
Dataset: heart-c.xlsx 
Wikipedia, "Logistic Regression"