Friday, February 10, 2012

Logistic regression on a large dataset

Programming fast and reliable tools is a constant challenge for computer scientists. In the data mining context, it translates into a better capacity to handle large datasets. When we build the final model that we want to deploy, speed is not really important. But in the exploratory phase, where we search for the best model, it is decisive: it improves our chances of finding the best model simply because we can try more configurations.

I have tried many solutions to improve the computation time of logistic regression. In fact, I think the performance rests heavily on the optimization algorithm used. The source code of Tanagra shows how much I hesitated. Some studies helped me to make the right choice.

Several tools offer logistic regression. It is interesting to compare their computation times and memory usage. I have already carried out this kind of comparison in the past. The novelty here is that I use a new operating system (the 64-bit version of Windows 7), and some tools are specifically built for it, so their computing capabilities are greatly improved. For this reason, I have increased the dataset size. Moreover, to make the variable selection process more difficult, I added predictive attributes that are correlated with the original descriptors but not with the class attribute. They should not be selected in the final model; a sketch of this setup is given below.
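As an illustration only, here is a minimal R sketch of how such correlated but irrelevant attributes can be generated. The variable names, the number of rows and the number of added columns are assumptions for the example, not the exact recipe used to build the dataset distributed with the tutorial.

# Assumed setup: add noisy copies of existing descriptors.
# They are correlated with the original columns but carry no information
# about the class attribute, so a good selection procedure should drop them.
set.seed(123)

n <- 500000          # assumed number of rows
p <- 10              # assumed number of original descriptors

# original descriptors and a class attribute that depends only on them
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X) <- paste0("x", 1:p)
score <- X %*% runif(p, -1, 1)
y <- factor(ifelse(runif(n) < 1 / (1 + exp(-score)), "pos", "neg"))

# correlated but irrelevant attributes: original columns plus noise
noise <- matrix(rnorm(n * p, sd = 0.5), nrow = n, ncol = p)
Z <- X + noise
colnames(Z) <- paste0("corr_x", 1:p)

dataset <- data.frame(y = y, X, Z)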

In this paper, in addition to Tanagra 1.4.14 (32-bit), we use R 2.13.2 (64-bit), Knime 2.4.2 (64-bit), Orange 2.0b (build 15 Oct 2011, 32-bit) and Weka 3.7.5 (64-bit).
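On the R side, the comparison relies on glm() for the binary logistic regression and on stepAIC() from the MASS package for the stepwise selection. The sketch below shows the kind of call involved and how the computation time can be measured with system.time(); the data file name and the name of the class attribute (y) are assumptions for the example.

# Minimal sketch of the R side of the comparison
# (file name, separator and class attribute name are assumptions).
library(MASS)

dataset <- read.table("perfs_bis_logistic_reg.txt", header = TRUE, sep = "\t")

# binary logistic regression on all descriptors
full_model <- glm(y ~ ., data = dataset, family = binomial)

# forward selection driven by AIC, starting from the intercept-only model
null_model <- glm(y ~ 1, data = dataset, family = binomial)
timing <- system.time(
  selected <- stepAIC(null_model,
                      scope = list(lower = null_model, upper = full_model),
                      direction = "forward", trace = FALSE)
)
print(timing)          # elapsed computation time
summary(selected)      # coefficients of the selected model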

Keywords: logistic regression, software comparison, glm, stepAIC, R software, knime, orange, weka
Components: BINARY LOGISTIC REGRESSION, FORWARD LOGIT
Tutorial: en_Tanagra_Perfs_Bis_Logistic_Reg.pdf
Dataset: perfs_bis_logistic_reg.zip
References:
Tanagra, "Logistic regression - Software comparison", december 2008.
T.P. Minka, « A comparison of numerical optimizers for logistic regression », 2007.