Tanagra - Data Mining and Data Science Tutorials: February 2012

Sunday, February 19, 2012

Checking missing values in Tanagra

Up to version 1.4.41, Tanagra did not handle missing values. The intent was to force students, who are the main users of Tanagra, to think about the problem and to propose the solution best suited to the characteristics of their dataset and the goal of their analysis. Tanagra therefore simply truncated the imported file at the first obstacle. This behavior often disconcerted users, especially since no error message was displayed. They wondered why the data were not properly loaded when everything looked right.

From version 1.4.42, the import of the text file format (tab separator) and of the XLS file format (Excel 97-2003), as well as the data transfer using the add-in for Excel (up to Excel 2010) and for LibreOffice 3.5/OpenOffice 3.3, have been modified. Tanagra now reads all the rows of the dataset, but it skips rows that are incomplete or inconsistent (e.g. a column contains a numeric value whereas it is a discrete attribute). Above all, an explicit error message reports the number of deleted rows, so users are better informed.

In this tutorial, we show how missing data are handled when we send the data from Excel to Tanagra using the Tanagra.xla add-in. Some cells are empty in the Excel data range. This example illustrates the new behavior of Tanagra. We would get the same behavior if we imported the XLS file directly or if we imported the corresponding file in the TXT format.
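The skip-and-count behavior described above can be sketched outside Tanagra. The following Python snippet is only an illustration under stated assumptions, not Tanagra's actual code: the function names, the sample column names (`age`, `sex`) and the idea of passing the numeric columns explicitly are all hypothetical.

```python
import csv
import io


def is_valid(row, header, numeric_columns):
    """Return True if the row is complete and consistent with the column types."""
    if len(row) != len(header):
        return False  # wrong number of fields
    for name, value in zip(header, row):
        if value.strip() == "":
            return False  # missing value
        if name in numeric_columns:
            try:
                float(value)
            except ValueError:
                return False  # text found in a numeric column
    return True


def import_rows(text, numeric_columns):
    """Read tab-delimited text, keep the valid rows and count the skipped ones."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    kept, skipped = [], 0
    for row in reader:
        if is_valid(row, header, numeric_columns):
            kept.append(row)
        else:
            skipped += 1
    return header, kept, skipped


# Hypothetical sample: two missing cells and one inconsistent value.
sample = "age\tsex\n47\tM\n\tF\n52\t\nabc\tM\n38\tF\n"
header, kept, skipped = import_rows(sample, numeric_columns={"age"})
print(len(kept), "rows kept,", skipped, "rows skipped")
```

Reading every row and reporting the number of skipped ones is what distinguishes this from stopping at the first invalid row: the user can see how much data was lost and decide whether listwise deletion is acceptable.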

Keywords: missing values, missing data, inconsistent values, text file format importation, excel file format importation, add-in, tanagra.xla
Components: DATASET, VIEW DATASET
Tutorial: en_Tanagra_Missing_Data_Checking.pdf
Dataset: ronflement_with_missing_empty.zip
References:
Wikipedia, "Listwise deletion".
D.C. Howell, "Treatment of missing data".

Friday, February 10, 2012

Logistic regression on large dataset

Programming fast and reliable tools is a constant challenge for a computer scientist. In the data mining context, it translates into a better capacity to handle large datasets. When we build the final model that we want to deploy, speed is not really important. But in the exploratory phase, where we search for the best model, it is decisive: it improves our chance of obtaining the best model simply because we can try more configurations.

I have tried many solutions to improve the calculation time of the logistic regression. In fact, I think the performance rests heavily on the optimization algorithm used. The source code of Tanagra shows that I hesitated a great deal. Some studies helped me make the right choice.
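To give an idea of why the optimizer matters, here is a minimal Python sketch that fits the same one-variable logistic regression with plain gradient ascent and with Newton's method, and counts the iterations each one needs. It is an illustration only: the tiny dataset, the step size and the function names are assumptions, and it does not reproduce the algorithm implemented in Tanagra.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gradient(a, b, xs, ys):
    """Gradient of the log-likelihood for P(y=1|x) = sigmoid(a + b*x)."""
    ga = gb = 0.0
    for x, y in zip(xs, ys):
        r = y - sigmoid(a + b * x)
        ga += r
        gb += r * x
    return ga, gb


def fit_gradient_ascent(xs, ys, rate=0.02, tol=1e-6, max_iter=100000):
    """First-order method: many cheap iterations."""
    a = b = 0.0
    for it in range(1, max_iter + 1):
        ga, gb = gradient(a, b, xs, ys)
        if max(abs(ga), abs(gb)) < tol:
            return a, b, it
        a += rate * ga
        b += rate * gb
    return a, b, max_iter


def fit_newton(xs, ys, tol=1e-6, max_iter=100):
    """Second-order method: few iterations, each one needs the Hessian."""
    a = b = 0.0
    for it in range(1, max_iter + 1):
        ga, gb = gradient(a, b, xs, ys)
        if max(abs(ga), abs(gb)) < tol:
            return a, b, it
        # 2x2 information matrix (negative Hessian of the log-likelihood).
        haa = hab = hbb = 0.0
        for x in xs:
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            haa += w
            hab += w * x
            hbb += w * x * x
        det = haa * hbb - hab * hab
        # Newton step: solve the 2x2 system H * delta = gradient.
        a += (hbb * ga - hab * gb) / det
        b += (haa * gb - hab * ga) / det
    return a, b, max_iter


# Small, non-separable toy dataset (hypothetical values).
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
print("gradient ascent:", fit_gradient_ascent(xs, ys))
print("newton:", fit_newton(xs, ys))
```

Both functions reach the same coefficients, but the iteration counts differ widely. On a large dataset each iteration is a full pass over the data, so the trade-off between the number of iterations and the cost of each one largely determines the calculation time.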

Several tools offer logistic regression. It is interesting to compare their calculation times and memory occupation. I have already carried out this kind of comparison in the past. The novelty here is that I use a new operating system (the 64-bit version of Windows 7), and some tools are especially intended for this system. Their computing capabilities are greatly improved. For this reason, I have increased the dataset size. Moreover, to make the variable selection process more difficult, I added predictive attributes that are correlated with the original descriptors but not with the class attribute. They should not be selected in the final model.
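The post does not say how these additional attributes were generated. One possible construction, given here purely as a hypothetical sketch in Python, is to build each descriptor from a part that drives the class and a part that does not, then derive the decoy attribute from the second part only. All names and noise levels below are assumptions.

```python
import random


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)


def make_dataset(n, seed=0):
    """Generate a descriptor x, a decoy attribute and a binary class y."""
    rng = random.Random(seed)
    xs, decoys, ys = [], [], []
    for _ in range(n):
        signal = rng.gauss(0, 1)    # part of x that drives the class
        nuisance = rng.gauss(0, 1)  # part of x unrelated to the class
        x = signal + nuisance
        # The decoy shares only the nuisance part with x.
        decoy = nuisance + 0.5 * rng.gauss(0, 1)
        y = 1 if signal + 0.5 * rng.gauss(0, 1) > 0 else 0
        xs.append(x)
        decoys.append(decoy)
        ys.append(y)
    return xs, decoys, ys


xs, decoys, ys = make_dataset(5000)
print("corr(x, y)         =", round(pearson(xs, ys), 3))
print("corr(decoy, x)     =", round(pearson(decoys, xs), 3))
print("corr(decoy, y)     =", round(pearson(decoys, ys), 3))
```

A decoy built this way is clearly correlated with the descriptor but nearly uncorrelated with the class, which is the property described above: a selection procedure that is misled by the correlation between attributes may keep it by mistake.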

In this paper, in addition to Tanagra 1.4.14 (32 bit), we use R 2.13.2 (64 bit), Knime 2.4.2 (64 bit), Orange 2.0b (build 15 oct2011, 32 bit) and Weka 3.7.5 (64 bit).

Keywords: logistic regression, software comparison, glm, stepAIC, R software, knime, orange, weka
Components: BINARY LOGISTIC REGRESSION, FORWARD LOGIT
Tutorial: en_Tanagra_Perfs_Bis_Logistic_Reg.pdf
Dataset: perfs_bis_logistic_reg.zip
References:
Tanagra, "Logistic regression - Software comparison", December 2008.
T.P. Minka, "A comparison of numerical optimizers for logistic regression", 2007.

Saturday, February 4, 2012

Tanagra - Version 1.4.42

The Tanagra.xla add-in for Excel now works with both the 32-bit and 64-bit versions of Excel.

With the FastMM memory manager, Tanagra can address up to 3 GB under 32-bit Windows and 4 GB under 64-bit Windows. The processing capabilities, especially for handling large datasets, are improved.

The import of the tab-delimited text file format and of the XLS file format (Excel 97-2003) is made safer. Previously, the import was interrupted and the dataset was truncated when an invalid line (with missing or inconsistent values) was read. Now, Tanagra skips the line and continues with the next rows. The number of skipped lines is given in the import report.

Download page: setup