Tanagra - Data Mining and Data Science Tutorials: December 2011

Saturday, December 31, 2011

Tanagra add-in for Excel 2010 - 64-bit version

The current Tanagra.xla add-in works with the 32-bit version of Excel (up to Excel 2010), even under a 64-bit version of Windows. It does not work, however, when we want to connect the 64-bit version of Excel to Tanagra. In that case, we must modify the add-in source code. These modifications are needed for Tanagra up to version 1.4.41. They will be included automatically in upcoming versions.

In this tutorial, we show the procedure to follow for this upgrade. The screenshots were taken with a French version of Excel 2007, but I think (I hope) that adapting the procedure to other versions (Excel 2010 and/or other languages) is easy.

Many thanks to Mrs. Nathalie Jourdan-Salloum, who pointed out this problem and suggested the right solution to me.

Keywords: data importation, xls, xlsx, excel file format, macro-complémentaire, add-in, addin, add-on
Tutorial: en_Tanagra_Addin_Excel_64_bit.pdf
References:
Tanagra, "Tanagra add-in for Office 2007 and Office 2010".

Sunday, December 11, 2011

Dealing with very large dataset (continuation)

Having recently upgraded my operating system (OS), I wondered how the 64-bit versions of Knime 2.4.2 and RapidMiner 5.1.011 would handle a very large dataset that cannot be loaded into main memory on a 32-bit OS. This article completes a previous study in which we dealt with a moderately sized dataset of 500,000 instances and 22 variables. Here, we handle a dataset with 9,634,198 instances and 41 variables. We have already used this dataset in another tutorial, where we showed that, on a 32-bit OS, we cannot induce a decision tree on this kind of database without a swapping system such as the one implemented in SIPINA. Note that Tanagra can handle the dataset, but only because it encodes the values of the categorical attributes with one byte each, so the memory occupation remains moderate.

In this tutorial, I analyze the behavior of the 64-bit versions of Knime and RapidMiner on this database. I use a 64-bit OS and 64-bit tools, but I have "only" 4 GB of memory available on my personal computer.
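
To get a feel for why the one-byte encoding matters, here is a rough back-of-the-envelope estimate in Python. It uses only the dataset dimensions given above (9,634,198 instances, 41 variables). The 8-bytes-per-value figure is a hypothetical point of comparison, not a statement about how Knime, RapidMiner or any other tool actually stores its data, and the estimate ignores every other overhead (indexes, object headers, working structures of the learning algorithm).

```python
# Rough estimate of the raw size of the data table in memory.
# Dimensions come from the text; everything else is an assumption.
N_INSTANCES = 9_634_198
N_VARIABLES = 41


def footprint_bytes(bytes_per_value: int) -> int:
    """Raw size of the table if each cell takes `bytes_per_value` bytes."""
    return N_INSTANCES * N_VARIABLES * bytes_per_value


def to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 2**20 bytes)."""
    return n_bytes / 2**20


# One byte per categorical value, as described for Tanagra.
one_byte = footprint_bytes(1)

# Hypothetical alternative: 8 bytes per value (e.g. a double or a 64-bit reference).
eight_bytes = footprint_bytes(8)

print(f"1 byte per value : {one_byte:,} bytes ({to_mib(one_byte):.1f} MiB)")
print(f"8 bytes per value: {eight_bytes:,} bytes ({to_mib(eight_bytes):.1f} MiB)")
```

With one byte per value, the raw table stays well under 1 GB. Under the hypothetical 8-byte representation, it would take up most of a 4 GB machine before the learning algorithm allocates anything.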

Keywords: very large dataset, decision tree, sampling, sipina, knime, rapidminer
Components: ID3
Tutorial: en_Tanagra_Tree_Very_Large_Dataset.pdf
Dataset: twice-kdd-cup-discretized-descriptors.zip
References:
Tanagra, "Dealing with very large dataset in Sipina".
Tanagra, "Decision tree and large dataset (continuation)".
Tanagra, "Decision tree and large dataset".
Tanagra, "Local sampling for decision tree learning".