Thursday, November 13, 2008

Decision tree and large dataset

Dealing with large datasets is one of the most important challenges in data mining. In this context, it is interesting to analyze and compare the performance of various free implementations of the learning methods, especially their computation time and memory occupation. Most of the programs load the whole dataset into memory, so the available memory is the main bottleneck.

In this tutorial, we compare the performance of several implementations of the C4.5 algorithm (Quinlan, 1993) when processing a file containing 500,000 observations and 22 variables. The programs used are: Knime 1.3.5; Orange 1.0b2; R (rpart package) 2.6.0; RapidMiner Community Edition; Sipina Research; Tanagra 1.4.27; Weka 3.5.6.
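As an illustration of the kind of measurement involved, here is a minimal R sketch using the rpart package (one of the compared tools); it assumes the data have been exported to a tab-delimited file named "wave500k.txt" with the class attribute in a column named CLASS (file name and column name are assumptions, not necessarily the tutorial's exact setup).

  # minimal sketch: load the file and time the tree induction
  library(rpart)

  # loading the whole dataset into memory -- this is where memory occupation matters
  wave <- read.table("wave500k.txt", header = TRUE, sep = "\t")

  # measure the computation time of the learning step
  elapsed <- system.time(
    tree <- rpart(CLASS ~ ., data = wave, method = "class")
  )
  print(elapsed)
  print(tree)

Note that rpart implements a CART-like algorithm rather than C4.5 proper; in the comparison it simply stands in as the decision tree learner available in R.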

Our data file is the well-known artificial waveform dataset described in the CART book (Breiman et al., 1984). We have generated a dataset with 500,000 observations. The class attribute has 3 values, and there are 21 continuous predictors.
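For readers who want to build a comparable file themselves, the waveform generator of the R package mlbench produces the same kind of data (21 continuous predictors, 3 classes). The sketch below is one possible way to generate and export such a file; it is not necessarily how wave500k.zip was produced.

  # possible way to generate a comparable waveform dataset in R
  library(mlbench)

  set.seed(1)
  w <- mlbench.waveform(500000)            # 21 continuous predictors, 3 classes
  wave <- data.frame(w$x, CLASS = w$classes)

  # export to a tab-delimited text file usable by the other tools
  write.table(wave, file = "wave500k.txt", sep = "\t",
              row.names = FALSE, quote = FALSE)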

Keywords: c4.5, decision tree, classification tree, large dataset, knime, orange, r, rapidminer, sipina, tanagra, weka
Components: SUPERVISED LEARNING, C4.5
Tutorial: en_Tanagra_Perfs_Comp_Decision_Tree.pdf
Dataset: wave500k.zip
Reference: R. Quinlan, « C4.5: Programs for Machine Learning », Morgan Kaufmann, 1993.