Saturday, October 29, 2011

Decision tree and large dataset (continuation)

One of the exciting aspects of computing is that things change very quickly: machines become more efficient, and operating systems and software keep improving. Since writing an earlier tutorial about decision tree induction on a large dataset, I have acquired a new computer and now use a 64-bit OS (Windows 7). Some of the tools studied offer a 64-bit version (Knime, RapidMiner, R). I wondered how the various tools would behave in this new context, so I repeated the same experiment.

We note that a more powerful computer reduces the computation time (by about 20%). The additional gain from a 64-bit version is relatively small, but it is real (about 10%). Some tools have clearly improved their implementation of decision tree induction (Knime, RapidMiner). On the other hand, memory usage remains stable for most of the tools in this new context.
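To give an idea of the measurement protocol, here is a minimal sketch of how such a timing could be obtained in R, one of the tools studied. It assumes the archive unpacks to a tab-delimited file named "wave500k.txt" with a header row and a class column named "class" (the file name, layout and column name are assumptions; the exact settings used are in the tutorial PDF). rpart, a CART-style learner rather than C4.5 itself, is used here purely for illustration.

# Minimal timing sketch in R, assuming a tab-delimited file
# "wave500k.txt" with a header row and a class column named "class"
# (assumptions; see the tutorial PDF for the actual settings).
library(rpart)   # CART-style decision trees, bundled with R

wave <- read.table("wave500k.txt", header = TRUE, sep = "\t")

# Time the tree induction alone, excluding the data loading step.
timing <- system.time(
  tree <- rpart(class ~ ., data = wave, method = "class")
)
print(timing["elapsed"])   # elapsed seconds for the induction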

Keywords: c4.5, decision tree, large dataset, wave dataset, knime 2.4.2, orange 2.0b, r 2.13.2, rapidminer 5.1.011, sipina 3.7, tanagra 1.4.41, weka 3.7.4, windows 7 (64-bit)
Components: SUPERVISED LEARNING, C4.5
Tutorial: en_Tanagra_Perfs_Comp_Decision_Tree_Suite.pdf
Screenshots: Experiment screenshots.
Dataset: wave500k.zip 
References:
Tanagra, "Decision tree and large dataset".
R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1993.