Friday, May 1, 2009

ID3 on a large dataset

In the data mining domain, the increasing size of datasets has been one of the major challenges of recent years. The ability to handle large datasets is an important criterion distinguishing research software from commercial software.

Commercial tools often have very efficient data management systems that limit the amount of data loaded into memory at each step of the processing. Research tools, by contrast, keep all the data in memory, so the memory capacity of the machine is the clear limit in this context. This is certainly a drawback for the treatment of large files. We note, however, that very powerful computers are now available at low cost, so this limit is constantly pushed back. With an appropriate encoding strategy, we can fit the whole dataset in memory, even when we handle a large data file.
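To make the idea concrete, here is a minimal sketch (in Python, and not Tanagra's actual storage scheme) of such an encoding strategy: categorical values are stored as small integer codes plus a lookup table, rather than as repeated string objects. The attribute values below are illustrative only.

```python
# Minimal sketch of a compact in-memory encoding: categorical columns are
# stored as uint8 codes instead of strings. Not Tanagra's actual scheme.
import numpy as np

def encode_categorical(values):
    """Map raw string labels to uint8 codes plus a lookup table."""
    labels = sorted(set(values))
    assert len(labels) <= 256, "uint8 encoding supports at most 256 levels"
    code_of = {label: i for i, label in enumerate(labels)}
    codes = np.fromiter((code_of[v] for v in values), dtype=np.uint8,
                        count=len(values))
    return codes, labels

# Example: ~600,000 values of a categorical attribute occupy about 0.6 MB
# as uint8 codes, versus tens of MB as distinct Python string objects.
raw = ["Spruce-Fir", "Lodgepole-Pine", "Spruce-Fir"] * 200_000
codes, labels = encode_categorical(raw)
print(codes.nbytes / 1e6, "MB for", len(codes), "values")
```

Stored this way, a table of 581,012 rows and 55 categorical columns needs on the order of 30 MB, which fits comfortably in the memory of an ordinary machine.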

In this tutorial, we show how to import a file with 581,012 observations and 55 variables, and then how to build a decision tree with the ID3 method. Compared with other decision tree algorithms such as C4.5 or CART, ID3 determines the right size of the tree with a pre-pruning rule. We will see that the computation is fast because of this characteristic.
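The sketch below shows why pre-pruning keeps the computation fast: growth stops as soon as a node is pure, too small, or yields too little information gain, so no separate post-pruning pass over the tree is needed. This is a generic illustration of the ID3 scheme; the stopping thresholds (min_size, min_gain) are illustrative, not Tanagra's defaults.

```python
# Minimal ID3 sketch with pre-pruning rules. Illustrative only.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, attributes, min_size=5000, min_gain=0.01):
    """rows: list of dicts of categorical attributes; labels: class values."""
    majority = Counter(labels).most_common(1)[0][0]
    # Pre-pruning: stop on small or pure nodes instead of pruning afterwards.
    if len(rows) < min_size or entropy(labels) == 0.0 or not attributes:
        return majority
    base = entropy(labels)
    best_attr, best_gain, best_parts = None, min_gain, None
    for attr in attributes:
        parts = {}
        for row, y in zip(rows, labels):
            parts.setdefault(row[attr], ([], []))
            parts[row[attr]][0].append(row)
            parts[row[attr]][1].append(y)
        gain = base - sum(len(ys) / len(labels) * entropy(ys)
                          for _, ys in parts.values())
        if gain > best_gain:
            best_attr, best_gain, best_parts = attr, gain, parts
    if best_attr is None:  # pre-pruning: no split gains enough, stop here
        return majority
    rest = [a for a in attributes if a != best_attr]
    return {best_attr: {v: id3(rs, ys, rest, min_size, min_gain)
                        for v, (rs, ys) in best_parts.items()}}
```

With min_size set to a few thousand rows, the tree stays shallow even on 581,012 observations, since each level of recursion only makes a single pass over the data of the current node.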

Keywords: large dataset, decision tree algorithm, ID3
Components: ID3, SPV LEARNING
Tutorial: en_Tanagra_Big_Dataset.pdf
Dataset: covtype.zip
References:
Tanagra Tutorials, "Performance comparison under Linux"
Tanagra Tutorials, "Decision tree and large dataset"