Saturday, August 6, 2011

The CART method under Tanagra and R (rpart)

CART (Breiman and al., 1984) is a very popular classification tree (says also decision tree) learning algorithm. Rightly. CART incorporates all the ingredients of a good learning control: the post-pruning process enables to make the trade-off between the bias and the variance; the cost complexity mechanism enables to "smooth" the exploration of the space of solutions; we can control the preference for simplicity with the standard error rule (SE-rule); etc. Thus, the data miner can adjust the settings according to the goal of the study and the data characteristics.

The Breiman's algorithm is provided under different designations in the free data mining tools. Tanagra uses the "C-RT" name. R, through a specific package , provides the "rpart" function.

In this tutorial, we describe these implementations of the CART approach according to the original book (Breiman and al., 1984; chapters 3, 10 and 11). The main difference between them is the implementation of the post-pruning process. Tanagra uses a specific sample says "pruning set" (section 11.4); when rpart is based on the cross-validation principle (section 11.5) .

Keywords: decision tree, classification tree, recursive partitioning, cart, R software, rpart package
Tutorial: en_Tanagra_R_CART_algorithm.pdf
Dataset: wave5300.xls
Breiman, J. Friedman, R. Olsen, C. Stone, Classification and Regression Trees, Chapman & Hall, 1984.
"The R project for Statistical Computing" -