Wednesday, December 30, 2015

Random Forest - Boosting with R and Python

This tutorial accompanies the slideshow "Bagging, Random Forest and Boosting". We show how to apply these methods to a data file, following the same steps as the slideshow: we first build a decision tree and measure its prediction performance, then we see how ensemble methods can improve the results. Various aspects of these methods are highlighted: the measurement of variable importance, the influence of the parameters, the influence of the characteristics of the underlying classifier (e.g. controlling the tree size), etc.
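The steps above can be sketched in Python with scikit-learn. This is only an illustrative sketch, not the tutorial's own code: the data are synthetic (generated with make_classification as a stand-in for the tutorial's data file), and the number of trees, the tree depth and the train/test split are arbitrary choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data standing in for the tutorial's data file.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A single tree as the baseline, then three ensembles of trees.
models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "bagging": BaggingClassifier(DecisionTreeClassifier(random_state=0),
                                 n_estimators=50, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # Boosting is shown here with decision stumps (depth-1 trees).
    "boosting": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                   n_estimators=50, random_state=0),
}

# Fit each model on the training set and measure its test error rate.
error_rates = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    error_rates[name] = 1.0 - model.score(X_test, y_test)
    print(f"{name}: test error rate = {error_rates[name]:.3f}")

# Variable importance as reported by the random forest (sums to 1).
importances = models["random forest"].feature_importances_
```

On most datasets the ensembles are expected to reach a lower test error than the single tree, but the size of the gap depends on the data and on the settings chosen.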

As a first step, we focus on R (rpart, adabag and randomForest packages) and Python (scikit-learn package). Programming lets us run many analyses; among other things, we can evaluate the influence of the parameters on performance. As a second step, we explore the capabilities of software (Tanagra and Knime) that provides turnkey solutions, which are very simple to use and more accessible to people who do not like programming.
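As an illustration of how programming makes it easy to study the influence of a parameter, the sketch below varies the number of trees of a random forest and estimates the error rate by cross-validation. It is a hedged example on synthetic data (an assumption, not the tutorial's data file), and the grid of values is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assumption: synthetic data standing in for the tutorial's data file.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# Estimate the cross-validated error rate for several forest sizes.
cv_error = {}
for n_trees in (1, 5, 10, 50, 100):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    accuracy = cross_val_score(forest, X, y, cv=5).mean()
    cv_error[n_trees] = 1.0 - accuracy
    print(f"{n_trees:3d} trees: cross-validated error rate = {cv_error[n_trees]:.3f}")
```

The same loop can be reused for other parameters, for instance the maximum depth of the underlying trees, by changing the argument that is varied.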

Keywords: R software, R programming, decision tree, classification tree, adabag package, rpart package, randomForest package, Python, scikit-learn package, bagging, boosting, random forest
Components: BAGGING, RND TREE, BOOSTING, C4.5, DISCRETE SELECT EXAMPLES
Tutorial: Bagging, Random Forest and Boosting
Files: randomforest_boosting_en.zip
References:
R. Rakotomalala, "Bagging, Random Forest, Boosting (slides)", December 2015.