Tuesday, January 19, 2010

Tanagra - Version 1.4.35

CTP. The method for detecting the right size of the tree has been modified in the "Clustering Tree" with post-pruning component (CTP). It now relies both on the angle between the half-lines at each point of the decreasing WSS (within-group sum of squares) curve computed on the growing sample, and on the decrease of the same indicator computed on the pruning sample. Compared with the previous implementation, it results in a smaller number of clusters.
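
As a rough illustration of the angle criterion, here is a minimal sketch in Python (not Tanagra's actual code; it assumes the WSS curve is given as a 1-D numpy array indexed by the number of clusters, and rescales both axes so that the angle does not depend on the units of the WSS):

import numpy as np

def elbow_angles(wss):
    # Angle (in degrees) between the two half-lines joining each point of
    # the WSS curve to its neighbors; both axes are rescaled to [0, 1] first.
    k = np.linspace(0.0, 1.0, len(wss))
    w = (wss - wss.min()) / (wss.max() - wss.min())
    angles = np.full(len(wss), 180.0)
    for i in range(1, len(wss) - 1):
        v1 = np.array([k[i - 1] - k[i], w[i - 1] - w[i]])
        v2 = np.array([k[i + 1] - k[i], w[i + 1] - w[i]])
        c = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
        angles[i] = np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
    return angles

A sharp elbow (small angle) on the growing sample, confirmed by a WSS which stops decreasing on the pruning sample, is a natural candidate for the number of clusters.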

Regression Tree. The previous modification has been incorporated into the Regression Tree component, which is a univariate version of CTP.

C-RT Regression Tree. A new regression tree component has been added. It faithfully implements the technique described in the book by Breiman et al. (1984), including the post-pruning part with the 1-SE rule (Chapter 8, especially p. 226 for the formula of the variance of the MSE).
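
For reference, the rule can be stated as follows (our notation; the standard-error formula below is the usual estimate for the variance of a mean of squared residuals, which is, to our reading, the quantity discussed on p. 226). With $R(T) = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - d_T(x_i)\bigr)^2$ the MSE of tree $T$ on the test sample:

$$\widehat{SE}\bigl(R(T)\bigr) = \sqrt{\frac{1}{N}\left[\frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - d_T(x_i)\bigr)^4 - R(T)^2\right]}$$

The 1-SE rule then selects the smallest pruned tree $T^*$ such that $R(T^*) \le R(T_{min}) + \widehat{SE}\bigl(R(T_{min})\bigr)$, where $T_{min}$ is the pruned tree minimizing $R$.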

C-RT. The report of the C-RT decision tree induction has been enhanced. Based on the last column of the post-pruning table, it becomes easier to choose the parameter x (in the x-SE rule) that lets the user set the size of the pruned tree.
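
To make the role of x concrete, here is a minimal sketch (a hypothetical helper, not Tanagra's API) of the x-SE selection from such a post-pruning table:

def select_tree_x_se(pruning_table, x=1.0):
    # pruning_table: list of (n_leaves, error, std_error) rows, one row per
    # candidate pruned tree. Returns the number of leaves of the smallest
    # tree whose error is within x standard errors of the minimal error.
    err_min, se_min = min((err, se) for _, err, se in pruning_table)
    return min(n for n, err, _ in pruning_table if err <= err_min + x * se_min)

With x = 0 we obtain the error-minimizing tree; increasing x yields smaller and smaller trees.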

Tutorials describing these various changes will be available soon.

Monday, January 4, 2010

Dealing with very large datasets in Sipina

The ability to handle large databases is a crucial issue in the data mining context: we want to process large datasets in order to detect hidden information. Most free data mining tools struggle with large datasets because they load all the instances and variables into main memory; the available memory is thus their main limitation.

To overcome this limitation, we must design solutions that copy all or part of the data to disk, and perform the computations by loading into memory only what is necessary at each step of the algorithm (the instances and/or the variables). The solution is theoretically simple, but difficult in practice: the processing time must remain reasonable even though the number of disk accesses increases. It is very difficult to implement a strategy that is effective regardless of the learning algorithm (supervised learning, clustering, factor analysis, etc.), because they handle the data in very different ways: some rely heavily on matrix operations, while others mainly count the co-occurrences of attribute-value pairs, etc.
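
As an illustration of the general idea (this is not Sipina's mechanism, only a sketch in Python; the file and column names are placeholders), the following computes the class distribution associated with one attribute-value pair while keeping only two columns and one chunk of rows in memory at a time:

import pandas as pd

counts = {}
for chunk in pd.read_csv("very_large_dataset.csv",
                         usecols=["attribute", "class"],
                         chunksize=500_000):
    # count the class labels among the rows where attribute == "value"
    sub = chunk.loc[chunk["attribute"] == "value", "class"]
    for label, n in sub.value_counts().items():
        counts[label] = counts.get(label, 0) + int(n)

Memory usage is bounded by the chunk size, whatever the number of rows in the file.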

In this tutorial, we present a specific solution in the decision tree induction context. The solution is integrated into SIPINA (as an option) because its internal data structure is especially intended for decision tree induction. In this context, it was easy to develop an approach which takes advantage of the specificities of the learning algorithm. We show that it is then possible to handle a very large dataset (41 variables and 9,634,198 observations) and to use all the functionalities of the tool (interactive construction of the tree, local descriptive statistics on the nodes, etc.).

To fully appreciate the solution proposed by Sipina, we compare its behavior with that of generalist data mining tools such as Tanagra 1.4.33 and Knime 2.0.3.

Keywords: very large dataset, decision tree, sampling, sipina, knime
Components: ID3
Tutorial: en_Sipina_Large_Dataset.pdf
Dataset: twice-kdd-cup-discretized-descriptors.zip
References:
Tanagra, "Decision tree and large dataset".
Tanagra, "Local sampling for decision tree learning".

Saturday, January 2, 2010

CART - Determining the right size of the tree

Determining the appropriate size of the tree is a crucial task in the decision tree learning process. It determines the performance of the tree during its deployment on the population (the generalization process). There are two situations to avoid: the under-sized tree, which poorly captures the relevant information in the training set; and the over-sized tree, which captures specificities of the training set that are not relevant to the population. In both cases, the prediction model performs poorly during the generalization phase.

Among the many variants of decision tree learning algorithms, CART is probably the one that best detects the right size of the tree.

In this tutorial, we describe the selection mechanism used by CART during the post-pruning process. We also show how to set the appropriate value of the parameter of the algorithm in order to obtain a specific (user-defined) tree.
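
Tanagra's C-RT implements this mechanism internally. As an outside illustration only (a sketch with scikit-learn on synthetic data, not Tanagra's code), the same selection can be reproduced from the cost-complexity pruning path, with the standard error of the test error rate estimated as that of a proportion:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_grow, X_test, y_grow, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# nested sequence of pruned trees (cost-complexity path), as in CART
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_grow, y_grow)

results = []
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_grow, y_grow)
    err = 1.0 - tree.score(X_test, y_test)  # test error rate
    results.append((alpha, tree.get_n_leaves(), err))

err_min = min(err for _, _, err in results)
se_min = np.sqrt(err_min * (1.0 - err_min) / len(y_test))  # SE of a proportion

# 1-SE rule: the largest alpha (thus the smallest tree) within one SE of the minimum
alpha_1se, n_leaves, err = max(r for r in results if r[2] <= err_min + se_min)
print(n_leaves, err)

Replacing the "one SE" threshold by err_min + x * se_min gives the x-SE variant discussed in the January 19 note above.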

Keywords: decision tree, CART, 1-SE Rule, post-pruning
Components: Discrete select examples, Supervised Learning, C-RT, Test
Tutorial: en_Tanagra_Tree_Post_Pruning.pdf
Dataset: adult_cart_decision_trees.zip
References:
L. Breiman, J. Friedman, R. Olshen, C. Stone, "Classification and Regression Trees", Wadsworth International, California, 1984.
R. Rakotomalala, "Arbres de décision", Revue Modulad, 33, 163-187, 2005 (tutoriel_arbre_revue_modulad_33.pdf).