Thursday, December 9, 2010
Tanagra, which is an academic tool, also provides text outputs. This keeps the programming simple, as a glance at the source code shows. But, to make the presentation more attractive, it uses HTML to format the results. I take advantage of this feature to generate reports without any particular programming effort. Tanagra is one of the few academic tools able to produce reports that can easily be displayed in office software. For instance, the tables can be copied into Excel spreadsheets for further calculations. More generally, the results can be viewed in a browser, independently of the data mining software.
In this tutorial, we present these reporting features of Tanagra.
Keywords: reporting, decision tree, c4.5, logistic regression, binary coding, roc curve, learning sample, test sample, forward, feature selection
Components: GROUP CHARACTERIZATION, SAMPLING, C4.5, TEST, 0_1_BINARIZE, FORWARD-LOGIT, BINARY LOGISTIC REGRESSION, SCORING, ROC CURVE
Dataset: heart disease
Wednesday, November 24, 2010
Currently, few free tools exploit this opportunity because it is impossible to define a generic approach that would be valid regardless of the learning method used. We must modify each existing learning algorithm. For a given technique, decomposing an algorithm into elementary tasks that can execute in parallel is a research field in itself. In a second step, we must adopt a programming technology which is easy to implement.
In this tutorial, I propose a technology based on threads for the induction of decision trees. It is well suited to our context for various reasons. (1) It is easy to program with modern programming languages. (2) Threads can share information; they can also modify common objects. Efficient synchronization tools make it possible to avoid data corruption. (3) We can launch multiple threads on a single-core, single-processor system. It is not really advantageous, but at least the system does not crash. (4) On a multiprocessor or multi-core system, the threads actually run at the same time, with each processor or core running a particular thread. But, because the threads must be synchronized, the computation time is not divided by the number of cores.
First, we briefly present how the decision tree learning algorithm is modified in order to benefit from multithreading. Then, we show how to implement the approach with SIPINA (version 3.5 and later). We also show that multithreaded decision tree learners are available in various tools such as Knime 2.2.2 or RapidMiner 5.0.011. Last, we study the behavior of the multithreaded algorithms according to the dataset characteristics.
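As a rough illustration of the idea (not the SIPINA implementation), one way to use threads in tree induction is to evaluate the candidate split attributes of a node in separate tasks that only read the shared data. The sketch below is in Python; the function names, the toy data and the use of information gain as the splitting criterion are assumptions made for the example. In CPython, the global interpreter lock limits the real parallelism of pure-Python code, so the sketch shows the structure of the computation rather than the speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(column, labels):
    """Entropy reduction obtained by splitting on a discrete attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_split(columns, labels, n_threads=2):
    """Evaluate each candidate attribute in its own task.

    The tasks only read `columns` and `labels`, so no lock is needed here.
    """
    names = list(columns)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        gains = list(pool.map(
            lambda name: information_gain(columns[name], labels), names))
    best = max(range(len(names)), key=lambda i: gains[i])
    return names[best], gains[best]


# Hypothetical toy node: "outlook" separates the classes, "windy" does not.
columns = {
    "outlook": ["sun", "sun", "rain", "rain"],
    "windy": ["yes", "no", "yes", "no"],
}
labels = ["no", "no", "yes", "yes"]
best_name, best_gain = best_split(columns, labels)
```

Synchronization becomes necessary as soon as the tasks write to a common object (for instance, a shared table of counts), which is one reason why the computation time is not simply divided by the number of cores.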
Keywords: multithreading, thread, threads, decision tree, chaid, sipina 3.5, knime 2.2.2, rapidminer 5.0.011
Wikipedia, "Decision tree learning"
Wikipedia, "Thread (Computer science)"
Aldinucci, Ruggieri, Torquati, "Porting Decision Tree Algorithms to Multicore using FastFlow", PKDD 2010.
Thursday, November 11, 2010
But an obstacle to the use of the naive bayes classifier remains when we deal with a real problem. It seems that we cannot provide an explicit model for its deployment. The representation proposed by the PMML standard, for instance, is particularly unattractive. Interpreting the model, especially detecting the influence of each descriptor on the prediction of the classes, seems impossible.
This assertion is not entirely true. We showed in a previous tutorial that we can extract an explicit model from the naive bayes classifier in the case of discrete predictors (see references). We obtain a linear combination of the binarized predictors. In this document, we show that the same mechanism can be implemented for continuous descriptors. We use the standard Gaussian assumption for the conditional distribution of the descriptors. Depending on whether we adopt the heteroscedastic or the homoscedastic assumption, we obtain a quadratic model or a linear model, respectively. The latter is especially interesting because we can compare it directly to the other linear classifiers (the signs and the values of the coefficients of the linear combination).
This tutorial is organized as follows. In the next section, we describe the approach. In section 3, we show how to implement the method with Tanagra 1.4.37 (and later). We compare the results to those of the other linear methods. In section 4, we compare the results provided by various data mining tools. We note that none of them proposes an explicit model that would be easy to deploy. They give only the estimated parameters of the conditional Gaussian distributions (mean and standard deviation). Last, in section 5, we show the advantage of the naive bayes classifier over the other linear methods when we handle a large dataset (the "mutant" dataset: 16,592 instances and 5,408 predictors). The computation time and the memory occupancy are clearly advantageous.
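To make the linear form concrete, here is a minimal Python sketch for the two-class case under the homoscedastic assumption. With class means m1 and m0 and a common variance s2 for a descriptor, the Gaussian log-densities differ by x * (m1 - m0) / s2 - (m1^2 - m0^2) / (2 * s2), so the log-odds is a linear function of the descriptors. The function name, the pooled variance estimator (division by n - 2) and the toy data are assumptions made for the example; Tanagra's exact estimators may differ.

```python
from math import log


def nb_linear_model(X, y):
    """Binary Gaussian naive Bayes with one pooled variance per descriptor.

    Returns (intercept, coefficients) of the log-odds of the larger class label.
    """
    classes = sorted(set(y))
    assert len(classes) == 2, "this sketch handles two classes only"
    neg, pos = classes
    n, p = len(X), len(X[0])
    rows = {c: [x for x, t in zip(X, y) if t == c] for c in classes}
    means = {c: [sum(r[j] for r in rows[c]) / len(rows[c]) for j in range(p)]
             for c in classes}
    # Pooled within-class variance of each descriptor (assumed estimator).
    var = [sum((x[j] - means[t][j]) ** 2 for x, t in zip(X, y)) / (n - 2)
           for j in range(p)]
    coefs = [(means[pos][j] - means[neg][j]) / var[j] for j in range(p)]
    intercept = log(len(rows[pos]) / len(rows[neg])) - sum(
        (means[pos][j] ** 2 - means[neg][j] ** 2) / (2 * var[j])
        for j in range(p))
    return intercept, coefs


# Hypothetical toy data: one descriptor, two balanced classes.
X = [[1.0], [2.0], [3.0], [5.0], [6.0], [7.0]]
y = [0, 0, 0, 1, 1, 1]
intercept, coefs = nb_linear_model(X, y)
# The score intercept + coefs[0] * x is zero at x = 4, halfway between the class means.
```

Because the output is an intercept and one coefficient per descriptor, it can be set side by side with the coefficients of a logistic regression or a linear discriminant analysis.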
Keywords: naive bayes classifier, rapidminer 5.0.10, weka 3.7.2, knime 2.2.2, R software, package e1071, linear discriminant analysis, pls discriminant analysis, linear svm, logistic regression
Components : NAIVE BAYES CONTINUOUS, BINARY LOGISTIC REGRESSION, SVM, C-PLS, LINEAR DISCRIMINANT ANALYSIS
Dataset: breast ; low birth weight
Wikipedia, "Naive bayes classifier"
Tanagra, "Naive bayes classifier for discrete predictors"
Tuesday, October 19, 2010
Enhancement of the reporting module.
Thursday, October 14, 2010
In this tutorial, we are interested in correlation-based filter approaches for discrete predictors. The goal is to highlight the most relevant subset of predictors, i.e. those which are highly correlated with the target attribute and, at the same time, weakly correlated with each other (not redundant). To evaluate the behavior of the various methods, we use an artificial dataset to which we add irrelevant and redundant candidate variables. Then, we perform a feature selection with each of the approaches analyzed. We compare the generalization error rate of the naive bayes classifier learned from the various subsets of selected variables. We first conduct the experiment with Tanagra. Then, we show how to perform the same analysis with other tools (Weka 3.6.0, Orange 2.0b, RapidMiner 4.6.0, R 2.9.2 with the FSelector package).
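The notion of "correlation" between discrete variables can be illustrated with the symmetric uncertainty, the measure used by FCBF. The Python sketch below is only an illustration: the function names and the toy variables are assumptions, and the other filters listed in this entry rely on their own criteria.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a discrete variable."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """2 * I(X;Y) / (H(X) + H(Y)): 0 for independent variables, 1 for equivalent ones."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mutual_information = hx + hy - entropy(list(zip(x, y)))
    return 2 * mutual_information / (hx + hy)


# Hypothetical toy variables.
target = [0, 0, 1, 1]
relevant = ["a", "a", "b", "b"]     # determines the target
redundant = ["u", "u", "v", "v"]    # carries the same information as `relevant`
irrelevant = ["p", "q", "p", "q"]   # independent of the target

su_relevant = symmetric_uncertainty(relevant, target)
su_irrelevant = symmetric_uncertainty(irrelevant, target)
su_between = symmetric_uncertainty(relevant, redundant)
```

A filter of this kind keeps predictors with a high value against the target and drops a predictor when its value against an already selected predictor is too high.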
Keywords: filter, feature selection, correlation based measure, discrete predictors, naive bayes classifier, bootstrap
Components: FEATURE RANKING, CFS FILTERING, MIFS FILTERING, FCBF FILTERING, MODTREE FILTERING, NAIVE BAYES, BOOTSTRAP
Tanagra, "Feature Selection"
Monday, August 30, 2010
Prior to reaching this solution, we had explored different avenues. In this tutorial, we present the XL-SIPINA software, based on Microsoft's OLE technology. In contrast to the add-in solution, this version of SIPINA embeds Excel into the data mining tool. The system works rather well. Nevertheless, it was finally dropped for two reasons: (1) we were forced to compile special versions that work only if Excel is installed on the user's machine; (2) the transfer time between Excel and Sipina using OLE becomes prohibitive when the database size grows.
Thus, XL-SIPINA was essentially a short-lived attempt. There is always a bit of nostalgia when I come back to solutions that I explored and finally abandoned. It may also be that I did not explore this solution completely.
Last, the application was initially developed for Office 97. I note that it is still usable today: it works fine with Office 2010.
Keywords: excel, spreadsheet, sipina, xls, xlsx, xl-sipina, decision tree induction
Download XL-SIPINA: XL-SIPINA
Friday, August 27, 2010
It is possible to import different file formats into SIPINA. For Excel workbooks, a specific mechanism has been implemented.
An add-in is automatically copied to the computer during the installation process. It must be integrated into Excel. The add-in adds a new menu to Excel. After selecting the data range, the user only has to activate it. This triggers the following: (1) SIPINA starts automatically; (2) the data are transferred via the clipboard; (3) SIPINA treats the first row of the cell range as the names of the variables; (4) columns with numerical values are treated as quantitative variables; (5) columns with alphanumeric values are treated as categorical variables.
Unlike the other tutorials, the sequence of operations is described in a video. The description is valid only for versions up to Excel 2003. The use of the add-in under Office 2007 and Office 2010 is described in another tutorial below.
Keywords: excel file format, add-in, decision tree
Installing the add-in : sipina_xla_installation.htm
Using the add-in: sipina_xla_processing.htm
The installation and the use of the "tanagra.xla" add-in under earlier versions of Office (Office 97 to Office 2003) are described elsewhere. That description is obsolete for the latest versions, i.e. Office 2007 and Office 2010, because the organization of the menus has changed. And yet, the add-in is still operational. In this tutorial, we show how to install and use the Tanagra add-in under Office 2007 and 2010.
This transition to recent versions of Excel is not without consequences. Compared to the previous versions, Excel 2007 (and 2010) can handle many more rows and columns. We can process a dataset with up to 1,048,575 observations (the first line corresponds to the variable names) and 16,384 variables. In this tutorial, we treat a database with 100,000 observations and 22 variables (wave100k.xlsx). This is a version of the famous waveform database. Note that this file, because of its number of rows, cannot be handled by earlier versions of Excel.
The process described in this document is also valid for the SIPINA add-in (sipina.xla).
Keywords: data importation, excel, add-in
Components: VIEW DATASET
Tanagra, "Tanagra and Sipina add-ins for Excel 2016", June 2016.
Tanagra, "Excel file handling using an add-in".
Tanagra, "OOo Calc file handling using an add-in".
Tanagra, "Launching Tanagra from OOo Calc under Linux".
Tanagra, "Sipina add-in for Excel"
Saturday, July 24, 2010
We introduce in Tanagra (version 1.4.36 and later) a new presentation of the results of the learning process. The classifier is easier to understand, and its deployment is also made easier.
In the first part of this tutorial, we present some theoretical aspects of the naive bayes classifier. Then, we implement the approach on a dataset with Tanagra. We compare the obtained results (the parameters of the model) to those obtained with other linear approaches such as the logistic regression, the linear discriminant analysis and the linear SVM. We note that the results are highly consistent. This largely explains the good performance of the method in comparison to others.
In the second part, we use various tools on the same dataset (Weka 3.6.0, R 2.9.2, Knime 2.1.1, Orange 2.0b and RapidMiner 4.6.0). We try above all to understand the obtained results.
Keywords: naive bayes, linear classifier, linear discriminant analysis, logistic regression, linear support vector machine, svm
Components: NAIVE BAYES, LINEAR DISCRIMINANT ANALYSIS, BINARY LOGISTIC REGRESSION, SVM, 0_1_BINARIZE
Wikipedia, "Naive bayes classifier".
T. Mitchell, "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression", in Machine Learning, Chapter 1, 2005.
Wednesday, July 21, 2010
Of course, we can perform this process using free tools such as SIPINA (the interactive construction of the tree) or R (the programming of the sequence of operations, in particular applying the model to an unlabeled dataset). But with Spad or other commercial tools (e.g. SPSS Modeler, SAS Enterprise Miner, STATISTICA Data Miner…), we can very easily specify the whole sequence, even if we are not especially familiar with data mining tools.
Keywords: decision tree, classification tree, interactive decision tree, spad, sipina, r software
R Project, http://www.r-project.org/
Monday, July 12, 2010
For the dataset that we analyze in this tutorial, 1.77% of all the examples belong to the positive class. If we assign all the instances to the negative class (this is the default classifier), the misclassification rate is 1.77%. It is difficult to find a classifier that does better, even though we know that the default classifier is not a good one, especially because it does not supply a degree of membership to the classes (note: in fact, it assigns the same degree of membership to all the instances).
One strategy for improving the behavior of learning algorithms facing the imbalance problem is to balance the dataset artificially. We can do this by eliminating some instances of the over-sized class (downsizing) or by duplicating some instances of the small class (over sampling). But few people analyze the consequences of this solution on the performance of the classifier.
In this tutorial, we highlight the consequences of the downsizing on the behavior of the logistic regression.
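The downsizing operation itself can be sketched in a few lines of Python. The function name, the random seed and the sample size of 10,000 instances are assumptions made for the example; only the 1.77% proportion of positives comes from the description above.

```python
import random


def downsize(rows, labels, positive, seed=0):
    """Keep every positive instance and an equally sized random subset of the others.

    Assumes that the positive class is the smaller one.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, t in enumerate(labels) if t == positive]
    neg_idx = [i for i, t in enumerate(labels) if t != positive]
    kept_neg = rng.sample(neg_idx, len(pos_idx))
    keep = sorted(pos_idx + kept_neg)
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Hypothetical sample: 177 positives among 10,000 instances (1.77%).
labels = ["pos"] * 177 + ["neg"] * 9823
rows = list(range(len(labels)))  # placeholder descriptors

# Error rate of the default classifier, which predicts "neg" for everyone.
default_error = labels.count("pos") / len(labels)

balanced_rows, balanced_labels = downsize(rows, labels, positive="pos")
```

After downsizing, the positives represent 50% of the learning sample instead of 1.77%, so the class proportions seen by the logistic regression no longer match those of the population.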
Keywords: imbalanced dataset, logistic regression, over sampling, under sampling
Components: BINARY LOGISTIC REGRESSION, DISCRETE SELECT EXAMPLES, SCORING, RECOVER EXAMPLES, ROC CURVE, TEST
Tutorial : en_Tanagra_Imbalanced_Dataset.pdf
Dataset : imbalanced_dataset.xls
D. Hosmer, S. Lemeshow, « Applied Logistic Regression », John Wiley & Sons, Inc, Second Edition, 2000.
Wednesday, June 9, 2010
In this tutorial, we describe the great "filehash" package for R. It makes it possible to dump any kind of R object into a file. We can then handle these objects without loading them into main memory. This is especially useful for data frames: we can perform a statistical analysis with the usual functions directly from a database on disk. The processing capacities are vastly improved and, at the same time, the increase in computation time remains moderate.
To evaluate the "filehash" solution, we analyze the memory occupation and the computation time, with and without the package, when we learn a decision tree with rpart (rpart package) and perform a linear discriminant analysis with lda (MASS package). We perform the same experiments using SIPINA, which also provides a swapping system (the data is dumped from main memory to temporary files) for handling very large datasets. We can then compare the performances of the various solutions.
Keywords: very large dataset, filehash, decision tree, linear discriminant analysis, sipina, C4.5, rpart, lda
Dataset: wave2M.txt.zip
R package, "Filehash : Simple key-value database"
Yu-Sung Su's Blog, "Dealing with large dataset in R"
Tanagra Tutorial, "MapReduce with R", February 2015.
Tanagra Tutorial, "R programming under Hadoop", April 2015.
Thursday, May 27, 2010
We deal with a credit scoring problem. Using logistic regression, we try to determine the factors underlying the approval or refusal of credit to customers. We perform the following steps:
- Estimating the parameters of the classifier;
- Retrieving the covariance matrix of coefficients;
- Assessment using the Hosmer and Lemeshow goodness of fit test;
- Assessment using the reliability diagram;
- Assessment using the ROC curve;
- Analysis of residuals, detection of outliers and influential points.
We first use Tanagra 1.4.33. Then, we perform the same analysis using the R 2.9.2 software [glm(.) procedure].
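Among the steps listed above, the Hosmer and Lemeshow test is easy to sketch. The instances are sorted by estimated probability and cut into groups; in each group, the observed number of positives is compared with the sum of the estimated probabilities. The Python code below is a minimal sketch under assumptions: the function name, the way ties and group boundaries are handled, and the toy scores are illustrative and do not reproduce the output of Tanagra or R.

```python
def hosmer_lemeshow(y_true, scores, n_groups=10):
    """Hosmer-Lemeshow statistic and its degrees of freedom (n_groups - 2).

    y_true holds 0/1 labels and scores the estimated P(Y = 1).
    Assumes that no group has a mean score of exactly 0 or 1.
    """
    pairs = sorted(zip(scores, y_true))
    n = len(pairs)
    statistic = 0.0
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        size = len(chunk)
        observed = sum(t for _, t in chunk)   # observed positives in the group
        expected = sum(s for s, _ in chunk)   # expected positives in the group
        mean_score = expected / size
        statistic += (observed - expected) ** 2 / (
            size * mean_score * (1 - mean_score))
    return statistic, n_groups - 2


# Hypothetical scores for six customers, cut into three groups.
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.2, 0.2, 0.5, 0.5, 0.8, 0.8]
statistic, dof = hosmer_lemeshow(y_true, scores, n_groups=3)
```

The statistic is then compared with a chi-squared distribution with `dof` degrees of freedom; a large value indicates that the estimated probabilities are poorly calibrated.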
Keywords: logistic regression, residual analysis, outliers, influential points, pearson residual, deviance residual, leverage, cook's distance, dfbeta, dfbetas, hosmer-lemeshow goodness of fit test, reliability diagram, calibration plot, glm()
Components: BINARY LOGISTIC REGRESSION, HOSMER LEMESHOW TEST, RELIABILITY DIAGRAM, LOGISTIC REGRESSION RESIDUALS
D. Garson, "Logistic Regression"
D. Hosmer, S. Lemeshow, « Applied Logistic Regression », John Wiley & Sons, Inc, Second Edition, 2000.
Friday, May 21, 2010
The best discretization is the one performed by a domain expert. Indeed, the expert takes into account information other than that provided by the available dataset. Unfortunately, this kind of approach is not always feasible: often, the domain knowledge is not available or does not make it possible to determine the appropriate discretization; and the process cannot be automated to handle a large number of attributes. So, we are often forced to base the determination of the best discretization on a numerical process.
We consider here the discretization of continuous features as a preprocessing step for supervised learning. First, we must define the context in which we perform the transformation. Depending on the circumstances, the process and the criteria used will not be the same. In this tutorial, we are in the supervised learning framework. We perform the discretization prior to the learning process, i.e. we transform the continuous predictive attributes into discrete ones before presenting them to a supervised learning algorithm. In this context, it is desirable to build intervals in which one and only one value of the target attribute is predominant. The relevance of the computed solution is often evaluated using an impurity-based or an entropy-based function.
In this tutorial, we use only the univariate approaches. We compare the behavior of the supervised and the unsupervised algorithms on an artificial dataset. We use several tools for that: Tanagra 1.4.35, Sipina 3.3, R 2.9.2 (package dprep), Weka 3.6.0, Knime 2.1.1, Orange 2.0b and RapidMiner 4.6.0. We highlight the settings of the algorithms and the reading of the results.
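For reference, the two unsupervised techniques mentioned in the keywords (equal width and equal frequency intervals) can be sketched as follows in Python. The function names and the toy values are assumptions made for the example; the supervised MDLPC method is not reproduced here.

```python
from bisect import bisect_right


def equal_width_cuts(values, k):
    """Cut points of k intervals of identical width between min and max."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(1, k)]


def equal_frequency_cuts(values, k):
    """Cut points of k intervals holding (about) the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, k):
        pos = i * n // k
        # Midpoint between the two neighbouring values around the boundary.
        cuts.append((ordered[pos - 1] + ordered[pos]) / 2)
    return cuts


def discretize(values, cuts):
    """Replace each value by the index of its interval."""
    return [bisect_right(cuts, v) for v in values]


# Hypothetical attribute with one extreme value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
width_cuts = equal_width_cuts(values, 3)
freq_cuts = equal_frequency_cuts(values, 3)
width_codes = discretize(values, width_cuts)
freq_codes = discretize(values, freq_cuts)
```

On this toy attribute, the extreme value pushes almost all the instances into the first equal width interval, whereas the equal frequency intervals stay balanced. Neither technique uses the target attribute, which is precisely what the supervised methods add.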
Keywords: mdlpc, discretization, supervised learning, equal frequency intervals, equal width intervals
Components: MDLPC, Supervised Learning, Decision List
F. Muhlenbach, R. Rakotomalala, « Discretization of Continuous Attributes », in Encyclopedia of Data Warehousing and Mining, John Wang (Ed.), pp. 397-402, 2005 (http://hal.archives-ouvertes.fr/hal-00383757/fr/).
Tanagra Tutorial, "Discretization and Naive Bayes Classifier"
Sunday, May 16, 2010
The SIPINA method is available only in version 2.5 of the SIPINA data mining tool. This version has some drawbacks. Among others, it cannot handle large datasets (more than 16,383 instances). But it is the only tool which implements the decision graphs algorithm. This is the main reason why this version is still available online. If we want to use a decision tree algorithm such as C4.5 or CHAID, or if we want to build a decision tree interactively, it is more advantageous to use the research version (also named version 3.0). The research version is more powerful and supplies many features for data exploration.
In this tutorial, we show how to implement the Sipina decision graph algorithm with the Sipina software version 2.5. We want to predict the low birth weight of newborns from the characteristics of their mothers. Above all, we want to show how to use this 2.5 version, which is not well documented. We also want to point out the interest of decision graphs when we treat a small dataset, i.e. when data fragmentation becomes a crucial problem.
Keywords: decision graphs, decision trees, sipina version 2.5
Wikipedia, "Decision tree learning"
J. Oliver, Decision Graphs: An extension of Decision Trees, in Proc. of Int. Conf. on Artificial Intelligence and Statistics, 1993.
R. Rakotomalala, Graphes d'induction, PhD Dissertation, University Lyon 1, 1997 (URL: http://eric.univ-lyon2.fr/~ricco/publications.html; in French).
D. Zighed, R. Rakotomalala, Graphes d'induction : Apprentissage et Data Mining, Hermes, 2000 (in French).
Friday, May 14, 2010
This version, called 2.5, has been online since 1995. Its development was suspended in 1998 when I started programming version 3.0.
Version 2.5 is the only free tool which implements the decision graphs algorithm. It is a real curiosity in this respect. This is why I still distribute it today.
On the other hand, version 2.5 has some severe limitations. Among others, it can handle only small datasets, up to 16,380 instances. If you want to implement a decision tree or to handle a large dataset, it is always advisable to use the current version (version 3.0 and later).
Setup of the old 2.5 version: Setup_Sipina_V25.exe
User's guide: EnglishDocSipinaV25.pdf
J. Oliver, "Decision Graphs - An Extension of Decision Trees", in Proc. Of the 4-th Int. workshop on Artificial Intelligence and Statistics, pages 343-350, 1993.
R. Rakotomalala, "Induction Graphs", PhD Thesis, University of Lyon 1, 1997 (in French).
D. Zighed, R. Rakotomalala, "Graphes d'Induction - Apprentissage et Data Mining", Hermes, 2000 (in French).
Monday, May 10, 2010
There are two steps when we want to treat this kind of problem: (1) detecting the presence of the collinearity; (2) implementing solutions in order to obtain more consistent results.
In this tutorial, we study three approaches to avoid the multicollinearity problem: the variable selection; the regression on the latent variables provided by PCA (principal component analysis); the PLS regression (partial least squares).
Keywords: linear regression, multiple regression, collinearity, multicollinearity, principal component analysis, PCA, PLS regression
Components: Multiple linear regression, Linear Correlation, Forward Entry Regression, Principal Component Analysis, PLS Regression, PLS Selection, PLS Conf. Interval
Monday, April 26, 2010
The new representation space maintains the proximity between the examples. The new features, known as "factors" or "latent variables", are linear combinations of the original descriptors and have several advantageous properties: (a) their interpretation very often makes it possible to detect patterns in the initial space; (b) a very small number of factors is enough to restore the information contained in the data, and we can moreover remove the noise from the dataset by using only the most relevant factors (it is a sort of regularization by smoothing the information provided by the dataset); (c) the new features form an orthogonal basis, so learning algorithms such as linear discriminant analysis behave better.
This approach is connected to the reduced-rank linear discriminant analysis. But, unlike the latter, it does not need the class information during the computation of the principal components. With an appropriate algorithm (such as NIPALS), the computation can be very fast when we deal with very high-dimensional datasets. On the other hand, it seems that the standard reduced-rank LDA tends to be better in terms of classification accuracy.
Keywords: linear discriminant analysis, principal component analysis, reduced-rank linear discriminant analysis
Components: Supervised Learning, Linear discriminant analysis, Principal Component Analysis, Scatterplot, Train-test
Wikipedia, "Linear discriminant analysis".
Thursday, April 22, 2010
In particular, it is important to explain the reasons for the data preparation and how to read the results. As a reference, we compare the results with those provided by the rule induction tool proposed by Tanagra.
Scientific papers about the method are available online.
Keywords: induction of rules, supervised learning, fuzzy rules
Components: SAMPLING, RULE INDUCTION, TEST
M.R. Berthold, « Mixed fuzzy rule formation », International Journal of Approximate Reasoning, 32, pp. 67-84, 2003.
T.R. Gabriel, M.R. Berthold, « Influence of fuzzy norms and other heuristics on mixed fuzzy rule formation », International Journal of Approximate Reasoning, 35, pp.195-202, 2004.
Friday, April 16, 2010
The approach is as follows: (1) we use the training set for the selection of the most relevant variables for classification; (2) we learn the model on selected descriptors; (3) we assess the performance on a test set containing all the descriptors.
This third point is very important. We cannot know the variables that will be finally selected. We do not have to manually prepare the test file by including only those which have been selected by the wrapper procedure. This is essential for the automation of the process. Indeed, otherwise, each change of setting in the wrapper procedure leading to another subset of descriptors would require us to manually edit the test file. This is very tedious.
In the light of this specification, it appeared that only Knime was able to implement the complete process. With the other tools, it is possible to select the relevant variables on the training file. But I could not (or did not know how to) apply the model to a test file containing all the original variables.
The naive bayes classifier is the learning method used in this tutorial.
Keywords: feature selection, supervised learning, naive bayes classifier, wrapper, knime, weka, rapidminer
JMLR Special Issue on Variable and Feature Selection - 2003
R Kohavi, G. John, « The wrapper approach », 1997.
Wikipedia, "Naive bayes classifier".
Tuesday, March 30, 2010
Three kinds of approaches are often highlighted in the literature. Among them, the WRAPPER approach explicitly uses a performance criterion during the search for the best subset of descriptors. Most often, this is the error rate. But in reality, any kind of criterion can be used. It may be the cost if we use a misclassification cost matrix. It can be the area under the curve (AUC) when we assess the classifier using ROC curves, etc. In this case, the learning method is considered as a black box. We try various subsets of predictors and choose the one that optimizes the criterion.
In this tutorial, we implement the WRAPPER approach with SIPINA and R 2.9.2. For the latter, we give the source code for a forward search strategy. Readers can easily adapt the program to other datasets. Moreover, a careful reading of the R source code gives a better understanding of the calculations made internally by SIPINA.
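The forward search of a wrapper can be sketched independently of the learning method, since the classifier is treated as a black box. The Python code below is a minimal sketch, not the R program of the tutorial: the function names are assumptions, and the error function is a hypothetical stand-in for a real evaluation (for instance, the cross-validated error rate of a naive bayes classifier).

```python
def forward_wrapper(candidates, evaluate):
    """Greedy forward search.

    `evaluate(subset)` is a black box returning the criterion to minimize
    (here an error rate) for a list of variable names.
    """
    selected = []
    best_error = evaluate(selected)
    remaining = list(candidates)
    while remaining:
        # Try adding each remaining variable to the current subset.
        trials = [(evaluate(selected + [v]), v) for v in remaining]
        error, variable = min(trials)
        if error >= best_error:
            break  # no candidate improves the criterion: stop
        selected.append(variable)
        remaining.remove(variable)
        best_error = error
    return selected, best_error


def toy_error(subset):
    """Hypothetical error rate: 'a' and 'b' help, 'noise' slightly hurts."""
    error = 0.5
    if "a" in subset:
        error -= 0.2
    if "b" in subset:
        error -= 0.1
    if "noise" in subset:
        error += 0.01
    return error


selected, best_error = forward_wrapper(["noise", "a", "b"], toy_error)
```

Replacing `toy_error` by another criterion (a misclassification cost, or 1 - AUC) changes nothing in the search itself, which is the point of the black box view.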
The WRAPPER strategy is a priori the best since it explicitly optimizes the performance criterion. We verify this by comparing the results with those provided by the FILTER approach (FCBF method) available in TANAGRA. The conclusions are not as obvious as one might think.
Keywords: feature selection, supervised learning, naive bayes classifier, wrapper, fcbf, sipina, R software, RWeka package
Components: DISCRETE SELECT EXAMPLES, FCBF FILTERING, NAIVE BAYES, TEST
JMLR Special Issue on Variable and Feature Selection - 2003
R Kohavi, G. John, « The wrapper approach », 1997.
Tuesday, March 23, 2010
Naive Bayes was modified. It now describes the prediction model in an explicit form (as a linear combination), easy to understand and to deploy.
Thursday, February 11, 2010
Among the rule induction methods, the "separate and conquer" approaches were very popular during the 1990s. Curiously, they are less present today in proceedings or journals. More troublesome still, they are not implemented in commercial software. They are only available in free tools from the Machine Learning community. However, they have several advantages compared to other techniques.
In this tutorial, we first describe two separate and conquer algorithms for the rule induction process. Then, we show the behavior of the classification rule algorithms implemented in various tools such as Tanagra 1.4.34, Sipina Research 3.3, Weka 3.6.0, R 2.9.2 with the RWeka package, RapidMiner 4.6, or Orange 2.0b.
Keywords: rule induction, separate and conquer, top-down, CN2, decision tree
Components: SAMPLING, DECISION LIST, RULE INDUCTION, TEST
J. Furnkranz, "Separate-and-conquer Rule Learning", Artificial Intelligence Review, Volume 13, Issue 1, pages 3-54, 1999.
P. Clark, T. Niblett, "The CN2 Rule Induction Algorithm", Machine Learning, 3(4):261-283, 1989.
P. Clark, R. Boswell, "Rule Induction with CN2: Some recent improvements", Machine Learning - EWSL-91, pages 151-163, Springer Verlag, 1991.
Tuesday, January 19, 2010
CTP. The method for detecting the right size of the tree has been modified for the "Clustering Tree" with post-pruning component (CTP). It relies both on the angle between the half-lines at each point of the decreasing curve of the WSS (within-group sum of squares) computed on the growing sample and on the decrease of the same indicator computed on the pruning sample. Compared to the previous implementation, it results in a smaller number of clusters.
Regression Tree. The previous modification is incorporated into the Regression Tree component which is a univariate version of CTP.
C-RT Regression Tree. A new regression tree component was added. It faithfully implements the technique described in the book by Breiman et al. (1984), including the post-pruning part with the 1-SE Rule (Chapter 8, especially p. 226 about the formula for the variance of the MSE).
C-RT. The report of the induction of decision tree C-RT has been completed. Based on the last column of the post-pruning table, it becomes easier to choose the parameter x (in x-SE Rule) to arbitrarily define the size of the pruned tree.
Some tutorials will describe these various changes soon.
Monday, January 4, 2010
To overcome this limitation, we should design solutions that copy all or part of the data to disk and perform the processing by loading into memory only what is necessary at each step of the algorithm (the instances and/or the variables). If the solution is theoretically simple, it is difficult in practice. Indeed, the processing time should remain reasonable even if we increase the disk accesses. It is very difficult to implement a strategy that is effective regardless of the learning algorithm used (supervised learning, clustering, factorial analysis, etc.). These algorithms handle the data in very different ways: some of them make intensive use of matrix operations; others mainly search for co-occurrences between attribute-value pairs, etc.
In this tutorial, we present a specific solution in the context of decision tree induction. The solution is integrated into SIPINA (as an option) because its internal data structure is especially designed for decision tree induction. Developing an approach which takes advantage of the specificities of the learning algorithm was easy in this context. We show that it is then possible to handle a very large dataset (41 variables and 9,634,198 observations) and to use all the functionalities of the tool (interactive construction of the tree, local descriptive statistics on nodes, etc.).
To fully appreciate the solution proposed by Sipina, we compare its behavior to generalist data mining tools such as Tanagra 1.4.33 or Knime 2.03.
Keywords: very large dataset, decision tree, sampling, sipina, knime
Link: en_Sipina_Large_Dataset.pdf
Dataset: twice-kdd-cup-discretized-descriptors.zip
Tanagra, « Decision tree and large dataset ».
Tanagra, « Local sampling for decision tree learning »
Saturday, January 2, 2010
Among the many variants of decision tree learning algorithms, CART is probably the one that best detects the right size of the tree.
In this tutorial, we describe the selection mechanism used by CART during the post-pruning process. We also show how to set the appropriate value of the parameter of the algorithm in order to obtain a specific (user-defined) tree.
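The selection mechanism can be sketched as follows in Python: among the candidate pruned trees, we keep the smallest one whose error does not exceed the minimum error plus x times its standard error (x = 1 gives the 1-SE rule). The function name, the binomial formula used for the standard error and the table of candidate trees are assumptions made for the example; they are not taken from the C-RT report.

```python
from math import sqrt


def select_tree(trees, n_pruning, x=1.0):
    """x-SE rule on a list of (number of leaves, error rate) pairs.

    The error rates are assumed to be measured on a pruning sample of
    n_pruning instances; returns the (leaves, error) pair of the chosen tree.
    """
    _, best_error = min(trees, key=lambda t: t[1])
    # Standard error of the smallest error rate (binomial approximation).
    standard_error = sqrt(best_error * (1 - best_error) / n_pruning)
    threshold = best_error + x * standard_error
    eligible = [t for t in trees if t[1] <= threshold]
    return min(eligible, key=lambda t: t[0])


# Hypothetical sequence of pruned trees: (leaves, error on the pruning sample).
trees = [(1, 0.40), (2, 0.25), (4, 0.21), (7, 0.20), (12, 0.22)]

tree_0se = select_tree(trees, n_pruning=400, x=0.0)  # the minimum-error tree
tree_1se = select_tree(trees, n_pruning=400, x=1.0)  # the usual 1-SE rule
tree_3se = select_tree(trees, n_pruning=400, x=3.0)  # a more aggressive pruning
```

Increasing x widens the tolerance around the minimum error and therefore yields a smaller tree, which is how a user-defined tree size can be reached.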
Keywords: decision tree, CART, 1-SE Rule, post-pruning
Components: Discrete select examples, Supervised Learning, C-RT, Test
L. Breiman, J. Friedman, R. Olshen, C. Stone, "Classification and Regression Trees", California: Wadsworth International, 1984.
R. Rakotomalala, "Arbres de décision", Revue Modulad, 33, 163-187, 2005 (tutoriel_arbre_revue_modulad_33.pdf)