This Web log maintains an alternative layout of the tutorials about Tanagra. Each entry describes shortly the subject, it is followed by the link to the tutorial (pdf) and the dataset. The technical references (book, papers, website,...) are also provided. In some tutorials, we compare the results of Tanagra with other free software such as Knime, Orange, R software, Python, Sipina or Weka.
Wednesday, June 9, 2010
Handling large dataset in R - The "filehash" package
In this tutorial, we describe the great "filehash" package for R. It allows to copy (to dump) any kind of R objects into a file. We can handle these objects without loading them into main memory. This is especially useful for the data frame object. Indeed, we can perform a statistical analysis with the usual functions directly from a database on the disk. The processing capacities are vastly improved and, in the same time, we will note that the increase in computation time remains moderate.
To evaluate the "filehash" solution, we analyze the memory occupation and the computation time, with and without utilization of the package, during the performing of decision tree learning with rpart (rpart package) and a linear discriminant analysis with lda (MASS package). We perform the same experiments using SIPINA. Indeed, it provides also a swapping system (the data is dumped from the main memory to temporary files) for the handling of very large dataset. We can then compare the performances of the various solutions.
Keywords: very large dataset, filehash, decision tree, linear discriminant analysis, sipina, C4.5, rpart, lda
Tutorial: en_Tanagra_Dealing_Very_Large_Dataset_With_R.pdf
Données : wave2M.txt.zip
References :
R package, "Filehash : Simple key-value database"
Yu-Sung Su's Blog, "Dealing with large dataset in R"
Tanagra Tutorial, "MapReduce with R", February 2015.
Tanagra Tutorial, "R programming under Hadoop", April 2015.
Thursday, May 27, 2010
Logistic Regression Diagnostics
We deal with a credit scoring problem. We try to determine by using logistic regression the factors underlying the agreement or refusal of a credit to customers. We perform the following steps:
- Estimating the parameters of the classifier;
- Retrieving the covariance matrix of coefficients;
- Assessment using the Hosmer and Lemeshow goodness of fit test;
- Assessment using the reliability diagram;
- Assessment using the ROC curve;
- Analysis of residuals, detection of outliers and influential points.
On the one hand, we use Tanagra 1.4.33. Then, on the other hand, we perform the same analysis using the R 2.9.2 software [glm(.) procedure].
Keywords: logistic regression, residual analysis, outliers, influential points, pearson residual, deviance residual, leverage, cook's distance, dfbeta, dfbetas, hosmer-lemeshow goodness of fit test, reliability diagram, calibration plot, glm()
Components: BINARY LOGISTIC REGRESSION, HOSMER LEMESHOW TEST, RELIABILITY DIAGRAM, LOGISTIC REGRESSION RESIDUALS
Tutorial: en_Tanagra_Logistic_Regression_Diagnostics.pdf
Dataset: logistic_regression_diagnostics.zip
References :
D. Garson, "Logistic Regression"
D. Hosmer, S. Lemeshow, « Applied Logistic Regression », John Wiley &Sons, Inc, Second Edition, 2000.
Friday, May 21, 2010
Discretization of continuous features
The best discretization is the one performed by an expert domain. Indeed, he takes into account other information than those only provided by the available dataset. Unfortunately, this kind of approach is not always feasible because: often, the domain knowledge is not available or it does not allow to determine the appropriate discretization; the process cannot be automated to handle a large number of attributes. So, we are often forced to found the determination of the best discretization on a numerical process.
Discretization of continuous features as preprocessing for supervised learning process. First, we must define the context in which we perform the transformation. Depending on the circumstances, it is clear that the process and criteria used will not be the same. In this tutorial, we are in the supervised learning framework. We perform the discretization prior to the learning process i.e. we transform the continuous predictive attributes into discrete before to present them to a supervised learning algorithm. In this context, the construction of intervals in which one and only one of the values of the target attribute is the most represented is desirable. The relevance of the computed solution is often evaluated through an impurity based or an entropy based functions.
In this tutorial, we use only the univariate approaches. We compare the behavior of the supervised and the unsupervised algorithms on an artificial dataset. We use several tools for that: Tanagra 1.4.35, Sipina 3.3, R 2.9.2 (package dprep), Weka 3.6.0, Knime 2.1.1, Orange 2.0b and RapidMiner 4.6.0. We highlight the settings of the algorithms and the reading of the results.
Keywords: mdlpc, discretization, supervised learning, equal frequency intervals, equal width intervals
Components: MDLPC, Supervised Learning, Decision List
Tutorial: en_Tanagra_Discretization_for_Supervised_Learning.pdf
Dataset: data-discretization.arff
References :
F. Muhlenbach, R. Rakotomalala, « Discretization of Continuous Attributes », in Encyclopedia of Data Warehousing and Mining, John Wang (Ed.), pp. 397-402, 2005 (http://hal.archives-ouvertes.fr/hal-00383757/fr/).
Tanagra Tutorial, "Discretization and Naive Bayes Classifier"
Sunday, May 16, 2010
Sipina Decision Graph Algorithm (case study)
The SIPINA method is only available under the version 2.5 of SIPINA data mining tool. This version has some drawbacks. Among others, it cannot handle large datasets (higher than 16.383 instances). But it is the only tool which implements the decision graphs algorithm. This is the main reason for which this version is available online to date. If we want to implement a decision tree algorithm such as C4.5 or CHAID, or if we want to create interactively a decision tree , it is more advantageous to use the research version (named also version 3.0). The research version is more powerful and it supplies much functionality for the data exploration.
In this tutorial, we show how to implement the Sipina decision graph algorithm with the Sipina software version 2.5. We want to predict the low birth weight of newborns from the characteristics of their mothers. We want foremost to show how to use this 2.5 version which is not well documented. We want also to point out the interest of the decision graphs when we treat a small dataset i.e. when the data fragmentation becomes a crucial problem.
Keywords: decision graphs, decision trees, sipina version 2.5
Tutorial: en_sipina_method.pdf
Dataset: low_birth_weight_v4.xls
References:
Wikipedia, "Decision tree learning"
J. Oliver, Decision Graphs: An extension of Decision Trees, in Proc. of Int. Conf. on Artificial Intelligence and Statistics, 1993.
R. Rakotomalala, Graphes d'induction, PhD Dissertation, University Lyon 1, 1997 (URL: http://eric.univ-lyon2.fr/~ricco/publications.html; in french).
D. Zighed, R. Rakotomalala, Graphes d'induction : Apprentissage et Data Mining, Hermes, 2000 (in French).
Friday, May 14, 2010
User's guide for the old Sipina 2.5 version
This version, called 2.5, is online since 1995. Its development was suspended in 1998 when I started programming the version 3.0.
This version 2.5 is the only free tool which implements the decision graphs algorithm. This is a real curiosity in this respect. This is the reason for which I still distribute this version to date.
On the other hand, this 2.5 version has some severe limitations. Among others, it can handle only small dataset, up to 16.380 instances. If you want to implement a decision tree or if you want to handle a large dataset, it is always advised to use the current version (version 3.0 and later).
Setup of the old 2.5 version: Setup_Sipina_V25.exe
User's guide: EnglishDocSipinaV25.pdf
References:
J. Oliver, "Decision Graphs - An Extension of Decision Trees", in Proc. Of the 4-th Int. workshop on Artificial Intelligence and Statistics, pages 343-350, 1993.
R. Rakotomalala, "Induction Graphs", PhD Thesis, University of Lyon 1, 1997 (in French).
D. Zighed, R. Rakotomalala, "Graphes d'Induction - Apprentissage et Data Mining", Hermes, 2000 (in French).
Monday, May 10, 2010
Solutions for multicollinearity in multiple regression
There are two steps when we want to treat this kind of problem: (1) detecting the presence of the collinearity; (2) implementing solutions in order to obtain more consistent results.
In this tutorial, we study three approaches to avoid the multicollinearity problem: the variable selection; the regression on the latent variables provided by PCA (principal component analysis); the PLS regression (partial least squares).
Keywords: linear regression, multiple regression, collinearity, multicollinearity, principal component analysis, PCA, PLS regression
Component : Multiple linear regression, Linear Correlation, Forward Entry Regression, Principal Component Analysis, PLS Regression, PLS Selection, PLS Conf. Interval
Tutorial: en_Tanagra_Regression_Colinearity.pdf
Dataset: car_consumption_colinearity_regression.xls
References :
Wikipedia, "Multicollinearity"
Monday, April 26, 2010
Linear discriminant analysis on PCA factors
The new representation space maintains the proximity between the examples. The new features known as "factors" or "latent variables", which are a linear combination of the original descriptors, have several advantageous properties: (a) their interpretation very often allows to detect patterns in the initial space; (b) a very reduced number of factors allows to restore information contained in the data, we can moreover remove the noise from the dataset by using only the most relevant factors (it is a sort of regularization by smoothing the information provided by the dataset); (c) the new features form an orthogonal basis, learning algorithms such as linear discriminant analysis have a better behavior.
This approach has a connection to the reduced-rank linear discriminant analysis. But, instead to this last one, the class information is not needed during the computations of the principal components. The computation can be very fast using an appropriate algorithm when we deal with very high-dimensional dataset (such as NIPALS). But, on the other hand, it seems that the standard reduced-rank LDA tends to be better in terms of classification accuracy.
Keywords: linear discriminant analysis, principal component analysis, reduced-rank linear discriminant analysis
Components: Supervised Learning, Linear discriminant analysis, Principal Component Analysis, Scatterplot, Train-test
Tutorial: en_dr_utiliser_axes_factoriels_descripteurs.pdf
Dataset: dr_waveform.bdm
References:
Wikipedia, "Linear discriminant analysis".
Thursday, April 22, 2010
Induction of fuzzy rules using Knime
Especially, it is important to detail the reason of the data preparation and the reading of the results. To have a reference, we compare the results with those provided by the rule induction tool proposed by Tanagra.
Scientific papers about the method are available on line.
Keywords: induction of rules, supervised learning, fuzzy rules
Components: SAMPLING, RULE INDUCTION, TEST
Tutorial: en_Tanagra_Induction_Regles_Floues_Knime.pdf
Dataset: iris2D.txt
References :
M.R. Berthold, « Mixed fuzzy rule formation », International Journal of Approximate Reasonning, 32, pp. 67-84, 2003.
T.R. Gabriel, M.R. Berthold, « Influence of fuzzy norms and other heuristics on mixed fuzzy rule formation », International Journal of Approximate Reasoning, 35, pp.195-202, 2004.
Friday, April 16, 2010
"Wrapper" for feature selection (continuation)
The approach is as follows: (1) we use the training set for the selection of the most relevant variables for classification; (2) we learn the model on selected descriptors; (3) we assess the performance on a test set containing all the descriptors.
This third point is very important. We cannot know the variables that will be finally selected. We do not have to manually prepare the test file by including only those which have been selected by the wrapper procedure. This is essential for the automation of the process. Indeed, otherwise, each change of setting in the wrapper procedure leading to another subset of descriptors would require us to manually edit the test file. This is very tedious.
In the light of this specification, it appeared that only Knime was able to implement the complete process. With the other tools, it is possible to select the relevant variables on the training file. But, I could not (or I did not know) apply the model on a test file containing all the original variables.
The naive bayes classifier is the learning method used in this tutorial .
Keywords: feature selection, supervised learning, naive bayes classifier, wrapper, knime, weka, rapidminer
Tutorial: en_Tanagra_Wrapper_Continued.pdf
Dataset: mushroom_wrapper.zip
References :
JMLR Special Issue on Variable and Feature Selection - 2003
R Kohavi, G. John, « The wrapper approach », 1997.
Wikipedia, "Naive bayes classifier".
Tuesday, March 30, 2010
"Wrapper" for feature selection
Three kinds of approaches are often highlighted into the literature. Among them, the WRAPPER approach uses explicitly a performance criterion during the search of the best subset of descriptors. Most often, this is the error rate. But in reality, any kind of criteria can be used. This may be the cost if we use a misclassification cost matrix. It can be the area under curve (AUC) when we assess the classifier using ROC curves, etc. In this case, the learning method is considered as a black box. We try various subsets of predictors. We will choose the one that optimizes the criterion.
In this tutorial, we implement the WRAPPER approach with SIPINA and R 2.9.2. For this last one, we give the source code for a forward search strategy. The readers can easily adapt the program to other dataset. Moreover, a careful reading of the source code for R gives a better understanding about the calculations made internally by SIPINA.
The WRAPPER strategy is a priori the best since it explicitly optimizes the performance criterion. We verify this by comparing the results with those provided by the FILTER approach (FCBF method) available into TANAGRA. The conclusions are not as obvious as one can think.
Keywords: feature selection, supervised learning, naive bayes classifier, wrapper, fcbf, sipina, R software, RWeka paclage
Components: DISCRETE SELECT EXAMPLES, FCBF FILTERING, NAIVE BAYES, TEST
Tutorial: en_Tanagra_Sipina_Wrapper.pdf
Dataset: mushroom_wrapper.zip
References :
JMLR Special Issue on Variable and Feature Selection - 2003
R Kohavi, G. John, « The wrapper approach », 1997.
Tuesday, March 23, 2010
Tanagra - Version 1.4.36
Naive Bayes was modified. It now described a prediction model in an explicit form (in a linear combination form), easy to understand and to deploy.
Thursday, February 11, 2010
Supervised rule induction - Software comparison
Among the rule induction methods, the "separate and conquer" approaches are very popular during the 90's. Curiously, they are less present today into proceedings or journals. More troublesome still, they are not implemented in commercial software. They are only available in free tools from the Machine Learning community. However, they have several advantages compared to other techniques.
In this tutorial, we describe first two separate and conquer algorithms for the rule induction process. Then, we show the behavior of the classification rules algorithms implemented in various tools such as Tanagra 1.4.34, Sipina Research 3.3, Weka 3.6.0, R 2.9.2 with the RWeka package, RapidMiner 4.6, or Orange 2.0b.
Keywords: rule induction, separate and conquer, top-down, CN2, decision tree
Composants : SAMPLING, DECISION LIST, RULE INDUCTION, TEST
Tutorial: en_Tanagra_Rule_Induction.pdf
Dataset: life_insurance.zip
References:
J. Furnkranz, "Separate-and-conquer Rule Learning", Artificial Intelligence Review, Volume 13, Issue 1, pages 3-54, 1999.
P. Clark, T. Niblett, "The CN2 Rule Induction Algorithm", Machine Learning, 3(4):261-283, 1989.
P. Clark, R. Boswell, "Rule Induction with CN2: Some recent improvements", Machine Learning - EWSL-91, pages 151-163, Springer Verlag, 1991.
Tuesday, January 19, 2010
Tanagra - Version 1.4.35
CTP. The method of detection of the right size of the tree is modified for the "Clustering Tree" with post-pruning component (CTP). It relies both on the angle between half-lines at each point on the curve of decreasing the WSS (within-group sum of squares) on the growing sample and the decrease of the same indicator computed on the pruning sample. Compared to the previous implementation, it results in a smaller number of clusters.
Regression Tree. The previous modification is incorporated into the Regression Tree component which is a univariate version of CTP.
C-RT Regression Tree. A new regression tree component was added. It faithfully implements the technique described in the Breiman's and al. (1984) book, including the post-pruning part with the 1-SE Rule (Chapter 8, especially p. 226 about the formula for the variance of the MSE).
C-RT. The report of the induction of decision tree C-RT has been completed. Based on the last column of the post-pruning table, it becomes easier to choose the parameter x (in x-SE Rule) to arbitrarily define the size of the pruned tree.
Some tutorials will describe these various changes soon.
Monday, January 4, 2010
Dealing with very large dataset in Sipina
To overcome this limitation, we should design solutions that allow to copy all or part of the data on disk, and perform treatments by loading into memory only what is necessary at each step of the algorithm (the instances and/or the variables). If the solution is theoretically simple, it is difficult in practice. Indeed, the processing time should remain reasonable even if we increase the disk access. It is very difficult to implement a strategy that is effective regardless of the learning algorithm used (supervised learning, clustering, factorial analysis, etc.). They handle the data in very different way: some of them use intensively matrix operations; the others search mainly the co-occurrence between attribute-value pairs, etc.
In this tutorial, we present a specific solution in the induction tree context. The solution is integrated into SIPINA (as optional) because its internal data structure is especially intended to the decision tree induction. Developing an approach which takes advantages of the specificities of the learning algorithm was easy in this context. We show that it is then possible to handle a very large dataset (41 variables and 9,634,198 observations) and to use all the functionalities of the tool (interactive construction of the tree, local descriptive statistics on nodes, etc.).
To fully appreciate the solution proposed by Sipina, we compare its behavior to generalist data mining tools such as Tanagra 1.4.33 or Knime 2.03.
Keywords: very large dataset, decision tree, sampling, sipina, knime
Components: ID3
Lien : en_Sipina_Large_Dataset.pdf
Données : twice-kdd-cup-discretized-descriptors.zip
Références :
Tanagra, « Decision tree and large dataset ».
Tanagra, « Local sampling for decision tree learning »
Saturday, January 2, 2010
CART - Determining the right size of the tree
Among the many variants of decision trees learning algorithms, CART is probably the one that detects better the right size of the tree.
In this tutorial, we describe the selection mechanism used by CART during the post-pruning process. We show also how to set the appropriate value of the parameter of the algorithm in order to obtain a specific (a user-defined) tree.
Keywords: decision tree, CART, 1-SE Rule, post-pruning
Components: Discrete select examples, Supervised Learning, C-RT, Test
Tutorial: en_Tanagra_Tree_Post_Pruning.pdf
Dataset: adult_cart_decision_trees.zip
References :
L. Breiman, J. Friedman, R. Olshen, C. Stone, " Classification and Regression Trees ", California : Wadsworth International, 1984.
R. Rakotomalala, " Arbres de décision ", Revue Modulad, 33, 163-187, 2005 (tutoriel_arbre_revue_modulad_33.pdf)