Thursday, December 9, 2010

Creating reports with Tanagra

The ability to automatically create reports from the results of an analysis is a valuable functionality in Data Mining. But it is mostly an asset of professional tools. Programming this kind of functionality is not really rewarded in the academic world: I do not think I could publish a paper in a journal describing the ability of Tanagra to create attractive reports. This is why the output of academic tools, such as R or Weka, is mostly plain formatted text.

Tanagra, which is an academic tool, also provides text outputs. The programming remains simple, as a glance at the source code shows. But, in order to make the presentation more attractive, it uses HTML to format the results. I take advantage of this feature to generate reports without any particular programming effort. Tanagra is one of the few academic tools able to produce reports that can easily be displayed in office automation software. For instance, the tables can be copied into Excel spreadsheets for further calculations. More generally, the results can be viewed in a browser, regardless of the data mining software.

These are the reporting features of Tanagra that we present in this tutorial.

Keywords: reporting, decision tree, c4.5, logistic regression, binary coding, roc curve, learning sample, test sample, forward, feature selection
Components: GROUP CHARACTERIZATION, SAMPLING, C4.5, TEST, 0_1_BINARIZE, FORWARD-LOGIT, BINARY LOGISTIC REGRESSION, SCORING, ROC CURVE
Tutorial: en_Tanagra_Reporting.pdf
Dataset: heart disease

Wednesday, November 24, 2010

Multithreading for decision tree induction

Nowadays, most modern personal computers (PC) have multicore processors. The computer operates as if it had several processors. Software and data mining algorithms must be modified in order to benefit from this new feature.

Currently, few free tools exploit this opportunity, because it is impossible to define a generic approach that would be valid regardless of the learning method used: each learning algorithm must be modified individually. For a given technique, decomposing the algorithm into elementary tasks that can be executed in parallel is a research field in itself. Then, we must adopt a programming technology which is easy to implement.

In this tutorial, I propose a technology based on threads for the induction of decision trees. It is well suited to our context for various reasons. (1) It is easy to program with modern programming languages. (2) Threads can share information and modify common objects; efficient synchronization tools make it possible to avoid data corruption. (3) We can launch multiple threads on a single-core, single-processor system. This brings no real advantage, but at least the system does not crash. (4) On a multiprocessor or multi-core system, the threads actually run at the same time, each processor or core running a particular thread. But, because of the necessary synchronization between threads, the computation time is not simply divided by the number of cores.

First, we briefly present the modification of the decision tree learning algorithm needed to benefit from multithreading. Then, we show how to implement the approach with SIPINA (version 3.5 and later). We also show that multithreaded decision tree learners are available in various tools such as Knime 2.2.2 or RapidMiner 5.0.011. Last, we study the behavior of the multithreaded algorithms according to the dataset characteristics.
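
To make the decomposition concrete, here is a minimal sketch of the principle (not SIPINA's code): at a given node, the goodness of each candidate split attribute is evaluated by a pool of worker threads that share the dataset read-only, and the best attribute is kept. The example is written in Python with an information-gain criterion on a toy dataset; the names and the criterion are purely illustrative, and pure Python threads do not bring the speedup of a compiled implementation such as the one discussed in the tutorial.

from concurrent.futures import ThreadPoolExecutor
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_gain(rows, labels, attribute):
    # information gain obtained by splitting the node on one discrete attribute
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_split(rows, labels, attributes, n_threads=4):
    # one task per candidate attribute; the threads only read the shared data,
    # so no synchronization is needed during the evaluation itself
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        gains = pool.map(lambda a: (split_gain(rows, labels, a), a), attributes)
    return max(gains)

rows = [{"outlook": "sunny", "windy": "yes"}, {"outlook": "rain", "windy": "no"},
        {"outlook": "sunny", "windy": "no"}, {"outlook": "rain", "windy": "yes"}]
labels = ["play", "stay", "play", "stay"]
print(best_split(rows, labels, ["outlook", "windy"]))   # -> (1.0, 'outlook')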

Keywords: multithreading, thread, threads, decision tree, chaid, sipina 3.5, knime 2.2.2, rapidminer 5.0.011
Tutorial: en_sipina_multithreading.pdf
Dataset: covtype.arff.zip
References:
Wikipedia, "Decision tree learning"
Wikipedia, "Thread (Computer science)"
Aldinucci, Ruggieri, Torquati, "Porting Decision Tree Algorithms to Multicore using FastFlow", PKDD 2010.

Thursday, November 11, 2010

Naive bayes classifier for continuous predictors

The naive bayes classifier is a very popular approach even if it is (apparently) based on an unrealistic assumption: the distributions of the predictors are mutually independent conditionally on the values of the target attribute. The main reason for this popularity is that the method has proved to be as accurate as other well-known approaches, such as linear discriminant analysis or logistic regression, on the majority of real datasets.

But an obstacle to the use of the naive bayes classifier remains when we deal with a real problem: it seems that we cannot provide an explicit model for its deployment. The representation proposed by the PMML standard, for instance, is particularly unattractive. The interpretation of the model, especially the detection of the influence of each descriptor on the prediction of the classes, seems impossible.

This assertion is not entirely true. We showed in a previous tutorial that we can extract an explicit model from the naive bayes classifier in the case of discrete predictors (see references): we obtain a linear combination of the binarized predictors. In this document, we show that the same mechanism can be implemented for continuous descriptors. We use the standard Gaussian assumption for the conditional distribution of the descriptors. Depending on whether we adopt the heteroscedastic or the homoscedastic assumption, we obtain a quadratic model or a linear model. The latter is especially interesting because we obtain a model that we can directly compare to the other linear classifiers (the sign and the values of the coefficients of the linear combination).
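
As a rough illustration of the homoscedastic case, here is a minimal numpy sketch (not Tanagra's internal code; the sign and intercept conventions may differ from Tanagra's report) of how the Gaussian naive bayes classifier reduces to one linear score per class: score_k(x) = log(pi_k) - sum_j mu_jk^2 / (2 s_j^2) + sum_j (mu_jk / s_j^2) x_j, where s_j^2 is the pooled within-class variance of predictor j.

import numpy as np

def nb_linear_model(X, y):
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    # pooled within-class variance of each predictor (homoscedastic assumption)
    pooled_var = sum(X[y == k].var(axis=0) * (y == k).sum() for k in classes) / len(y)
    coefs = means / pooled_var                                   # one weight vector per class
    intercepts = np.log(priors) - 0.5 * (means ** 2 / pooled_var).sum(axis=1)
    return classes, intercepts, coefs

def predict(x, classes, intercepts, coefs):
    return classes[np.argmax(intercepts + coefs @ x)]            # the score is linear in x

Under the heteroscedastic assumption, the per-class variances no longer cancel out and the squared terms remain, which gives the quadratic model mentioned above.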

This tutorial is organized as follows. In the next section, we describe the approach. In section 3, we show how to implement the method with Tanagra 1.4.37 (and later), and we compare the results to those of the other linear methods. In section 4, we compare the results provided by various data mining tools. We note that none of them proposes an explicit model that would be easy to deploy; they give only the estimated parameters of the conditional Gaussian distributions (mean and standard deviation). Last, in section 5, we show the advantage of the naive bayes classifier over the other linear methods when we handle a large dataset (the "mutant" dataset - 16,592 instances and 5,408 predictors): the computation time and the memory occupancy are clearly in its favor.

Keywords: naive bayes classifier, rapidminer 5.0.10, weka 3.7.2, knime 2.2.2, R software, package e1071, linear discriminant analysis, pls discriminant analysis, linear svm, logistic regression
Components: NAIVE BAYES CONTINUOUS, BINARY LOGISTIC REGRESSION, SVM, C-PLS, LINEAR DISCRIMINANT ANALYSIS
Tutorial: en_Tanagra_Naive_Bayes_Continuous_Predictors.pdf
Dataset: breast ; low birth weight
References:
Wikipedia, "Naive bayes classifier"
Tanagra, "Naive bayes classifier for discrete predictors"

Tuesday, October 19, 2010

Tanagra - Version 1.4.37

Naive Bayes Continuous is a supervised learning component. It implements the naive bayes principle for continuous predictors (Gaussian assumption, heteroscedasticity or homoscedasticity). Its main originality is that it provides an explicit model in the form of a linear combination of the predictors and, possibly, their squares.

Enhancement of the reporting module.

Thursday, October 14, 2010

Filter methods for feature selection

The nature of the predictor selection process has changed considerably. Previously, work in machine learning concentrated on the search for the best subset of features for a classifier, in a context where the number of candidate features was rather small and computing time was not a major constraint. Today, it is common to deal with datasets comprising thousands of descriptors. Consequently, the feature selection problem still consists of finding the most relevant subset of predictors, but with a new strong constraint: the computing time must remain reasonable.

In this tutorial, we are interested in correlation-based filter approaches for discrete predictors. The goal is to highlight the most relevant subset of predictors, i.e. those which are highly correlated with the target attribute and, at the same time, weakly correlated with each other (not redundant). To evaluate the behavior of the various methods, we use an artificial dataset to which we add irrelevant and redundant candidate variables. Then, we perform a feature selection based on the approaches analyzed, and we compare the generalization error rate of the naive bayes classifier learned from the various subsets of selected variables. We first run the experiment with Tanagra; then we show how to perform the same analysis with other tools (Weka 3.6.0, Orange 2.0b, RapidMiner 4.6.0, R 2.9.2 - package FSelector).
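
To illustrate the kind of trade-off these filters implement, here is a minimal sketch of a CFS-like merit and of a greedy forward search. It only illustrates the principle, not the exact formulas of the components listed below; the correlation measure is left as a placeholder, whereas the actual methods use measures adapted to discrete attributes (e.g. symmetrical uncertainty).

import itertools
import numpy as np

def merit(subset, class_corr, feat_corr):
    # class_corr[j]: correlation between feature j and the class
    # feat_corr[i][j]: correlation between features i and j
    k = len(subset)
    r_cf = np.mean([class_corr[j] for j in subset])
    r_ff = np.mean([feat_corr[i][j] for i, j in itertools.combinations(subset, 2)]) if k > 1 else 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def forward_search(n_features, class_corr, feat_corr):
    # add, at each step, the feature that most improves the merit of the subset
    selected, best = [], -np.inf
    while len(selected) < n_features:
        score, j = max((merit(selected + [j], class_corr, feat_corr), j)
                       for j in range(n_features) if j not in selected)
        if score <= best:
            break
        selected.append(j)
        best = score
    return selected, best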

Keywords: filter, feature selection, correlation based measure, discrete predictors, naive bayes classifier, bootstrap
Components: FEATURE RANKING, CFS FILTERING, MIFS FILTERING, FCBF FILTERING, MODTREE FILTERING, NAIVE BAYES, BOOTSTRAP
Tutorial: en_Tanagra_Filter_Method_Discrete_Predictors.pdf
Dataset: vote_filter_approach.zip
References:
Tanagra, "Feature Selection"

Monday, August 30, 2010

Connecting Sipina and Excel using OLE

The connection between a data mining tool and Excel (and more generally a spreadsheet) is a very important issue. We have addressed this topic many times in our tutorials. With hindsight, I think the solution based on add-ins for Excel is the best one, both for SIPINA and for TANAGRA. It is simple, reliable and highly efficient. It does not require developing specific versions: the connection with Excel is simply an additional functionality of the standard distribution.

Before reaching this solution, we explored different trails. In this tutorial, we present the XL-SIPINA software, based on Microsoft's OLE technology. Unlike the add-in solution, this version of SIPINA embeds Excel into the data mining tool. The system works rather well. Nevertheless, it was finally dropped for two reasons: (1) we were forced to compile special versions that work only if Excel is installed on the user's machine; (2) the transfer time between Excel and Sipina using OLE becomes prohibitive when the database grows.

Thus, XL-SIPINA was essentially a short-lived attempt. There is always a bit of nostalgia when I come back to solutions I explored and finally abandoned. Perhaps I did not explore this one completely either.

Last, the application was initially developed for Office 97. I note that it is still usable today: it works fine with Office 2010.

Keywords: excel, spreadsheet, sipina, xls, xlsx, xl-sipina, decision tree induction
Download XL-SIPINA: XL-SIPINA
Tutorial: en_xls_sipina.pdf
Dataset: autos

Friday, August 27, 2010

Sipina add-in for Excel

Data importation is a bottleneck for data mining tools. The majority of users work with a spreadsheet tool such as Excel, often in conjunction with specialized data mining software (see KDnuggets polls). Therefore, a recurring question from users is: "how do I send my data from Excel to SIPINA?"

It is possible to import various file formats into SIPINA. For Excel workbooks, a specific mechanism has been implemented.

An add-in is automatically copied to the computer during the installation process; it must then be integrated into Excel. The add-in adds a new menu into Excel. After selecting the data range, the user only has to activate this menu, which leads to the following: (1) SIPINA starts automatically; (2) the data are transferred via the clipboard; (3) SIPINA considers that the first row of the range of cells corresponds to the variable names; (4) columns with numerical values become quantitative variables; (5) columns with alphanumeric values become categorical variables.

Unlike the other tutorials, the sequence of operations is described in a video. The description is valid only for versions up to Excel 2003. Another tutorial, describing the use of the add-in under Office 2007 and Office 2010, is referenced below.

Keywords: excel file format, add-in, decision tree
Installing the add-in: sipina_xla_installation.htm
Using the add-in: sipina_xla_processing.htm

Tanagra add-in for Office 2007 and Office 2010

The "tanagra.xla" add-in for Excel contributes to the wide diffusion of Tanagra. The principle is simple. It is to embed a Tanagra menu in Excel. Thus the user can run statistical calculations without having to leave the spreadsheet. It seems simplistic. But this feature facilitates immensely the work of data miner. Indeed, the spreadsheet is one of the most used tools for preparing dataset (see KDNuggets Polls: Tools / Languages for Data Cleaning - 2008). By embedding the data mining tool in the spreadsheet environment, it avoids to the practitioner the tedious and repetitive manipulations: importing the dataset, exporting the dataset, checking the compatibilities between data file formats, etc.

The installation and the use of the "tanagra.xla" add-in under the previous versions of Office (Office 97 to Office 2003) are described elsewhere. That description is obsolete for the latest versions, i.e. Office 2007 and Office 2010, because the organization of the menus has changed. And yet, the add-in is still operational. In this tutorial, we show how to install and use the Tanagra add-in under Office 2007 and 2010.

This transition to recent versions of Excel is not without consequences. Compared to the previous versions, Excel 2007 (and 2010) can handle many more rows and columns: we can process a dataset with up to 1,048,575 observations (the first row corresponds to the variable names) and 16,384 variables. In this tutorial, we handle a dataset with 100,000 observations and 22 variables (wave100k.xlsx), a version of the famous waveform database. Note that, because of its number of rows, this file cannot be handled by earlier versions of Excel.

The process described in this document is also valid for the SIPINA add-in (sipina.xla).

Keywords: data importation, excel, add-in
Components: VIEW DATASET
Tutorial: en_Tanagra_Add_In_Excel_2007_2010.pdf
Dataset: wave100k.xlsx
References:
Tanagra, "Tanagra and Sipina add-ins for Excel 2016", June 2016.
Tanagra, "Excel file handling using an add-in".
Tanagra, "OOo Calc file handling using an add-in".
Tanagra, "Launching Tanagra from OOo Calc under Linux".
Tanagra, "Sipina add-in for Excel"

Saturday, July 24, 2010

Naive bayes classifier for discrete predictors

The naive bayes approach is a supervised learning method based on a simplistic hypothesis: it assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. Yet, despite this, it proves robust and efficient, with performance comparable to other supervised learning techniques.

We introduce in Tanagra (version 1.4.36 and later) a new presentation of the results of the learning process. The classifier is easier to understand, and its deployment is also made easier.

In the first part of this tutorial, we present some theoretical aspects of the naive bayes classifier. Then, we implement the approach on a dataset with Tanagra. We compare the obtained results (the parameters of the model) to those obtained with other linear approaches such as logistic regression, linear discriminant analysis and linear SVM. We note that the results are highly consistent, which largely explains the good performance of the method compared to the others.
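
For readers who want to see where the linear form comes from, here is a minimal sketch (with Laplace smoothing; an illustration, not Tanagra's code): since log P(y=k|x) is, up to a constant, log(prior_k) plus the sum of the log conditional probabilities of the observed values, the naive bayes score is a linear combination of the 0/1 indicators "attribute = value".

from collections import Counter
import math

def fit_nb(rows, labels):
    classes = Counter(labels)
    n = len(labels)
    intercepts = {k: math.log(c / n) for k, c in classes.items()}
    weights = {k: {} for k in classes}          # weight of each (attribute, value) indicator
    for a in rows[0]:
        values = {r[a] for r in rows}
        for k in classes:
            counts = Counter(r[a] for r, y in zip(rows, labels) if y == k)
            for v in values:
                # log conditional probability with Laplace smoothing
                weights[k][(a, v)] = math.log((counts[v] + 1) / (classes[k] + len(values)))
    return intercepts, weights

def score(row, intercepts, weights):
    # linear combination over the binarized predictors
    return {k: intercepts[k] + sum(w for (a, v), w in weights[k].items() if row.get(a) == v)
            for k in intercepts}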

In the second part, we use various tools on the same dataset (Weka 3.6.0, R 2.9.2, Knime 2.1.1, Orange 2.0b and RapidMiner 4.6.0). We try above all to understand the obtained results.

Keywords: naive bayes, linear classifier, linear discriminant analysis, logistic regression, linear support vector machine, svm
Components: NAIVE BAYES, LINEAR DISCRIMINANT ANALYSIS, BINARY LOGISTIC REGRESSION, SVM, 0_1_BINARIZE
Tutorial: en_Tanagra_Naive_Bayes_Classifier_Explained.pdf
Dataset: heart_for_naive_bayes.zip
References:
Wikipedia, "Naive bayes classifier".
T. Mitchell, "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression", in Machine Learning, Chapter 1, 2005.

Wednesday, July 21, 2010

Interactive decision tree learning with Spad

In this tutorial, we are interested in SPAD, a French software package specialized in exploratory data analysis which has evolved considerably in recent years. We perform a sequence of analyses on a dataset split across three worksheets of an Excel file: (1) we create a classification tree from the learning sample in the first worksheet, we analyze some nodes of the tree in depth to highlight the characteristics of the covered instances, and we interactively (manually) modify some splitting operations; (2) we apply the classifier to the unseen cases of the second worksheet; (3) we compare the predictions of the model with the actual values of the target attribute contained in the third worksheet.

Of course, we can perform this process with free tools such as SIPINA (for the interactive construction of the tree) or R (for programming the sequence of operations, in particular applying the model to an unlabeled dataset). But with Spad or other commercial tools (e.g. SPSS Modeler, SAS Enterprise Miner, STATISTICA Data Miner…), we can very easily specify the whole sequence, even if we are not especially familiar with data mining tools.
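
The same three-step sequence can be sketched with free tools; here is a minimal Python sketch with pandas and scikit-learn, assuming an Excel file with three worksheets named "train", "unseen" and "actual" and a "diabete" target column (these names are illustrative, not those of the tutorial's file).

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.read_excel("pima.xlsx", sheet_name="train")        # (1) learning sample
unseen = pd.read_excel("pima.xlsx", sheet_name="unseen")      # (2) unseen cases
actual = pd.read_excel("pima.xlsx", sheet_name="actual")      # (3) actual target values

X, y = train.drop(columns="diabete"), train["diabete"]
tree = DecisionTreeClassifier(min_samples_leaf=20).fit(X, y)

pred = tree.predict(unseen[X.columns])
print(pd.crosstab(actual["diabete"], pred))                   # confusion matrix on the third sheet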

Keywords: decision tree, classification tree, interactive decision tree, spad, sipina, r software
Tutorial: en_Tanagra_Arbres_IDT_Spad.pdf
Dataset: pima-arbre-spad.zip
References:
SPAD, http://www.spad.eu/
SIPINA, http://eric.univ-lyon2.fr/~ricco/sipina.html
R Project, http://www.r-project.org/

Monday, July 12, 2010

Supervised learning from imbalanced dataset

In real problems, the classes are rarely equally represented in the dataset. The instances of the positive class, the one that we usually want to detect, are few: in a fraud detection problem, there are very few cases of fraud compared to the large number of honest connections; in a medical problem, ill persons are fortunately rare; etc. In these situations, using the standard learning process and assessing the classifier with the confusion matrix and the misclassification rate is not appropriate. We observe that the default classifier, which assigns all the instances to the majority class, is the one that minimizes the error rate.

For the dataset analyzed in this tutorial, 1.77% of the examples belong to the positive class. If we assign all the instances to the negative class - this is the default classifier - the misclassification rate is 1.77%. It is difficult to find a classifier able to do better, even though we know this is not a good classifier, in particular because it does not supply a degree of membership to the classes (in fact, it assigns the same degree of membership to all the instances).

A strategy that improves the behavior of learning algorithms facing the imbalance problem is to artificially balance the dataset. We can do this by eliminating some instances of the over-represented class (under-sampling) or by duplicating some instances of the small class (over-sampling). But few people analyze the consequences of this solution on the performance of the classifier.

In this tutorial, we highlight the consequences of the downsizing on the behavior of the logistic regression.
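
As a point of reference, the two balancing schemes mentioned above can be sketched in a few lines of pandas (the column and modality names below are assumptions made for the example, not those of the tutorial's file).

import pandas as pd

df = pd.read_excel("imbalanced_dataset.xls")
pos = df[df["target"] == "positive"]          # the rare class we want to detect
neg = df[df["target"] == "negative"]

# under-sampling: keep only as many negatives as there are positives
train_down = pd.concat([pos, neg.sample(n=len(pos), random_state=1)])

# over-sampling: duplicate positives until both classes have the same size
train_up = pd.concat([neg, pos.sample(n=len(neg), replace=True, random_state=1)])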

Keywords: imbalanced dataset, logistic regression, over sampling, under sampling
Components: BINARY LOGISTIC REGRESSION, DISCRETE SELECT EXAMPLES, SCORING, RECOVER EXAMPLES, ROC CURVE, TEST
Tutorial: en_Tanagra_Imbalanced_Dataset.pdf
Dataset: imbalanced_dataset.xls
References:
D. Hosmer, S. Lemeshow, "Applied Logistic Regression", John Wiley & Sons, Inc, Second Edition, 2000.

Wednesday, June 9, 2010

Handling large dataset in R - The "filehash" package

The processing of very large datasets is a crucial problem in data mining. To handle them, we must avoid loading the whole dataset into memory. The idea is quite simple: (1) we write all or part of the dataset to disk in a binary file format allowing direct access; (2) the machine learning algorithms are modified to efficiently access the values stored on the disk. Thus, the characteristics of the computer are no longer a bottleneck for handling a large dataset.

In this tutorial, we describe the great "filehash" package for R. It allows us to copy (dump) any kind of R object into a file and to handle these objects without loading them into main memory. This is especially useful for data frames: we can perform a statistical analysis with the usual functions directly from a database stored on disk. The processing capacities are vastly improved and, at the same time, we will see that the increase in computation time remains moderate.
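
For readers who do not use R, the general idea - keeping large objects on disk and reading them back only when they are accessed - can be illustrated with the standard shelve module of Python. This is only an analogy for the sake of the explanation; the filehash package has its own API on the R side.

import shelve

with shelve.open("bigdata.db") as db:         # the objects live in a file, not in RAM
    db["wave"] = list(range(2_000_000))       # dump a large object to disk

with shelve.open("bigdata.db") as db:
    head = db["wave"][:10]                    # the object is read back only when its key is accessed
    print(head)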

To evaluate the "filehash" solution, we analyze the memory occupation and the computation time, with and without utilization of the package, during the performing of decision tree learning with rpart (rpart package) and a linear discriminant analysis with lda (MASS package). We perform the same experiments using SIPINA. Indeed, it provides also a swapping system (the data is dumped from the main memory to temporary files) for the handling of very large dataset. We can then compare the performances of the various solutions.

Keywords: very large dataset, filehash, decision tree, linear discriminant analysis, sipina, C4.5, rpart, lda
Tutorial: en_Tanagra_Dealing_Very_Large_Dataset_With_R.pdf
Dataset: wave2M.txt.zip
References:
R package, "Filehash : Simple key-value database"
Yu-Sung Su's Blog, "Dealing with large dataset in R"
Tanagra Tutorial, "MapReduce with R", February 2015.
Tanagra Tutorial, "R programming under Hadoop", April 2015.

Thursday, May 27, 2010

Logistic Regression Diagnostics

This tutorial describes the implementation of tools for the diagnostic and the assessment of a logistic regression. These tools are available in Tanagra version 1.4.33 (and later).

We deal with a credit scoring problem: using logistic regression, we try to determine the factors underlying the granting or refusal of credit to customers. We perform the following steps:
- Estimating the parameters of the classifier;
- Retrieving the covariance matrix of coefficients;
- Assessment using the Hosmer and Lemeshow goodness of fit test;
- Assessment using the reliability diagram;
- Assessment using the ROC curve;
- Analysis of residuals, detection of outliers and influential points.

We first use Tanagra 1.4.33; then we perform the same analysis with the R 2.9.2 software [glm() procedure].
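
For reference, the residual quantities listed above can be computed directly from the fitted probabilities and the design matrix. Here is a minimal numpy sketch following the usual definitions (an illustration, not Tanagra's or glm()'s code; minor variants of Cook's distance exist in the literature).

import numpy as np

def logistic_diagnostics(X, y, p):
    # X: design matrix with the intercept column, y: 0/1 target, p: fitted probabilities
    p = np.clip(p, 1e-12, 1 - 1e-12)
    pearson = (y - p) / np.sqrt(p * (1 - p))
    deviance = np.sign(y - p) * np.sqrt(-2 * (y * np.log(p) + (1 - y) * np.log(1 - p)))
    W = np.diag(p * (1 - p))                                     # IRLS weights
    H = np.sqrt(W) @ X @ np.linalg.inv(X.T @ W @ X) @ X.T @ np.sqrt(W)
    leverage = np.diag(H)                                        # hat values
    k = X.shape[1]
    cook = pearson ** 2 * leverage / (k * (1 - leverage) ** 2)   # Cook's distance
    return pearson, deviance, leverage, cook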

Keywords: logistic regression, residual analysis, outliers, influential points, pearson residual, deviance residual, leverage, cook's distance, dfbeta, dfbetas, hosmer-lemeshow goodness of fit test, reliability diagram, calibration plot, glm()
Components: BINARY LOGISTIC REGRESSION, HOSMER LEMESHOW TEST, RELIABILITY DIAGRAM, LOGISTIC REGRESSION RESIDUALS
Tutorial: en_Tanagra_Logistic_Regression_Diagnostics.pdf
Dataset: logistic_regression_diagnostics.zip
References:
D. Garson, "Logistic Regression"
D. Hosmer, S. Lemeshow, "Applied Logistic Regression", John Wiley & Sons, Inc, Second Edition, 2000.

Friday, May 21, 2010

Discretization of continuous features

Discretization transforms a continuous attribute into a discrete one. To do that, it partitions the range of values into a set of intervals by defining a set of cut points. Thus, we must answer two questions to carry out this data transformation: (1) how to determine the right number of intervals; (2) how to compute the cut points. These questions are not necessarily resolved in that order.

The best discretization is the one performed by a domain expert. Indeed, the expert takes into account information beyond what the available dataset alone provides. Unfortunately, this kind of approach is not always feasible: often, the domain knowledge is not available or does not allow us to determine the appropriate discretization, and the process cannot be automated to handle a large number of attributes. So, we are often forced to base the determination of the best discretization on a numerical process.

Discretization of continuous features is considered here as a preprocessing step for the supervised learning process. First, we must define the context in which we perform the transformation: depending on the circumstances, the process and criteria used will not be the same. In this tutorial, we are in the supervised learning framework. We perform the discretization prior to the learning process, i.e. we transform the continuous predictive attributes into discrete ones before presenting them to a supervised learning algorithm. In this context, building intervals in which one and only one of the values of the target attribute is predominant is desirable. The relevance of the computed solution is usually evaluated through an impurity-based or entropy-based function.

In this tutorial, we use only univariate approaches. We compare the behavior of the supervised and the unsupervised algorithms on an artificial dataset. We use several tools for that: Tanagra 1.4.35, Sipina 3.3, R 2.9.2 (package dprep), Weka 3.6.0, Knime 2.1.1, Orange 2.0b and RapidMiner 4.6.0. We highlight the settings of the algorithms and how to read the results.
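
As a quick illustration, the two unsupervised schemes (equal-width and equal-frequency intervals) can each be obtained in one line with pandas; supervised methods such as MDLPC additionally use the class attribute to place the cut points. A minimal sketch:

import numpy as np
import pandas as pd

x = pd.Series(np.random.default_rng(0).normal(size=500))

equal_width = pd.cut(x, bins=4)     # 4 intervals of identical width
equal_freq = pd.qcut(x, q=4)        # 4 intervals containing (roughly) the same number of values

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())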

Keywords: mdlpc, discretization, supervised learning, equal frequency intervals, equal width intervals
Components: MDLPC, Supervised Learning, Decision List
Tutorial: en_Tanagra_Discretization_for_Supervised_Learning.pdf
Dataset: data-discretization.arff
References:
F. Muhlenbach, R. Rakotomalala, "Discretization of Continuous Attributes", in Encyclopedia of Data Warehousing and Mining, John Wang (Ed.), pp. 397-402, 2005 (http://hal.archives-ouvertes.fr/hal-00383757/fr/).
Tanagra Tutorial, "Discretization and Naive Bayes Classifier"

Sunday, May 16, 2010

Sipina Decision Graph Algorithm (case study)

SIPINA is a data mining tool, but it is also a machine learning method: it corresponds to an algorithm for the induction of decision graphs (see References, section 9). A decision graph is a generalization of a decision tree in which we can merge any two terminal nodes of the graph, and not only leaves issued from the same node.

The SIPINA method is only available in version 2.5 of the SIPINA data mining tool. This version has some drawbacks; among others, it cannot handle large datasets (more than 16,383 instances). But it is the only tool that implements the decision graph algorithm, and this is the main reason why this version is still available online. If we want to use a decision tree algorithm such as C4.5 or CHAID, or to build a decision tree interactively, it is more advantageous to use the research version (also called version 3.0), which is more powerful and supplies many more functionalities for data exploration.

In this tutorial, we show how to implement the Sipina decision graph algorithm with Sipina version 2.5. We want to predict the low birth weight of newborns from the characteristics of their mothers. Above all, we want to show how to use this 2.5 version, which is not well documented. We also want to point out the interest of decision graphs when we treat a small dataset, i.e. when data fragmentation becomes a crucial problem.

Keywords: decision graphs, decision trees, sipina version 2.5
Tutorial: en_sipina_method.pdf
Dataset: low_birth_weight_v4.xls
References:
Wikipedia, "Decision tree learning"
J. Oliver, "Decision Graphs: An Extension of Decision Trees", in Proc. of Int. Conf. on Artificial Intelligence and Statistics, 1993.
R. Rakotomalala, "Graphes d'induction", PhD Dissertation, University Lyon 1, 1997 (URL: http://eric.univ-lyon2.fr/~ricco/publications.html; in French).
D. Zighed, R. Rakotomalala, "Graphes d'induction : Apprentissage et Data Mining", Hermes, 2000 (in French).

Friday, May 14, 2010

User's guide for the old Sipina 2.5 version

SIPINA has a long history. Before the current version (version 3.3, May 2010), we distributed a data mining tool dedicated exclusively to the induction of decision graphs, a generalization of decision trees. Of course, state-of-the-art decision tree algorithms are also included (such as C4.5 and CHAID).

This version, called 2.5, has been online since 1995. Its development was suspended in 1998 when I started programming version 3.0.

This version 2.5 is the only free tool that implements the decision graph algorithm. It is a real curiosity in this respect. This is the reason why I still distribute it to date.

On the other hand, this 2.5 version has some severe limitations. Among others, it can only handle small datasets, up to 16,380 instances. If you want to build a decision tree or to handle a large dataset, it is always advisable to use the current version (version 3.0 and later).

Setup of the old 2.5 version: Setup_Sipina_V25.exe
User's guide: EnglishDocSipinaV25.pdf
References:
J. Oliver, "Decision Graphs - An Extension of Decision Trees", in Proc. Of the 4-th Int. workshop on Artificial Intelligence and Statistics, pages 343-350, 1993.
R. Rakotomalala, "Induction Graphs", PhD Thesis, University of Lyon 1, 1997 (in French).
D. Zighed, R. Rakotomalala, "Graphes d'Induction - Apprentissage et Data Mining", Hermes, 2000 (in French).

Monday, May 10, 2010

Solutions for multicollinearity in multiple regression

Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated. In this situation the coefficient estimates may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole; it only affects calculations regarding individual predictors. That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with others (Wikipedia). Sometimes the signs of the coefficients are inconsistent with the domain knowledge; sometimes, explanatory variables which seem individually significant are invalidated when we add other variables.

There are two steps when we want to treat this kind of problem: (1) detecting the presence of collinearity; (2) implementing solutions in order to obtain more consistent results.
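
For the detection step, a commonly used indicator is the variance inflation factor (VIF): the VIF of a predictor is 1 / (1 - R²) of the regression of that predictor on all the others, and values above about 10 are usually read as a sign of harmful collinearity. A minimal scikit-learn sketch (an illustration, not the exact procedure followed in the tutorial):

import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    # X: 2D array of predictors (without the intercept column)
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        factors.append(1.0 / (1.0 - r2))
    return factors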

In this tutorial, we study three approaches to cope with the multicollinearity problem: variable selection; regression on the latent variables provided by PCA (principal component analysis); and PLS regression (partial least squares).

Keywords: linear regression, multiple regression, collinearity, multicollinearity, principal component analysis, PCA, PLS regression
Components: Multiple linear regression, Linear Correlation, Forward Entry Regression, Principal Component Analysis, PLS Regression, PLS Selection, PLS Conf. Interval
Tutorial: en_Tanagra_Regression_Colinearity.pdf
Dataset: car_consumption_colinearity_regression.xls
References:
Wikipedia, "Multicollinearity"

Monday, April 26, 2010

Linear discriminant analysis on PCA factors

In this tutorial, we show that in certain circumstances, it is more convenient to use the factors computed from a principal component analysis (from the original attributes) as input features for the linear discriminant analysis algorithm.

The new representation space maintains the proximity between the examples. The new features, known as "factors" or "latent variables", which are linear combinations of the original descriptors, have several advantageous properties: (a) their interpretation very often allows us to detect patterns in the initial space; (b) a very small number of factors is enough to restore the information contained in the data, and we can moreover remove noise from the dataset by using only the most relevant factors (a sort of regularization by smoothing the information provided by the dataset); (c) the new features form an orthogonal basis, so learning algorithms such as linear discriminant analysis behave better.

This approach is related to reduced-rank linear discriminant analysis. But, unlike the latter, the class information is not needed during the computation of the principal components. The computation can be very fast on very high-dimensional datasets if an appropriate algorithm (such as NIPALS) is used. On the other hand, the standard reduced-rank LDA tends to be better in terms of classification accuracy.
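
With scikit-learn, the whole approach fits in a short pipeline. A minimal sketch (the dataset and the number of components kept are illustrative, the latter being precisely the parameter to tune):

from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# the examples are projected on a few principal components,
# then the linear discriminant analysis is learned on these factors
model = make_pipeline(PCA(n_components=5), LinearDiscriminantAnalysis())
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))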

Keywords: linear discriminant analysis, principal component analysis, reduced-rank linear discriminant analysis
Components: Supervised Learning, Linear discriminant analysis, Principal Component Analysis, Scatterplot, Train-test
Tutorial: en_dr_utiliser_axes_factoriels_descripteurs.pdf
Dataset: dr_waveform.bdm
References:
Wikipedia, "Linear discriminant analysis".

Thursday, April 22, 2010

Induction of fuzzy rules using Knime

This tutorial is the continuation of the one devoted to the induction of decision rules (Supervised rule induction - Software comparison). I did not include Knime in that comparison because it implements a method which differs from the other tools: Knime computes fuzzy rules and requires the target variable to be numeric. That seems rather mysterious in the supervised learning context, where the class attribute is usually discrete. I thought it was more appropriate to detail the implementation of the method in a tutorial exclusively devoted to the Knime rule learner (version 2.1.1).

In particular, it is important to explain the reasons for the data preparation and how to read the results. To have a reference, we compare the results with those provided by the rule induction tool of Tanagra.

Scientific papers about the method are available on line.

Keywords: induction of rules, supervised learning, fuzzy rules
Components: SAMPLING, RULE INDUCTION, TEST
Tutorial: en_Tanagra_Induction_Regles_Floues_Knime.pdf
Dataset: iris2D.txt
References:
M.R. Berthold, "Mixed fuzzy rule formation", International Journal of Approximate Reasoning, 32, pp. 67-84, 2003.
T.R. Gabriel, M.R. Berthold, "Influence of fuzzy norms and other heuristics on mixed fuzzy rule formation", International Journal of Approximate Reasoning, 35, pp. 195-202, 2004.

Friday, April 16, 2010

"Wrapper" for feature selection (continuation)

This tutorial is the continuation of the preceding one about wrapper feature selection in the supervised learning context (http://data-mining-tutorials.blogspot.com/2010/03/wrapper-for-feature-selection.html). We analyzed the behavior of Sipina, and we described the source code of the wrapper process (forward search) under R (http://www.r-project.org/). Now, we show how to use the same principle under Knime 2.1.1, Weka 3.6.0 and RapidMiner 4.6.

The approach is as follows: (1) we use the training set to select the most relevant variables for classification; (2) we learn the model on the selected descriptors; (3) we assess its performance on a test set containing all the descriptors.

This third point is very important. We cannot know in advance which variables will finally be selected, so we should not have to manually prepare a test file containing only those selected by the wrapper procedure. This is essential for the automation of the process: otherwise, each change of setting in the wrapper procedure, leading to another subset of descriptors, would require us to manually edit the test file, which is very tedious.

In the light of this specification, it appeared that only Knime was able to implement the complete process. With the other tools, it is possible to select the relevant variables on the training file, but I could not (or did not know how to) apply the model to a test file containing all the original variables.

The naive bayes classifier is the learning method used in this tutorial.

Keywords: feature selection, supervised learning, naive bayes classifier, wrapper, knime, weka, rapidminer
Tutorial: en_Tanagra_Wrapper_Continued.pdf
Dataset: mushroom_wrapper.zip
References:
JMLR Special Issue on Variable and Feature Selection - 2003
R. Kohavi, G. John, "The wrapper approach", 1997.
Wikipedia, "Naive bayes classifier".

Tuesday, March 30, 2010

"Wrapper" for feature selection

Feature selection is a crucial aspect of the supervised learning process. We must determine the variables relevant for the prediction of the target attribute. Indeed, a simpler model is easier to understand and interpret; deployment is facilitated, since less information needs to be collected for prediction; finally, a simpler model is often more robust in generalization, i.e. when classifying an unseen instance from the population.

Three kinds of approaches are often highlighted in the literature. Among them, the WRAPPER approach explicitly uses a performance criterion during the search for the best subset of descriptors. Most often, this is the error rate, but in reality any kind of criterion can be used: the cost if we use a misclassification cost matrix, the area under curve (AUC) when we assess the classifier using ROC curves, etc. In this approach, the learning method is considered as a black box: we try various subsets of predictors and we choose the one that optimizes the criterion.

In this tutorial, we implement the WRAPPER approach with SIPINA and R 2.9.2. For the latter, we give the source code for a forward search strategy. Readers can easily adapt the program to other datasets. Moreover, a careful reading of the R source code gives a better understanding of the calculations made internally by SIPINA.
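
The same forward search can be sketched in a few lines of Python with scikit-learn (an illustration of the principle, not the R code given in the tutorial; the Gaussian naive bayes and the 5-fold cross-validated accuracy used as criterion are assumptions made for the example):

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def wrapper_forward(X, y, cv=5):
    remaining = list(range(X.shape[1]))
    selected, best = [], -np.inf
    while remaining:
        # try to add each remaining attribute and keep the best candidate subset
        scores = [(cross_val_score(GaussianNB(), X[:, selected + [j]], y, cv=cv).mean(), j)
                  for j in remaining]
        score, j = max(scores)
        if score <= best:
            break                     # no addition improves the criterion: stop
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected, best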

The WRAPPER strategy is a priori the best since it explicitly optimizes the performance criterion. We verify this by comparing the results with those provided by the FILTER approach (FCBF method) available in TANAGRA. The conclusions are not as obvious as one might think.

Keywords: feature selection, supervised learning, naive bayes classifier, wrapper, fcbf, sipina, R software, RWeka package
Components: DISCRETE SELECT EXAMPLES, FCBF FILTERING, NAIVE BAYES, TEST
Tutorial: en_Tanagra_Sipina_Wrapper.pdf
Dataset: mushroom_wrapper.zip
References:
JMLR Special Issue on Variable and Feature Selection - 2003
R. Kohavi, G. John, "The wrapper approach", 1997.

Tuesday, March 23, 2010

Tanagra - Version 1.4.36

ReliefF is a component for automatic variable selection in a supervised learning task. It can handle both continuous and discrete descriptors. It can be inserted before any supervised method.

Naive Bayes was modified. It now describes the prediction model in an explicit form (as a linear combination), easy to understand and to deploy.

Thursday, February 11, 2010

Supervised rule induction - Software comparison

Supervised rule induction methods play an important role in the Data Mining framework. Indeed, they provide an easy-to-understand classifier. A rule uses the following representation: "IF premise THEN conclusion" (e.g. IF an account problem is reported on a client THEN the credit is not accepted).

Among the rule induction methods, the "separate and conquer" approaches were very popular during the 1990s. Curiously, they are less present today in proceedings and journals. More troublesome still, they are not implemented in commercial software; they are only available in free tools from the Machine Learning community. Yet they have several advantages compared to other techniques.

In this tutorial, we first describe two separate-and-conquer algorithms for rule induction. Then, we show the behavior of the classification rule algorithms implemented in various tools: Tanagra 1.4.34, Sipina Research 3.3, Weka 3.6.0, R 2.9.2 with the RWeka package, RapidMiner 4.6, and Orange 2.0b.
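
To fix ideas about the separate-and-conquer principle, here is a deliberately simplified sketch for a binary problem with discrete attributes (a toy cousin of CN2, not one of the algorithms compared in the tutorial): a rule is grown greedily by adding the condition that most improves its precision, the covered examples are then removed, and the process starts again on the remaining ones.

def covers(rule, row):
    return all(row.get(a) == v for a, v in rule)

def precision(rule, rows, labels, positive):
    covered = [y for row, y in zip(rows, labels) if covers(rule, row)]
    return sum(y == positive for y in covered) / len(covered) if covered else 0.0

def grow_rule(rows, labels, positive):
    # "conquer": add conditions while they improve the precision of the rule
    rule = []
    while True:
        candidates = {(a, v) for row in rows for a, v in row.items()} - set(rule)
        best, best_prec = None, precision(rule, rows, labels, positive)
        for cond in candidates:
            prec = precision(rule + [cond], rows, labels, positive)
            if prec > best_prec:
                best, best_prec = cond, prec
        if best is None:
            return rule
        rule.append(best)

def separate_and_conquer(rows, labels, positive):
    rules = []
    while any(y == positive for y in labels):
        rule = grow_rule(rows, labels, positive)
        rules.append(rule)
        # "separate": remove the examples covered by the rule just learned
        keep = [not covers(rule, row) for row in rows]
        rows = [r for r, k in zip(rows, keep) if k]
        labels = [y for y, k in zip(labels, keep) if k]
    return rules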

Keywords: rule induction, separate and conquer, top-down, CN2, decision tree
Components: SAMPLING, DECISION LIST, RULE INDUCTION, TEST
Tutorial: en_Tanagra_Rule_Induction.pdf
Dataset: life_insurance.zip
References:
J. Furnkranz, "Separate-and-conquer Rule Learning", Artificial Intelligence Review, Volume 13, Issue 1, pages 3-54, 1999.
P. Clark, T. Niblett, "The CN2 Rule Induction Algorithm", Machine Learning, 3(4):261-283, 1989.
P. Clark, R. Boswell, "Rule Induction with CN2: Some recent improvements", Machine Learning - EWSL-91, pages 151-163, Springer Verlag, 1991.

Tuesday, January 19, 2010

Tanagra - Version 1.4.35

CTP. The method for detecting the right size of the tree has been modified in the "Clustering Tree with post-pruning" component (CTP). It relies both on the angle between half-lines at each point of the curve of the decrease of the WSS (within-group sum of squares) on the growing sample, and on the decrease of the same indicator computed on the pruning sample. Compared to the previous implementation, it results in a smaller number of clusters.

Regression Tree. The previous modification is incorporated into the Regression Tree component, which is a univariate version of CTP.

C-RT Regression Tree. A new regression tree component has been added. It faithfully implements the technique described in the book by Breiman et al. (1984), including the post-pruning part with the 1-SE rule (Chapter 8, especially p. 226 on the formula for the variance of the MSE).

C-RT. The report of the C-RT decision tree induction has been completed. Based on the last column of the post-pruning table, it becomes easier to choose the parameter x (in the x-SE rule) used to define the size of the pruned tree.

Some tutorials will describe these various changes soon.

Monday, January 4, 2010

Dealing with very large dataset in Sipina

The ability to handle large databases is a crucial problem in the data mining context: we want to process a large dataset in order to detect hidden information. Most free data mining tools have problems with large datasets because they load all the instances and variables into memory; the limitation of these tools is thus the available memory.

To overcome this limitation, we should design solutions that copy all or part of the data to disk and perform the treatments by loading into memory only what is necessary at each step of the algorithm (the instances and/or the variables). If the solution is theoretically simple, it is difficult in practice: the processing time should remain reasonable even though we increase the disk accesses. It is very difficult to implement a strategy that is effective regardless of the learning algorithm used (supervised learning, clustering, factorial analysis, etc.), since they handle the data in very different ways: some of them make intensive use of matrix operations, others mainly count the co-occurrences between attribute-value pairs, etc.

In this tutorial, we present a specific solution in the decision tree induction context. The solution is integrated into SIPINA (as an option) because its internal data structure is especially intended for decision tree induction; developing an approach that takes advantage of the specificities of the learning algorithm was easy in this context. We show that it is then possible to handle a very large dataset (41 variables and 9,634,198 observations) and to use all the functionalities of the tool (interactive construction of the tree, local descriptive statistics on nodes, etc.).

To fully appreciate the solution proposed by Sipina, we compare its behavior to generalist data mining tools such as Tanagra 1.4.33 or Knime 2.03.

Keywords: very large dataset, decision tree, sampling, sipina, knime
Components: ID3
Tutorial: en_Sipina_Large_Dataset.pdf
Dataset: twice-kdd-cup-discretized-descriptors.zip
References:
Tanagra, "Decision tree and large dataset".
Tanagra, "Local sampling for decision tree learning".

Saturday, January 2, 2010

CART - Determining the right size of the tree

Determining the appropriate size of the tree is a crucial task in the decision tree learning process. It determines the performance of the tree during its deployment in the population (the generalization process). There are two situations to avoid: the under-sized tree, too small, which poorly captures the relevant information in the training set; and the over-sized tree, which captures specificities of the training set that are not relevant to the population. In both cases, the prediction model performs poorly during the generalization phase.

Among the many variants of decision tree learning algorithms, CART is probably the one that best detects the right size of the tree.

In this tutorial, we describe the selection mechanism used by CART during the post-pruning process. We also show how to set the appropriate value of the algorithm's parameter in order to obtain a specific (user-defined) tree.
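
With scikit-learn, an approximation of this post-pruning procedure can be written with cost-complexity pruning and the 1-SE rule: cross-validate each value of the complexity parameter alpha, then keep the largest alpha (hence the smallest tree) whose error remains within one standard error of the minimum. A minimal sketch (the dataset, the 10-fold cross-validation and the way the standard error is estimated are assumptions of the example, not the exact procedure of the tutorial):

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas

errors, stderrs = [], []
for a in alphas:
    scores = 1 - cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=10)
    errors.append(scores.mean())
    stderrs.append(scores.std(ddof=1) / np.sqrt(len(scores)))

best = int(np.argmin(errors))
threshold = errors[best] + stderrs[best]                      # 1-SE threshold
alpha_1se = max(a for a, e in zip(alphas, errors) if e <= threshold)
pruned = DecisionTreeClassifier(ccp_alpha=alpha_1se, random_state=0).fit(X, y)
print("selected alpha:", alpha_1se, "number of leaves:", pruned.get_n_leaves())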

Keywords: decision tree, CART, 1-SE Rule, post-pruning
Components: Discrete select examples, Supervised Learning, C-RT, Test
Tutorial: en_Tanagra_Tree_Post_Pruning.pdf
Dataset: adult_cart_decision_trees.zip
References:
L. Breiman, J. Friedman, R. Olshen, C. Stone, "Classification and Regression Trees", California: Wadsworth International, 1984.
R. Rakotomalala, "Arbres de décision", Revue Modulad, 33, 163-187, 2005 (tutoriel_arbre_revue_modulad_33.pdf).