Thursday, May 27, 2010

Logistic Regression Diagnostics

This tutorial describes the tools for the diagnosis and assessment of a logistic regression. These tools are available in Tanagra version 1.4.33 (and later).

We deal with a credit scoring problem. Using logistic regression, we try to determine the factors underlying the granting or refusal of credit to customers. We perform the following steps:
- Estimating the parameters of the classifier;
- Retrieving the covariance matrix of coefficients;
- Assessment using the Hosmer and Lemeshow goodness of fit test;
- Assessment using the reliability diagram;
- Assessment using the ROC curve;
- Analysis of residuals, detection of outliers and influential points.

First, we use Tanagra 1.4.33; then we perform the same analysis with the R 2.9.2 software [glm() procedure].
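
To give a flavor of the R side of the analysis, here is a minimal sketch of these diagnostics around glm(); the file name and the variables (acceptation, age, income) are hypothetical placeholders, not the actual tutorial dataset, and the target is assumed to be coded 0/1.

credit <- read.table("credit.txt", header = TRUE)

# (1) estimating the parameters of the classifier
model <- glm(acceptation ~ age + income, data = credit, family = binomial)
summary(model)

# (2) covariance matrix of the coefficients
vcov(model)

# (3) Hosmer-Lemeshow statistic, computed by hand from the deciles
# of the predicted probabilities (it is not built into glm)
p <- fitted(model)
grp <- cut(p, breaks = quantile(p, probs = seq(0, 1, 0.1)), include.lowest = TRUE)
obs <- tapply(credit$acceptation, grp, sum)   # observed positives per group
e <- tapply(p, grp, sum)                      # expected positives per group
n <- tapply(p, grp, length)
hl <- sum((obs - e)^2 / (e * (1 - e / n)))
pchisq(hl, df = 10 - 2, lower.tail = FALSE)   # approximate p-value

# (4) residuals, leverage and influence measures
r.pearson <- residuals(model, type = "pearson")
r.deviance <- residuals(model, type = "deviance")
h <- hatvalues(model)
cook <- cooks.distance(model)
dfb <- dfbetas(model)
which(h > 2 * mean(h))   # e.g. flag points with leverage above twice the average

In Tanagra, the HOSMER LEMESHOW TEST and LOGISTIC REGRESSION RESIDUALS components cover the same ground.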

Keywords: logistic regression, residual analysis, outliers, influential points, pearson residual, deviance residual, leverage, cook's distance, dfbeta, dfbetas, hosmer-lemeshow goodness of fit test, reliability diagram, calibration plot, glm()
Components: BINARY LOGISTIC REGRESSION, HOSMER LEMESHOW TEST, RELIABILITY DIAGRAM, LOGISTIC REGRESSION RESIDUALS
Tutorial: en_Tanagra_Logistic_Regression_Diagnostics.pdf
Dataset: logistic_regression_diagnostics.zip
References:
D. Garson, "Logistic Regression".
D. Hosmer, S. Lemeshow, "Applied Logistic Regression", John Wiley & Sons, Second Edition, 2000.

Friday, May 21, 2010

Discretization of continuous features

Discretization transforms a continuous attribute into a discrete one. To do so, it partitions the range of values into a set of intervals by defining a set of cut points. Thus we must answer two questions to carry out this data transformation: (1) how to determine the right number of intervals; (2) how to compute the cut points. These questions are not necessarily resolved in that order.
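
To make these two questions concrete, here is a small base-R sketch of the two classic unsupervised answers, equal-width and equal-frequency intervals; the attribute x is a hypothetical stand-in for a continuous column of a dataset.

x <- rnorm(100)   # hypothetical continuous attribute
k <- 4            # the chosen number of intervals

# equal-width: cut points evenly spaced over the range of x
x.width <- cut(x, breaks = k)

# equal-frequency: cut points at the quantiles, so each interval
# contains (roughly) the same number of instances
cuts <- quantile(x, probs = seq(0, 1, length.out = k + 1))
x.freq <- cut(x, breaks = cuts, include.lowest = TRUE)

table(x.width)
table(x.freq)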

The best discretization is the one performed by a domain expert. Indeed, the expert takes into account information beyond what the available dataset provides. Unfortunately, this kind of approach is not always feasible: often, the domain knowledge is not available or does not allow us to determine the appropriate discretization; and the process cannot be automated to handle a large number of attributes. So, we are often forced to base the determination of the best discretization on a numerical process.

Discretization of continuous features as preprocessing for supervised learning. First, we must define the context in which we perform the transformation. Depending on the circumstances, the process and criteria used will clearly not be the same. In this tutorial, we are in the supervised learning framework. We perform the discretization prior to the learning process, i.e. we transform the continuous predictive attributes into discrete ones before presenting them to a supervised learning algorithm. In this context, it is desirable to construct intervals in which one and only one value of the target attribute is predominant. The relevance of the computed solution is often evaluated through an impurity-based or entropy-based function.
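
As an illustration of such a criterion (a hypothetical sketch, not Tanagra's actual code), the following functions compute the class entropy within each interval of a candidate discretization, weighted by the interval sizes; supervised methods such as MDLPC search for the cut points which minimize this kind of quantity.

# Shannon entropy of a class distribution
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(ifelse(p > 0, p * log2(p), 0))
}

# weighted mean entropy of the target y within the intervals of the
# discretized attribute x.disc (lower is better; 0 means pure intervals)
interval.entropy <- function(x.disc, y) {
  w <- table(x.disc) / length(x.disc)
  e <- tapply(y, x.disc, entropy)
  sum(w * e, na.rm = TRUE)
}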

In this tutorial, we use only univariate approaches. We compare the behavior of supervised and unsupervised algorithms on an artificial dataset. We use several tools for that: Tanagra 1.4.35, Sipina 3.3, R 2.9.2 (package dprep), Weka 3.6.0, Knime 2.1.1, Orange 2.0b and RapidMiner 4.6.0. We highlight the settings of the algorithms and how to read the results.

Keywords: mdlpc, discretization, supervised learning, equal frequency intervals, equal width intervals
Components: MDLPC, Supervised Learning, Decision List
Tutorial: en_Tanagra_Discretization_for_Supervised_Learning.pdf
Dataset: data-discretization.arff
References:
F. Muhlenbach, R. Rakotomalala, "Discretization of Continuous Attributes", in Encyclopedia of Data Warehousing and Mining, John Wang (Ed.), pp. 397-402, 2005 (http://hal.archives-ouvertes.fr/hal-00383757/fr/).
Tanagra Tutorial, "Discretization and Naive Bayes Classifier"

Sunday, May 16, 2010

Sipina Decision Graph Algorithm (case study)

SIPINA is a data mining tool, but it is also a machine learning method: an algorithm for the induction of decision graphs (see the references below). A decision graph is a generalization of a decision tree in which we can merge any two terminal nodes of the graph, and not only leaves issued from the same node.

The SIPINA method is only available in version 2.5 of the SIPINA data mining tool. This version has some drawbacks; among others, it cannot handle large datasets (more than 16,383 instances). But it is the only tool which implements the decision graph algorithm, and this is the main reason why this version is still available online. If we want to implement a decision tree algorithm such as C4.5 or CHAID, or if we want to build a decision tree interactively, it is more advantageous to use the research version (also named version 3.0). The research version is more powerful and supplies many more features for data exploration.

In this tutorial, we show how to implement the SIPINA decision graph algorithm with version 2.5 of the SIPINA software. We want to predict the low birth weight of newborns from the characteristics of their mothers. Above all, we want to show how to use this 2.5 version, which is not well documented. We also want to point out the interest of decision graphs when we treat a small dataset, i.e. when data fragmentation becomes a crucial problem.

Keywords: decision graphs, decision trees, sipina version 2.5
Tutorial: en_sipina_method.pdf
Dataset: low_birth_weight_v4.xls
References:
Wikipedia, "Decision tree learning"
J. Oliver, "Decision Graphs: An Extension of Decision Trees", in Proc. of the Int. Conf. on Artificial Intelligence and Statistics, 1993.
R. Rakotomalala, "Graphes d'induction", PhD Dissertation, University Lyon 1, 1997 (http://eric.univ-lyon2.fr/~ricco/publications.html; in French).
D. Zighed, R. Rakotomalala, "Graphes d'induction : Apprentissage et Data Mining", Hermes, 2000 (in French).

Friday, May 14, 2010

User's guide for the old Sipina 2.5 version

SIPINA has a long history. Before the current version (version 3.3, May 2010), we distributed a data mining tool dedicated exclusively to the induction of decision graphs, a generalization of decision trees. Of course, state-of-the-art decision tree algorithms are also included (such as C4.5 and CHAID).

This version, called 2.5, has been online since 1995. Its development was suspended in 1998 when I started programming version 3.0.

Version 2.5 is the only free tool which implements the decision graph algorithm; it is a real curiosity in this respect. This is the reason why I still distribute it today.

On the other hand, this 2.5 version has some severe limitations. Among others, it can only handle small datasets (up to 16,380 instances). If you want to build a decision tree or handle a large dataset, it is always advisable to use the current version (version 3.0 and later).

Setup of the old 2.5 version: Setup_Sipina_V25.exe
User's guide: EnglishDocSipinaV25.pdf
References:
J. Oliver, "Decision Graphs - An Extension of Decision Trees", in Proc. Of the 4-th Int. workshop on Artificial Intelligence and Statistics, pages 343-350, 1993.
R. Rakotomalala, "Induction Graphs", PhD Thesis, University of Lyon 1, 1997 (in French).
D. Zighed, R. Rakotomalala, "Graphes d'Induction - Apprentissage et Data Mining", Hermes, 2000 (in French).

Monday, May 10, 2010

Solutions for multicollinearity in multiple regression

Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated. In this situation the coefficient estimates may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole; it only affects calculations regarding individual predictors. That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with others (Wikipedia). Sometimes the signs of the coefficients are inconsistent with the domain knowledge; sometimes, explanatory variables which seem individually significant are invalidated when we add other variables.

There are two steps when we want to treat this kind of problem: (1) detecting the presence of collinearity; (2) implementing solutions in order to obtain more consistent results.
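
For step (1), a standard detection device is the variance inflation factor (VIF): each predictor is regressed on all the others, and VIF = 1/(1 - R²) measures how much its variance is inflated by the collinearity. A minimal base-R sketch, where X is a hypothetical data frame containing only the explanatory variables:

# VIF of each predictor: regress column j on the other columns of X
vif <- function(X) {
  sapply(seq_along(X), function(j) {
    r2 <- summary(lm(X[[j]] ~ ., data = X[-j]))$r.squared
    1 / (1 - r2)
  })
}
# values well above, say, 4 (or 10) signal a collinearity problem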

In this tutorial, we study three approaches to overcome the multicollinearity problem: variable selection; regression on the latent variables provided by PCA (principal component analysis); and PLS (partial least squares) regression.
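
As a glimpse of the second approach, here is a sketch of regression on principal components in base R, using the built-in mtcars data as a stand-in for the tutorial's car consumption file (the choice of predictors and of k is purely illustrative):

# replace the correlated predictors by their first k principal
# components, which are orthogonal by construction
X <- scale(mtcars[, c("disp", "hp", "wt")])   # standardized predictors
y <- mtcars$mpg

pca <- prcomp(X)
k <- 2                                        # number of components kept
Z <- as.data.frame(pca$x[, 1:k])

pcr <- lm(y ~ ., data = Z)                    # regression on the latent variables
summary(pcr)

Note that the components are computed without looking at y; PLS regression, the third approach, builds latent variables which also take the target into account.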

Keywords: linear regression, multiple regression, collinearity, multicollinearity, principal component analysis, PCA, PLS regression
Components: Multiple linear regression, Linear Correlation, Forward Entry Regression, Principal Component Analysis, PLS Regression, PLS Selection, PLS Conf. Interval
Tutorial: en_Tanagra_Regression_Colinearity.pdf
Dataset: car_consumption_colinearity_regression.xls
References:
Wikipedia, "Multicollinearity"