Monday, November 5, 2012
Linear discriminant analysis is a popular method in statistics, machine learning and pattern recognition. Indeed, it has interesting properties: it is rather fast on large databases; it naturally handles multi-class problems (a target attribute with more than 2 values); it generates a linear classifier that is easy to interpret; it is robust and fairly stable, even on small databases; it has an embedded variable selection mechanism. Personally, I appreciate linear discriminant analysis because it admits multiple interpretations (probabilistic, geometric), and thus highlights various aspects of supervised learning.
In this tutorial, we highlight the similarities and the differences between the outputs of Tanagra, R (MASS and klaR packages), SAS and SPSS. The main conclusion is that, although the presentation is not always the same, we ultimately obtain exactly the same results. This is what matters most.
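As a hedged illustration of the R side of the comparison, here is a minimal sketch using the MASS and klaR packages; since the variables of the alcohol dataset are not listed in this post, the built-in iris data stands in for it.

library(MASS)   # provides lda()
library(klaR)   # provides greedy.wilks()

# Stepwise variable selection based on Wilks' lambda
selection <- greedy.wilks(Species ~ ., data = iris, niveau = 0.05)
print(selection)

# Linear discriminant analysis on the selected variables
model <- lda(selection$formula, data = iris)

# Confusion matrix and resubstitution error rate
pred <- predict(model, iris)$class
conf <- table(observed = iris$Species, predicted = pred)
print(conf)
cat("Resubstitution error rate:", 1 - sum(diag(conf)) / sum(conf), "\n")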
Keywords: linear discriminant analysis, predictive discriminant analysis, canonical discriminant analysis, variable selection, feature selection, sas, stepdisc, candisc, R software, xlsx package, MASS package, lda, klaR package, greedy.wilks, confusion matrix, resubstitution error rate
Components: LINEAR DISCRIMINANT ANALYSIS, CANONICAL DISCRIMINANT ANALYSIS, STEPDISC
Tutorial: en_Tanagra_LDA_Comparisons.pdf
Dataset: alcohol
References:
Wikipedia - "Linear Discriminant Analysis"
Monday, October 29, 2012
Handling missing values in prediction process
The treatment of missing values during the learning process has received a lot of attention from researchers. We have published a tutorial about this in the context of logistic regression induction. By contrast, the handling of missing values during the classification process, i.e. when we apply the classifier to an unlabeled instance, is less studied. However, the problem is important. Indeed, the model is designed to work only when the instance to label is fully described. If some values are not available, we cannot apply the model directly. We need a strategy to overcome this difficulty.
In this tutorial, we are in the supervised learning context. The classifier is a logistic regression model. All the descriptors are continuous. We want to evaluate, on various datasets from the UCI repository, the behavior of two imputation methods: the univariate approach and the multivariate approach. The constraint is that the imputation models must rely only on information from the learning sample. We assume that the learning sample itself does not contain missing values.
Note that in our experiments the missing value on the instance to classify occurs "missing completely at random" (MCAR), i.e. each descriptor has the same probability of being missing.
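A minimal sketch of the two imputation strategies, on simulated data (the variable names and the dataset are illustrative, not those of the tutorial):

set.seed(1)
train <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
train$y <- rbinom(200, 1, plogis(0.8 * train$x1 - 0.5 * train$x2))
model <- glm(y ~ x1 + x2, data = train, family = binomial)

# A new instance to classify, with x2 missing (MCAR)
new_case <- data.frame(x1 = 0.3, x2 = NA)

# (1) Univariate imputation: the mean of x2 on the learning sample
u <- new_case
u$x2 <- mean(train$x2)

# (2) Multivariate imputation: regress x2 on the other descriptors
# of the learning sample, then predict the missing value
imputer <- lm(x2 ~ x1, data = train)
m <- new_case
m$x2 <- predict(imputer, newdata = new_case)

predict(model, newdata = u, type = "response")   # univariate imputation
predict(model, newdata = m, type = "response")   # multivariate imputation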
Keywords: missing values, missing features, classification model, logistic regression, multiple linear regression, r software, glm, lm, NA
Components: Binary Logistic Regression
Tutorial: en_Tanagra_Missing_Values_Deployment.pdf
Dataset and programs (R language): md_logistic_reg_deployment.zip
References:
Howell, D.C., "Treatment of Missing Data".
M. Saar-Tsechansky, F. Provost, “Handling Missing Values when Applying Classification Models”, JMLR, 8, pp. 1625-1657, 2007.
Labels:
Software Comparison,
Supervised Learning
Sunday, October 14, 2012
Handling Missing Values in Logistic Regression
The handling of missing data is a difficult problem. Not because of its management, which is simple (we just flag the missing value with a specific code), but rather because of the consequences of its treatment on the characteristics of the models learned from the treated data.
We have already analyzed this problem in a previous paper, where we studied the impact of the various missing value treatment techniques on a decision tree learning algorithm (C4.5). In this paper, we repeat the analysis by examining their influence on the results of logistic regression. We consider the following configuration: (1) the missing values are MCAR; we wrote a program which randomly removes some values from the learning sample; (2) we apply logistic regression to the pre-treated training data, i.e. a dataset on which a missing value processing technique has been applied; (3) we evaluate the various treatment techniques by observing the accuracy rate of the classifier on a separate test sample which has no missing values.
First, we conduct the experiments with R. We compare the listwise deletion approach with univariate imputation (the mean for quantitative variables, the mode for categorical ones). We will see that the latter is a very viable approach in the MCAR situation. Then, we study the tools available in Orange, Knime and RapidMiner. We will observe that, despite their sophistication, they are not better than univariate imputation in our context.
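A minimal sketch of the two treatments compared in the R part of the experiments, on simulated data rather than the tutorial's datasets:

set.seed(2)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y <- rbinom(300, 1, plogis(d$x1 + d$x2))
d$x1[sample(300, 30)] <- NA   # MCAR removal of some values
d$x2[sample(300, 30)] <- NA

# (a) Listwise (casewise) deletion: drop the incomplete rows
m_del <- glm(y ~ x1 + x2, data = na.omit(d), family = binomial)

# (b) Univariate imputation: replace NA by the variable mean
# (the mode would be used for a categorical variable)
d_imp <- d
for (v in c("x1", "x2")) {
  d_imp[[v]][is.na(d_imp[[v]])] <- mean(d[[v]], na.rm = TRUE)
}
m_imp <- glm(y ~ x1 + x2, data = d_imp, family = binomial)

summary(m_del)$coefficients
summary(m_imp)$coefficients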
Keywords: missing value, missing data, logistic regression, listwise deletion, casewise deletion, univariate imputation, R software, glm
Tutorial: en_Tanagra_Missing_Values_Imputation.pdf
Dataset and programs: md_experiments.zip
References:
Howell, D.C., "Treatment of Missing Data".
Allison, P.D. (2001), "Missing Data". Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-136. Thousand Oaks, CA: Sage.
Little, R.J.A., Rubin, D.B. (2002), "Statistical Analysis with Missing Data", 2nd Edition, New York: John Wiley.
Labels:
Software Comparison,
Supervised Learning
Monday, September 24, 2012
Tanagra - Version 1.4.47
Non-iterative Principal Factor Analysis (PFA). This approach tries to detect underlying structures in the relationships between the variables of interest. Unlike PCA, PFA focuses only on the shared variance of the set of variables. It is suited when the goal is to uncover the latent structure of the variables. It works on a slightly modified version of the correlation matrix, where the diagonal, the prior communality estimate of each variable, is replaced by its squared multiple correlation with all the others (a small R sketch of this reduced matrix is given after the feature list below).
Harris Component Analysis. This is another non-iterative factor analysis approach which tries to detect underlying structures in the relationships between the variables of interest. Like Principal Factor Analysis, it focuses on the shared variance of the set of variables and works on a modified version of the correlation matrix.
Principal Component Analysis. Two functionalities are added: the reproduced and residual correlation matrices can be computed, and the variables can be sorted according to their loadings in the output tables.
These three components can be combined with the FACTOR ROTATION component (varimax or quartimax).
They can also be combined with the resampling approaches for detecting the relevant number of factors (PARALLEL ANALYSIS and BOOTSTRAP EIGENVALUES).
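As announced above, here is a small R illustration of the reduced correlation matrix on which the non-iterative PFA works, shown on the iris variables for convenience:

R <- cor(iris[, 1:4])   # correlation matrix of 4 variables

# Squared multiple correlation of variable j: 1 - 1 / (R^-1)[j, j]
smc <- 1 - 1 / diag(solve(R))

R_reduced <- R
diag(R_reduced) <- smc   # prior communality estimates on the diagonal
print(round(R_reduced, 3))

# The factors are extracted from the eigendecomposition of this
# reduced matrix, in a single pass (no iteration)
round(eigen(R_reduced)$values, 3)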
Download page: setup
Labels:
Exploratory Data Analysis
Saturday, September 1, 2012
Tanagra - Version 1.4.46
AFDM (factor analysis for mixed data). It extends principal component analysis (PCA) to data containing a mixture of quantitative and qualitative variables. The method was developed by Pagès (2004). A forthcoming tutorial will describe the use of the method and how to read the results.
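For readers who want to experiment before that tutorial is published: a comparable implementation of Pagès' method is available in R, for instance through the FAMD function of the FactoMineR package (an aside of mine, not part of Tanagra). A minimal sketch on a toy data frame:

library(FactoMineR)   # FAMD implements factor analysis for mixed data

d <- data.frame(
  weight = c(60, 72, 58, 90, 75),                # quantitative
  height = c(165, 180, 160, 185, 175),           # quantitative
  sex    = factor(c("F", "M", "F", "M", "M")),   # qualitative
  smoker = factor(c("no", "no", "yes", "yes", "no"))
)

res <- FAMD(d, ncp = 2, graph = FALSE)
res$eig   # eigenvalues of the mixed-data factor analysis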
Download page: setup
Labels:
Exploratory Data Analysis
Friday, July 6, 2012
CVM and BVM from the LIBCVM toolkit
The Support Vector Machines algorithms are well known in the supervised learning domain. They are especially appropriate when we handle a dataset with a large number "p" of descriptors, but they are much less efficient when the number of instances "n" is very high. Indeed, a naive implementation has O(n^3) time complexity and O(n^2) space complexity. Consequently, instead of the optimal solution, the learning algorithms often return near-optimal solutions within a tractable computation time.
I recently discovered the CVM (Core Vector Machine) and BVM (Ball Vector Machine) approaches. The idea of the authors is really clever: since only approximately optimal solutions can be found anyway, their approaches solve an equivalent problem which is easier to handle - the minimum enclosing ball problem of computational geometry - to detect the support vectors. So we get a classifier which is as accurate as those obtained by the other SVM learning algorithms, but with an enhanced ability to process datasets with a large number of instances.
I found the papers really interesting, all the more so because all the tools needed to reproduce the experiments are provided: the program and the datasets. So all the results shown in the paper can be verified. This contrasts with too many papers where the authors flaunt tremendous results that we can never reproduce.
The CVM and BVM methods are incorporated into the LIBCVM library. This is an extension of LIBSVM (version 2.85), which is already included in Tanagra. Since the source code for LIBCVM is available, I compiled it as a DLL (Dynamic-Link Library) and included it in Tanagra 1.4.44.
In this tutorial, we describe the behavior of the CVM and BVM supervised learning methods on the "Web" dataset available on the authors' website. We compare the results and the computation time with those of the C-SVC algorithm based on the LIBSVM library.
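As an aside for R users, the C-SVC baseline referenced here is the LIBSVM implementation, which R exposes through the e1071 package; a minimal sketch on a small stand-in dataset (the "Web" dataset itself is much larger):

library(e1071)   # svm() wraps LIBSVM; type "C-classification" is C-SVC

set.seed(3)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

model <- svm(Species ~ ., data = train, type = "C-classification",
             kernel = "radial", cost = 1)

pred <- predict(model, test)
table(observed = test$Species, predicted = pred)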
Keywords: support vector machine, svm, libcvm, cvm, bvm, libsvm, c-svc
Components: SELECT FIRST EXAMPLES, CVM, BVM, C-SVC
Tutorial: en_Tanagra_LIBCVM_library.pdf
Dataset: w8a.txt.zip
References:
I.W. Tsang, A. Kocsor, J.T. Kwok: LIBCVM Toolkit, Version 2.2 (beta)
C.C. Chang, C.J. Lin: LIBSVM - A Library for Support Vector Machines
Labels:
Supervised Learning
Wednesday, July 4, 2012
Revolution R Community 5.0
The R software is a fascinating project. It has become a reference tool for the data mining process. With the R package system, its features can be extended almost infinitely. Nearly all existing statistical / data mining techniques are available in R.
But if there are many packages, there are very few projects which intend to improve the R core itself. The source code is freely available; in theory, anyone can modify part or even the whole of the software. Revolution Analytics proposes an improved version of R. According to their website, their commercial product, Revolution R Enterprise: dramatically speeds up some calculations; can handle very large databases; provides a visual development environment with a debugger. Unfortunately, since it is a commercial tool, I could not check these features. Fortunately, a community version is available. Of course, I downloaded the tool to study its behavior.
Revolution R Community is a slightly improved version of base R. The enhancements essentially relate to computation performance: it incorporates the Intel Math Kernel Library, which is especially efficient for matrix calculations; it can also, in some circumstances, take advantage of multi-core processors. Performance benchmarks are available on the editor's website. The results are impressive, but we note that they are based on artificially generated datasets.
In this tutorial, we extend the benchmark to other data mining methods. We analyze the behavior of Revolution R Community 5.0 (64-bit version) in various contexts: binary logistic regression (glm); linear discriminant analysis (lda from the MASS package); induction of decision trees (rpart from the rpart package); and principal component analysis based on two different principles, the first computing the eigenvalues and eigenvectors of the correlation matrix (princomp), the second relying on a singular value decomposition of the data matrix (prcomp).
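A hedged sketch of the kind of timing comparison run in the tutorial, on simulated data (the tutorial uses its own datasets and runs the same operations under both R versions):

library(MASS)    # lda()
library(rpart)   # rpart()

set.seed(4)
n <- 5000; p <- 30
X <- matrix(rnorm(n * p), n, p)
d <- data.frame(X, y = factor(rbinom(n, 1, 0.5)))

system.time(glm(y ~ ., data = d, family = binomial))   # logistic regression
system.time(lda(y ~ ., data = d))                      # linear discriminant analysis
system.time(rpart(y ~ ., data = d))                    # decision tree
system.time(princomp(X))   # PCA via eigendecomposition of the covariance matrix
system.time(prcomp(X))     # PCA via a singular value decomposition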
Keywords: R software, revolution analytics, revolution r community, logistic regression, glm, linear discriminant analysis, lda, principal components analysis, acp, princomp, prcomp, matrix calculations, eigenvalues, eigenvectors, singular value decomposition, svd, decision tree, cart, rpart
Tutorial: en_Tanagra_Revolution_R_Community.pdf
Dataset: revolution_r_community.zip
References:
Revolution Analytics, "Revolution R Community".
Monday, July 2, 2012
Introduction to SAS proc logistic
In my courses at the university, I use only free data mining tools (R, Tanagra, Sipina, Knime, Orange, etc.) and spreadsheet applications (free or not). Sometimes my students ask me whether the commercial tools (e.g. SAS, which is very popular in France) behave differently, in terms of usage or of reading the results. I tell them that some of these commercial tools are available on the computers of our department, and that they can learn how to use them by starting from the tutorials available on the Web.
But unfortunately, such tutorials about logistic regression are not numerous, especially in French. We need a didactic document with clear screenshots which shows how to: (1) import a data file into a SAS library; (2) define an analysis with the appropriate settings; (3) read and understand the results.
In this tutorial, we describe the use of SAS PROC LOGISTIC (SAS 9.3). We measure its speed on a moderately sized dataset. We compare the results with those of Tanagra 1.4.43.
Keywords: sas, proc logistic, binary logistic regression
Components: BINARY LOGISTIC REGRESSION
Tutorial: en_Tanagra_SAS_Proc_Logistic.pdf
Dataset: wave_proc_logistic.zip
References:
SAS - "The LOGISTIC Procedure"
Tanagra - "Logistic regression - Software comparison"
Tanagra - "Logistic regression on large dataset"
Labels:
Software Comparison,
Supervised Learning
Saturday, June 30, 2012
SAS Add-In 4.3 for Excel
The connection between a data mining tool and a spreadsheet application such as Excel is a really valuable feature. We benefit from the power of the former, and from the popularity and ease of use of the latter. Many people use a spreadsheet in their data preparation phase. I recently presented an add-in for the connection between R and Excel. In this document, I describe a similar tool for the SAS software.
SAS is a popular tool, well known to statisticians. But using SAS is not really simple for non-specialists: we must know the command syntax before we can perform a statistical analysis. With the SAS add-in for Excel, some of these drawbacks are alleviated: we do not need to load and organize the dataset into a SAS library; we do not need to know the command syntax to define an analysis and set the associated parameters (we use a menu and dialog boxes instead); and the results are automatically incorporated into a new sheet of the Excel workbook (post-processing of the results becomes easy).
In this tutorial, I describe the behavior of the add-in for various kinds of analyses (nonparametric statistics, logistic regression). We compare the results with those of Tanagra.
Keywords: excel, sas, add-on, add-in, logistic regression, nonparametric test
Components: MANN-WHITNEY COMPARISON, KRUSKAL-WALLIS 1-WAY ANOVA, MEDIAN TEST, VAN DER WAERDEN 1-WAY ANOVA, ANSARI-BRADLEY SCALE TEST, KLOTZ SCALE TEST, MOOD SCALE TEST
Tutorial: en_Tanagra_SAS_AddIn_4_3_for_Excel.pdf
Dataset: scoring_dataset.xls
References:
SAS - http://www.sas.com/
SAS - "SAS Add-in for Microsoft Office"
Tanagra Tutorial - "Tanagra Add-In for Office 2007 and Office 2010"
Labels:
Data file handling,
Software Comparison,
Supervised Learning
Tuesday, June 12, 2012
Tanagra - Version 1.4.45
New features for the principal component analysis (PCA).
PRINCIPAL COMPONENT ANALYSIS. Additional outputs for the component: scree plot and cumulative explained variance curve; PCA Correlation Matrix - outputs for the detection of the significant factors (Kaiser-Guttman, Karlis-Saporta-Spinaki, Legendre-Legendre broken-stick test); PCA Correlation Matrix - Bartlett's sphericity test and the Kaiser measure of sampling adequacy (MSA); PCA Correlation Matrix - the correlation matrix and the partial correlations between each pair of variables controlling for all the others (the negative anti-image correlation).
PARALLEL ANALYSIS. The component calculates the distribution of the eigenvalues for a set of randomly generated datasets. It proceeds by randomization. It applies to principal component analysis and to multiple correspondence analysis. A factor is considered significant if its observed eigenvalue is greater than the 95th percentile of this distribution (this setting can be modified); a short R sketch of the idea is given after the feature list below.
BOOTSTRAP EIGENVALUES. It computes confidence intervals for the eigenvalues by a bootstrap approach. A factor is considered significant if its eigenvalue is greater than a threshold which depends on the underlying factor method (PCA or MCA), or if the lower bound of its eigenvalue interval is greater than the upper bound of the interval of the following factor. The default confidence level of 0.90 can be modified. This component can be applied to principal component analysis or multiple correspondence analysis.
JITTERING. A jittering feature is incorporated into the scatter plot components (SCATTERPLOT, CORRELATION SCATTERPLOT, SCATTERPLOT WITH LABEL, VIEW MULTIPLE SCATTERPLOT).
RANDOM FOREST. Unused memory is released after the decision tree learning process. This is especially useful with ensemble learning approaches, where a large number of trees are stored in memory (BAGGING, BOOSTING, RANDOM FOREST). Memory occupation is reduced and computation capacity is improved.
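As announced above, a compact R sketch of the parallel analysis idea, shown on the iris variables (the 95th percentile threshold matches the component's default):

X <- scale(iris[, 1:4])
n <- nrow(X); p <- ncol(X)
obs <- eigen(cor(X))$values   # observed eigenvalues

# Eigenvalues of B randomly generated datasets of the same size
B <- 200
rand <- replicate(B, eigen(cor(matrix(rnorm(n * p), n, p)))$values)
threshold <- apply(rand, 1, quantile, probs = 0.95)   # 95th percentile

# A factor is retained if its observed eigenvalue exceeds the threshold
data.frame(eigenvalue  = round(obs, 3),
           threshold   = round(threshold, 3),
           significant = obs > threshold)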
Download page: setup
Labels:
Decision tree,
Exploratory Data Analysis
Monday, May 14, 2012
Tanagra - Version 1.4.44
LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). Update of the LIBSVM library for support vector machine algorithms (version 3.12, April 2012) [C-SVC, epsilon-SVR, nu-SVR]. The calculations are faster. The attributes can now be normalized or not; previously they were always normalized automatically.
LIBCVM (http://c2inet.sce.ntu.edu.sg/ivor/cvm.html; version 2.2). Incorporation of the LIBCVM library. Two methods are available: CVM and BVM (Core Vector Machine and Ball Vector Machine). The descriptors can be normalized or not.
TR-IRLS (http://autonlab.org/autonweb/10538). Update of the TR-IRLS library for logistic regression on large datasets (large number of predictive attributes) [last available version: 2006/05/08]. The deviance is automatically provided. The regression coefficients are displayed with more precision (a higher number of decimals). The user can tune the learning algorithm, especially the stopping rules.
SPARSE DATA FILE. Tanagra can now handle the sparse data file format (see the SVMlight or libsvm file formats; a small sample is shown after the feature list below). The data can be used for supervised learning or regression problems. A description of this kind of file is available online (http://c2inet.sce.ntu.edu.sg/ivor/cvm.html).
INSTANCE SELECTION. A new component for selecting the first m individuals among n in a branch of the diagram is available [SELECT FIRST EXAMPLES]. This option is useful when the data file is the result of the concatenation of the learning and test samples.
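Here is the small sample of the sparse format announced above: each row holds the class label, followed by index:value pairs for the non-zero attributes only (the values below are invented for illustration).

+1 1:0.43 5:1.0 12:0.87
-1 3:0.12 5:0.5
+1 2:1.0 12:0.33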
Download page: setup
Labels:
Data file handling,
Software Comparison,
Supervised Learning
Thursday, May 3, 2012
Using PDI-CE for model deployment (PMML)
Model deployment is a crucial task of the data mining process. In supervised learning, it can mean applying the predictive model to new unlabeled cases. We have already described this task for various tools (e.g. Tanagra, Sipina, Spad, R). They share a common feature: the same tool is used for both the model construction and the model deployment.
In this tutorial, we describe a process where the tool used for the model construction differs from the one used for the deployment. This is only possible if (1) the model is described in a standard format, and (2) the tool used for the deployment can handle both the database with the unlabeled instances and the model. Here, we use the PMML standard for sharing the model, and PDI-CE (Pentaho Data Integration Community Edition) for applying the model to the unseen cases.
We create a decision tree with various tools such as SIPINA, KNIME or RAPIDMINER; we export the model in the PMML format; then we use PDI-CE to apply the model to a data file containing unlabeled instances. We will see that the use of the PMML standard dramatically enhances the power of both the data mining tool and the ETL tool.
In addition, we describe other deployment solutions in this tutorial. We will see that Knime has its own PMML reader: it can apply a model to unlabeled datasets whatever the tool used to build the model, as long as the PMML standard is respected. In this sense, Knime can be substituted for PDI-CE. As another possible solution, Weka, which is included in the Pentaho Community Edition suite, can export the model in a proprietary format that PDI-CE can handle.
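As an aside for R users (R is not one of the tools used in this tutorial), the same PMML export idea is available through the pmml package; a minimal sketch for an rpart decision tree:

library(rpart)   # decision tree learner
library(pmml)    # converts fitted R models to PMML documents
library(XML)     # saveXML()

tree <- rpart(Species ~ ., data = iris)
doc  <- pmml(tree)

# Write the XML to a file that a PMML consumer (e.g. PDI-CE or Knime)
# could then apply to unlabeled instances
saveXML(doc, file = "tree.pmml")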
Keywords: model deployment, predictive model, pmml, decision tree, rapidminer 5.0.10, weka 3.7.2, knime 2.1.1, sipina 3.4
Tutorial: en_Tanagra_PDI_Model_Deployment.pdf
Dataset: heart-pmml.zip
References:
Data Mining Group, "PMML standard"
Pentaho, "Pentaho Kettle Project"
Pentaho, "Using the Weka Scoring Plugin"
Labels:
Decision tree,
Software Comparison,
Supervised Learning
Sunday, April 22, 2012
Pentaho Data Integration - Kettle
The Pentaho BI Suite is an open source Business Intelligence suite with integrated reporting, dashboard, data mining, workflow and ETL capabilities (http://en.wikipedia.org/wiki/Pentaho).
In this tutorial, we talk about the Pentaho BI Suite Community Edition (CE), which is freely downloadable. More precisely, we present Pentaho Data Integration (PDI-CE), also called Kettle. We briefly show how to load a dataset and perform a simple data analysis. The main goal of this tutorial is to introduce the next one, which focuses on the deployment of models designed with Knime, Sipina or Weka by using PDI-CE.
This document is based on the 4.0.1 stable version of PDI-CE.
Keywords: ETL, pentaho data integration, community edition, kettle, BI, business intelligence, data importation, data transformation, data cleansing
Tutorial: PDI-CE
Dataset: titanic32x.csv.zip
References:
Pentaho, Pentaho Community
Labels:
Data file handling,
Software Comparison
Monday, April 9, 2012
Mining frequent itemsets
Searching for regularities in datasets is the main goal of data mining, and these regularities may take various forms. In market basket analysis, we search for co-occurrences of goods (items), i.e. goods which are often purchased together. These are called "frequent itemsets". For instance, one result may be: "milk and bread are purchased together in 10% of the baskets".
Frequent itemset mining is often presented as the step preceding the association rule learning algorithm. At the end of that process, we highlight the direction of the relation: we obtain rules. For instance, a rule may be "90% of the customers who buy milk and bread also purchase butter". This kind of rule can be used in various ways; for instance, we can promote the sales of milk and bread in order to increase the sales of butter.
In fact, frequent itemsets themselves already provide valuable information. Detecting the goods which are purchased together helps us understand the relations between them. It is a kind of variant of clustering analysis: we search for the items which come together. We can use this kind of information, for instance, to reorganize the shelves of the store.
In this tutorial, we describe the use of the FREQUENT ITEMSETS component of Tanagra, which is based on Borgelt's "apriori.exe" program. We use a very small dataset, so that everyone can reproduce the calculations manually. We begin with some definitions about the frequent itemset mining process.
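For the R side mentioned in the keywords, a minimal sketch of frequent itemset mining with the arules package, on a tiny invented basket list:

library(arules)

baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("bread", "butter"),
  c("milk", "bread", "butter"),
  c("milk")
)
trans <- as(baskets, "transactions")

# Frequent itemsets with support >= 40%
fi <- apriori(trans,
              parameter = list(supp = 0.4, target = "frequent itemsets"))
inspect(sort(fi, by = "support"))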
Keywords: frequent itemsets, closed itemsets, maximal itemsets, generator itemsets, association rules, R software, arules package
Components: FREQUENT ITEMSETS
Tutorial: en_Tanagra_Itemset_Mining.pdf
Dataset: itemset_mining.zip
References:
C. Borgelt, "Apriori - Association Rule Induction / Frequent Item Set Mining"
R. Lovin, "Mining Frequent Patterns"
Labels:
Association rules,
Software Comparison
Sunday, April 1, 2012
Sipina add-on for OOCalc
Combining a spreadsheet with data mining tools is essential for the popularity of the latter. Indeed, when we deal with a moderately sized dataset (thousands of rows and tens of variables), the spreadsheet is a practical tool for data preparation. It is also a valuable tool for preparing reports. It is thus not surprising that Excel, and more generally the spreadsheet, is one of the tools most used by data miners.
Both Tanagra and Sipina provide an add-on for Excel. The add-on inserts a data mining menu into the spreadsheet. The user can select the dataset and send it to Tanagra (or Sipina), which is automatically launched. But only Tanagra provides an add-on for Open Office Calc and Libre Office Calc; none was available for Sipina.
This omission has been corrected in the new version of Sipina (Sipina 3.9). In this tutorial, we show how to install and use the "SipinaLibrary.oxt" add-on for Open Office Calc 3.3.0 (OOCalc). The process is the same for Libre Office 3.5.1.
Keywords: calc, open office, libre office, oocalc, add-on, add-in, sipina
Tutorial: en_sipina_calc_addon.pdf
Dataset: heart.xls
References:
Tanagra Tutorial - Sipina add-in for Excel
Tanagra Tutorial - Tanagra add-on for Open Office Calc 3.3
Open Office - http://www.openoffice.org
Libre Office - http://www.libreoffice.org/
Labels:
Data file handling,
Sipina