Sunday, December 30, 2012

Discriminant Correspondence Analysis

The aim of canonical discriminant analysis is to explain the membership of the instances of a dataset in pre-defined groups. The groups are specified by a dependent categorical variable (class attribute, response variable); the explanatory variables (descriptors, predictors, independent variables) are all continuous. We obtain a small number of latent variables which separate the groups as well as possible. These new features, called factors, are linear combinations of the initial descriptors. The process is a valuable dimensionality reduction technique. But its main drawback is that it cannot be directly applied when the descriptors are discrete. Even if the calculations remain possible when we recode the variables, using dummy variables for instance, the interpretation of the results - which is one of the main goals of canonical discriminant analysis - is not really obvious.

In this tutorial, we present a variant of discriminant analysis, due to Hervé Abdi (2007), which is applicable to discrete descriptors. The approach is based on a transformation of the raw dataset into a kind of contingency table. The rows of the table correspond to the values of the target attribute; the columns are the indicators associated with the predictors' values. The author then suggests using a correspondence analysis, on the one hand to distinguish the groups, and on the other hand to detect the relevant relationships between the values of the target attribute and those of the explanatory variables. He calls his approach "discriminant correspondence analysis" because it uses a correspondence analysis framework to solve a discriminant analysis problem.

In what follows, we detail the use of discriminant correspondence analysis with Tanagra 1.4.48. We use the example described in Hervé Abdi's paper. The goal is to explain the origin of 12 wines (3 possible regions) using 5 descriptors related to characteristics assessed by professional tasters. In a second part (section 3), we reproduce all the calculations with a program written in R.
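As a preview of the R program of section 3, here is a minimal sketch of the core transformation, using the ca package listed below (the data frame wines and the column region are illustrative names, with the target assumed in the first column):

# build the crosstab: one row per region, one column per (variable, value) pair
library(ca)
crosstab <- do.call(cbind, lapply(wines[-1], function(v) table(wines$region, v)))
res <- ca(crosstab)   # correspondence analysis on the grouped indicator table
summary(res)          # eigenvalues, contributions, quality of representation
plot(res)             # joint map of the regions and the descriptor values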

Keywords: canonical discriminant analysis, descriptive discriminant analysis, correspondence analysis, R software, xlsx package, ca package
Components: DISCRIMINANT CORRESPONDENCE ANALYSIS
Tutorial: Tutorial DCA
Dataset: french_wine_dca.zip
References:
H. Abdi, "Discriminant correspondence analysis", in N.J. Salkind (Ed.): Encyclopedia of Measurement and Statistics. Thousand Oaks (CA): Sage, pp. 270-275, 2007.

Saturday, December 1, 2012

Tanagra - Version 1.4.48

New components have been added.

K-Means Strengthening. This component was suggested to me by Mrs. Claire Gauzente. The idea is to strengthen an existing partition (e.g. from a HAC) by using the K-Means algorithm. A comparison of the groups before and after optimization is provided, indicating the efficiency of the optimization. The approach can be plugged into any clustering algorithm in Tanagra. Thanks to Claire for this valuable idea.
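Outside Tanagra, the idea can be sketched in a few lines of R (X is an illustrative data frame of numeric variables):

# strengthen an HAC partition by running k-means seeded with the HAC centroids
hc  <- hclust(dist(X), method = "ward.D2")
g0  <- cutree(hc, k = 3)                              # initial partition (3 groups)
cen <- aggregate(X, by = list(g0), FUN = mean)[, -1]  # centroid of each group
km  <- kmeans(X, centers = as.matrix(cen))            # k-means started from them
table(before = g0, after = km$cluster)                # cross-tab of the two partitions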

Discriminant Correspondence Analysis. This is an extension of canonical discriminant analysis to discrete attributes (Hervé Abdi, 2007). The approach is based on a clever transformation of the dataset. The initial dataset is transformed into a crosstab: the values of the target attribute are in rows, all the values of the input attributes are in columns. The algorithm performs a correspondence analysis on this new data table to identify the associations between the values of the target and the input variables. Thus, we have at our disposal the tools of correspondence analysis for a comprehensive reading of the results (factor scores, contributions, quality of representation).

Other components have been improved.

HAC. After the choice of the number of groups in the dendrogram of the Hierarchical Agglomerative Clustering, a final pass on the data is performed: it assigns each individual of the learning sample to the group with the nearest centroid. Thus, there may be a discrepancy between the number of instances displayed on the tree nodes and the number of individuals in the groups. Tanagra displays both partitions. Only the last one is used when Tanagra applies the clustering model to new instances, when it computes conditional statistics, etc.

Correspondence Analysis. Tanagra now provides the coefficients of the factor score functions for supplementary columns and rows in the factorial correspondence analysis. Thus, it is possible to easily calculate the factor scores of new points described by their row or column profile. Finally, the results tables can be sorted according to the contributions of the modalities to the factors.

Multiple Correspondence Analysis. Several improvements have been made to the multiple correspondence analysis: the component can take into account supplementary continuous and discrete variables; the variables can be sorted according to their contribution to the factors; all the indicators for the interpretation can be brought together in a single large table for a synthetic visualization of the results (this feature is especially interesting if we have a small number of factors); the coefficients of the factor score functions are provided, so we can easily calculate the factor coordinates of supplementary individuals outside Tanagra.

Some tutorials will come soon to describe the use of these components on realistic case studies.

Download page: setup

Monday, November 5, 2012

Linear Discriminant Analysis - Tools comparison

Linear discriminant analysis is a popular method in statistics, machine learning and pattern recognition. Indeed, it has interesting properties: it is rather fast on large databases; it naturally handles multi-class problems (target attribute with more than 2 values); it generates a linear classifier that is easy to interpret; it is robust and fairly stable, even on small databases; it has an embedded variable selection mechanism. Personally, I appreciate linear discriminant analysis because it allows multiple interpretations (probabilistic, geometric) and thus highlights various aspects of supervised learning.

In this tutorial, we highlight the similarities and the differences between the outputs of Tanagra, R (MASS and klaR packages), SAS, and SPSS. The main conclusion is that, although the presentation is not always the same, ultimately we obtain exactly the same results. This is what matters most.

Keywords: linear discriminant analysis, predictive discriminant analysis, canonical discriminant analysis, variable selection, feature selection, sas, stepdisc, candisc, R software, xlsx package, MASS package, lda, klaR package, greedy.wilks, confusion matrix, resubstitution error rate
Components: LINEAR DISCRIMINANT ANALYSIS, CANONICAL DISCRIMINANT ANALYSIS, STEPDISC
Tutorial: en_Tanagra_LDA_Comparisons.pdf
Dataset: alcohol
References:
Wikipedia - "Linear Discriminant Analysis"

Monday, October 29, 2012

Handling missing values in prediction process

The treatment of missing values during the learning process has received a lot of attention from researchers. We have published a tutorial about this in the context of logistic regression induction. By contrast, the handling of missing values during the classification process, i.e. when we apply the classifier to an unlabeled instance, is less studied. However, the problem is important. Indeed, the model is designed to work only when the instance to label is fully described. If some values are not available, we cannot directly apply the model. We need a strategy to overcome this difficulty.

In this tutorial, we are in the supervised learning context. The classifier is a logistic regression model. All the descriptors are continuous. We want to evaluate, on various datasets from the UCI repository, the behavior of two imputation methods: the univariate approach and the multivariate approach. The constraint is that the imputation models must rely only on information from the learning sample. We assume that the latter contains no missing values.

We note that, in our experiments, the occurrence of a missing value on the instance to classify is "missing completely at random", i.e. each descriptor has the same probability of being missing.
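A minimal sketch of the two strategies, with the imputation fitted on the learning sample only (train, new_case, logit_model and the variable names are illustrative; the class is assumed in column 1 of train):

# (a) univariate: replace a missing x1 by its mean on the learning sample
new_case$x1 <- mean(train$x1)
# (b) multivariate: predict x1 from the other descriptors, with a linear
#     regression fitted on the (complete) learning sample
m1 <- lm(x1 ~ ., data = train[, -1])
new_case$x1 <- predict(m1, newdata = new_case)
# the completed instance is then passed to the classifier
predict(logit_model, newdata = new_case, type = "response")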

Keywords: missing values, missing features, classification model, logistic regression, multiple linear regression, r software, glm, lm, NA
Components: Binary Logistic Regression
Tutorial: en_Tanagra_Missing_Values_Deployment.pdf
Dataset and programs (R language): md_logistic_reg_deployment.zip
References:
Howell, D.C., "Treatment of Missing Data".
M. Saar-Tsechansky, F. Provost, “Handling Missing Values when Applying Classification Models”, JMLR, 8, pp. 1625-1657, 2007.

Sunday, October 14, 2012

Handling Missing Values in Logistic Regression

The handling of missing data is a difficult problem. Not because of its management, which is simple (we just flag the missing value with a specific code), but rather because of the consequences of the treatment on the characteristics of the models learned from the treated data.

We have already analyzed this problem in a previous paper, where we studied the impact of the various missing value treatment techniques on a decision tree learning algorithm (C4.5). In this paper, we repeat the analysis by examining their influence on the results of logistic regression. We consider the following configuration: (1) missing values are MCAR: we wrote a program which randomly removes some values from the learning sample; (2) we apply logistic regression to the pre-treated training data, i.e. a dataset on which a missing value processing technique has been applied; (3) we evaluate the various treatment techniques by observing the accuracy rate of the classifier on a separate test sample which has no missing values.

First, we conduct the experiments with R. We compare the listwise deletion approach with univariate imputation (the mean for the quantitative variables, the mode for the categorical ones). We will see that the latter is a very viable approach in the MCAR situation. Then, we study the tools available in Orange, Knime and RapidMiner. We will observe that, despite their sophistication, they are not better than univariate imputation in our context.
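A condensed sketch of the R part of the experiment (the data frames train and test, with a 0/1 class y in column 1, are illustrative):

# (1) remove values completely at random among the descriptors
inject_mcar <- function(df, prop = 0.1) {
  for (j in seq_along(df)) df[runif(nrow(df)) < prop, j] <- NA
  df
}
train_na     <- train
train_na[-1] <- inject_mcar(train[-1])
# (2a) listwise deletion: glm drops the incomplete rows by default
m_lw <- glm(y ~ ., data = train_na, family = binomial)
# (2b) univariate imputation: mean for numeric, mode for categorical
impute <- function(v) {
  if (is.numeric(v)) v[is.na(v)] <- mean(v, na.rm = TRUE)
  else               v[is.na(v)] <- names(which.max(table(v)))
  v
}
train_imp <- as.data.frame(lapply(train_na, impute))
m_imp <- glm(y ~ ., data = train_imp, family = binomial)
# (3) accuracy rate on a complete, untouched test sample
mean((predict(m_imp, test, type = "response") > 0.5) == test$y)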

Keywords: missing value, missing data, logistic regression, listwise deletion, casewise deletion, univariate imputation, R software, glm
Tutorial: en_Tanagra_Missing_Values_Imputation.pdf
Dataset and programs: md_experiments.zip
References:
Howell, D.C., "Treatment of Missing Data".
Allison, P.D. (2001), "Missing Data". Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-136. Thousand Oaks, CA: Sage.
Little, R.J.A., Rubin, D.B. (2002), "Statistical Analysis with Missing Data", 2nd Edition, New York: John Wiley.

Monday, September 24, 2012

Tanagra - Version 1.4.47

Non-iterative Principal Factor Analysis (PFA). This is an approach which tries to detect underlying structures in the relationships between the variables of interest. Unlike PCA, PFA focuses only on the shared variance of the set of variables. It is suited to situations where the goal is to uncover the latent structure of the variables. It works on a slightly modified version of the correlation matrix, where each diagonal element, the prior communality estimate of a variable, is replaced by its squared multiple correlation with all the others.
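A short sketch of this reduced correlation matrix in R (X is an illustrative data frame of numeric variables):

R   <- cor(X)
smc <- 1 - 1 / diag(solve(R))   # squared multiple correlation of each variable
Rr  <- R
diag(Rr) <- smc                 # prior communalities on the diagonal
e   <- eigen(Rr)                # non-iterative principal factor solution
loadings <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))   # e.g. 2 factors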

Harris Component Analysis. This is a non-iterative factor analysis approach. It tries to detect underlying structures in the relationships between the variables of interest. Like Principal Factor Analysis, it focuses on the shared variance of the set of variables. It works on a modified version of the correlation matrix.

Principal Component Analysis. Two functionalities have been added: the reproduced and residual correlation matrices can be computed, and the variables can be sorted according to their loadings in the output tables.

These three components can be combined with the FACTOR ROTATION component (varimax or quartimax).

They can also be combined with the resampling approaches for the detection of the relevant number of factors (PARALLEL ANALYSIS and BOOTSTRAP EIGENVALUES).

Download page: setup

Saturday, September 1, 2012

Tanagra - Version 1.4.46

AFDM (Factor Analysis for Mixed Data). It extends principal component analysis (PCA) to data containing a mixture of quantitative and qualitative variables. The method was developed by Pagès (2004). A tutorial will follow, describing the use of the method and the reading of the results.

Download page: setup

Friday, July 6, 2012

CVM and BVM from the LIBCVM toolkit

The Support Vector Machines algorithms are well known in the supervised learning domain. They are especially appropriate when we handle a dataset with a large number "p" of descriptors. But they are much less efficient when the number of instances "n" is very high. Indeed, a naive implementation has complexity O(n^3) for the computation time and O(n^2) for the storage of the values. As a consequence, instead of the optimal solution, the learning algorithms often return near-optimal solutions within a tractable computation time.

I recently discovered the CVM (Core Vector Machine) and BVM (Ball Vector Machine) approaches. The idea of the authors is really clever: since only approximately optimal solutions can be found anyway, their approaches solve an equivalent problem which is easier to handle - the minimum enclosing ball problem in computational geometry - to detect the support vectors. So, we obtain a classifier which is as efficient as those produced by the other SVM learning algorithms, but with an enhanced ability to process datasets with a large number of instances.

I found the papers really interesting. They are all the more interesting as all the tools enabling the reproduction of the experiments are provided: the program and the datasets. So, all the results shown in the papers can be verified. This contrasts with too many papers where the authors flaunt tremendous results that we can never reproduce.

The CVM and BVM methods are incorporated into the LIBCVM library. This is an extension of LIBSVM (version 2.85), which is already included in Tanagra. The source code of LIBCVM being available, I compiled it as a DLL (Dynamic-link Library) and included it in Tanagra 1.4.44.

In this tutorial, we describe the behavior of the CVM and BVM supervised learning methods on the "Web" dataset available on the website of the authors. We compare the results and the computation times with those of the C-SVC algorithm based on the LIBSVM library.

Keywords: support vector machine, svm, libcvm, cvm, bvm, libsvm, c-svc
Components: SELECT FIRST EXAMPLES, CVM, BVM, C-SVC
Tutorial: en_Tanagra_LIBCVM_library.pdf
Dataset: w8a.txt.zip
References:
I.W. Tsang, A. Kocsor, J.T. Kwok : LIBCVM Toolkit, Version: 2.2 (beta)
C.C Chang, C.J. Lin : LIBSVM -- A Library for Support Vector Machines

Wednesday, July 4, 2012

Revolution R Community 5.0

The R software is a fascinating project. It has become a reference tool for the data mining process. With the R package system, we can extend its features almost infinitely. Nearly all existing statistical / data mining techniques are available in R.

But while there are many packages, there are very few projects which aim to improve the R core itself. The source code is freely available; in theory anyone can modify a part or even the whole software. Revolution Analytics offers an improved version of R, Revolution R Enterprise. According to their website: it dramatically speeds up some calculations; it can handle very large databases; it provides a visual development environment with a debugger. Unfortunately, this is a commercial tool, and I could not check these features. Fortunately, a community version is available. Of course, I downloaded the tool to study its behavior.

Revolution R Community is a slightly improved version of base R. The enhancements essentially concern calculation performance: it incorporates the Intel Math Kernel Library, which is especially efficient for matrix calculations; it can also take advantage, in some circumstances, of the power of multi-core processors. Performance benchmarks are available on the editor's website. The results are impressive. But we note that they are based on artificially generated datasets.

In this tutorial, we extend the benchmark to other data mining methods. We analyze the behavior of Revolution R Community 5.0 - 64-bit version in various contexts: binary logistic regression (glm); linear discriminant analysis (lda from the MASS package); induction of decision trees (rpart from the rpart package); principal component analysis based on two different principles, the first one resting on the calculation of the eigenvalues and eigenvectors of the correlation matrix (princomp), the second one on a singular value decomposition of the data matrix (prcomp).
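The timing protocol boils down to calls of the following form (the data frame dat, with target y in the first column, is illustrative):

library(MASS); library(rpart)
system.time(glm(y ~ ., data = dat, family = binomial))   # logistic regression
system.time(lda(y ~ ., data = dat))                      # linear discriminant analysis
system.time(rpart(y ~ ., data = dat))                    # decision tree
system.time(princomp(dat[, -1], cor = TRUE))             # PCA, eigen decomposition
system.time(prcomp(dat[, -1], scale. = TRUE))            # PCA, SVD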

Keywords: R software, revolution analytics, revolution r community, logistic regression, glm, linear discriminant analysis, lda, principal components analysis, acp, princomp, prcomp, matrix calculations, eigenvalues, eigenvectors, singular value decomposition, svd, decision tree, cart, rpart
Tutorial: en_Tanagra_Revolution_R_Community.pdf
Dataset: revolution_r_community.zip
References:
Revolution Analytics, "Revolution R Community".

Monday, July 2, 2012

Introduction to SAS proc logistic

In my courses at the university, I use only free data mining tools (R, Tanagra, Sipina, Knime, Orange, etc.) and spreadsheet applications (free or not). Sometimes, my students ask me if the commercial tools (e.g. SAS, which is very popular in France) behave differently, in terms of usage or of reading the results. I tell them that some of these commercial tools are available on the computers of our department. They can learn how to use them by starting from the tutorials available on the Web.

But unfortunately, tutorials about logistic regression are not numerous, especially in French. We need a didactic document with clear screenshots which shows how to: (1) import a data file into a SAS library; (2) define an analysis with the appropriate settings; (3) read and understand the results.

In this tutorial, we describe the use of SAS PROC LOGISTIC (SAS 9.3). We measure its speed on a moderately sized dataset. We compare the results with those of Tanagra 1.4.43.

Keywords: sas, proc logistic, binary logistic regression
Components: BINARY LOGISTIC REGRESSION
Tutorial: en_Tanagra_SAS_Proc_Logistic.pdf
Dataset: wave_proc_logistic.zip
References:
SAS - "The LOGISTIC Procedure"
Tanagra - "Logistic regression - Software comparison"
Tanagra - "Logistic regression on large dataset"

Saturday, June 30, 2012

SAS Add-In 4.3 for Excel

The connection between a data mining tool and a spreadsheet application such as Excel is a really valuable feature. We benefit from the power of the former, and from the popularity and ease of use of the latter. Many people use a spreadsheet in their data preparation phase. Recently, I presented an add-in for the connection between R and Excel. In this document, I describe a similar tool for the SAS software.

SAS is a popular tool, well known to statisticians. But the use of SAS is not really simple for non-specialists. We must know the syntax of the commands before performing a statistical analysis. With the SAS add-in for Excel, some of the SAS drawbacks are alleviated: we do not need to load and organize the dataset into a SAS library; we do not need to know the command syntax to perform an analysis and set the associated parameters (we use a menu and dialog boxes instead); the results are automatically incorporated into a new sheet of the Excel workbook (the post-processing of the results becomes easy).

In this tutorial, I describe the behavior of the add-in for various kinds of analyses (nonparametric statistics, logistic regression). We compare the results with those of Tanagra.

Keywords: excel, sas, add-on, add-in, logistic regression, nonparametric test
Components: MANN-WHITNEY COMPARISON, KRUSKAL-WALLIS 1-WAY ANOVA, MEDIAN TEST, VAN DER WAERDEN 1-WAY ANOVA, ANSARI-BRADLEY SCALE TEST, KLOTZ SCALE TEST, MOOD SCALE TEST
Tutorial: en_Tanagra_SAS_AddIn_4_3_for_Excel.pdf
Dataset: scoring_dataset.xls
References:
SAS - http://www.sas.com/
SAS - "SAS Add-in for Microsoft Office"
Tanagra Tutorial - "Tanagra Add-In for Office 2007 and Office 2010"

Tuesday, June 12, 2012

Tanagra - Version 1.4.45

New features for the principal component analysis (PCA).

PRINCIPAL COMPONENT ANALYSIS. Additional outputs for the component: scree plot and cumulative explained variance curve. When the PCA is performed on the correlation matrix, the component also provides: outputs for the detection of the significant factors (Kaiser-Guttman, Karlis-Saporta-Spinaki, Legendre-Legendre broken-stick test); Bartlett's sphericity test and the Kaiser measure of sampling adequacy (MSA); the correlation matrix and the partial correlations between each pair of variables controlling for all the other variables (the negative anti-image correlation).

PARALLEL ANALYSIS. The component calculates the distribution of the eigenvalues for a set of randomly generated datasets. It proceeds by randomization. It applies to principal component analysis and to multiple correspondence analysis. A factor is considered significant if its observed eigenvalue is greater than the 95th percentile of this distribution (this setting can be modified).
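For a PCA, the procedure can be sketched as follows in R (X is an illustrative numeric data frame; some implementations randomize by permuting the observed columns instead of generating normal data):

obs  <- eigen(cor(X))$values                  # observed eigenvalues
B    <- 200                                   # number of random replications
rnd  <- replicate(B, eigen(cor(matrix(rnorm(nrow(X) * ncol(X)),
                                      nrow = nrow(X))))$values)
crit <- apply(rnd, 1, quantile, probs = 0.95) # 95th percentile per rank
which(obs > crit)                             # factors deemed significant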

BOOTSTRAP EIGENVALUES. It calculates confidence intervals of the eigenvalues by a bootstrap approach. A factor is considered significant if its eigenvalue is greater than a threshold which depends on the underlying factor method (PCA or MCA), or if the lower bound of its eigenvalue is greater than the upper bound of the eigenvalue of the following factor. The default confidence level of 0.90 can be modified. This component can be applied to principal component analysis or to multiple correspondence analysis.

JITTERING. A jittering feature has been incorporated into the scatter plot components (SCATTERPLOT, CORRELATION SCATTERPLOT, SCATTERPLOT WITH LABEL, VIEW MULTIPLE SCATTERPLOT).

RANDOM FOREST. Unused memory is released after the decision tree learning process. This feature is especially useful when we use an ensemble learning approach which stores a large number of trees in memory (BAGGING, BOOSTING, RANDOM FOREST). Memory occupation is reduced and processing capacity improved.

Download page: setup

Monday, May 14, 2012

Tanagra - Version 1.4.44

LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). Update of the LIBSVM library for support vector machine algorithms (version 3.12, April 2012) [C-SVC, Epsilon-SVR, nu-SVR]. The calculations are faster. The attributes can be normalized or not; previously they were always automatically normalized.

LIBCVM (http://c2inet.sce.ntu.edu.sg/ivor/cvm.html; version 2.2). Incorporation of the LIBCVM library. Two methods are available: CVM and BVM (Core Vector Machine and Ball Vector Machine). The descriptors can be normalized or not.

TR-IRLS (http://autonlab.org/autonweb/10538). Update of the TR-IRLS library for logistic regression on large datasets (large number of predictive attributes) [last available version - 2006/05/08]. The deviance is now provided automatically. The display of the regression coefficients is more precise (more decimals). The user can tune the learning algorithm, especially the stopping rules.

SPARSE DATA FILE. Tanagra can now handle the sparse data file format (see the SVMlight or libsvm file formats). The data can be used for a supervised learning process or a regression problem. A description of this kind of file is available online (http://c2inet.sce.ntu.edu.sg/ivor/cvm.html).
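For reference, a few lines in this sparse format look like the following (hypothetical sample): each row gives the class label, then only the non-zero attributes as index:value pairs.

+1 3:1 11:0.25 14:1
-1 2:0.43 6:1 15:0.21
+1 1:1 9:1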

INSTANCE SELECTION. A new component for the selection of the first m individuals among n in a branch of the diagram is available [SELECT FIRST EXAMPLES]. This option is useful when the data file is the result of the concatenation of the learning and test samples.

Download page: setup

Thursday, May 3, 2012

Using PDI-CE for model deployment (PMML)

Model deployment is a crucial task of the data mining process. In supervised learning, it can mean applying the predictive model to new unlabeled cases. We have already described this task for various tools (e.g. Tanagra, Sipina, Spad, R). They share a common feature: the same tool is used for the model construction and the model deployment.

In this tutorial, we describe a process where we do not use the same tool for the model construction and the model deployment. This is only possible if (1) the model is described in a standard format, and (2) the tool used for the deployment can handle both the database with the unlabeled instances and the model. Here, we use the PMML standard description for sharing the model, and PDI-CE (Pentaho Data Integration Community Edition) for applying the model to the unseen cases.

We create a decision tree with various tools such as SIPINA, KNIME or RAPIDMINER; we export the model in the PMML format; then, we use PDI-CE to apply the model to a data file containing unlabeled instances. We see that the use of the PMML standard dramatically enhances the power of both the data mining tool and the ETL tool.

In addition, we describe other solutions for deployment in this tutorial. We will see that Knime has its own PMML reader. It is able to apply a model to unlabeled datasets, whatever the tool used to build the model, as long as the PMML standard is respected. In this sense, Knime can be substituted for PDI-CE. Another possible solution is Weka, which is included in the Pentaho Community Edition suite and can export the model in a proprietary format that PDI-CE can handle.
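Although the tutorial builds the trees in SIPINA, KNIME and RapidMiner, the PMML export step itself can also be sketched in R with the pmml package (the data frame heart and the class disease are illustrative names; this R route is not covered in the tutorial):

library(rpart); library(pmml); library(XML)
tree <- rpart(disease ~ ., data = heart)   # decision tree to deploy
saveXML(pmml(tree), file = "tree.pmml")    # PMML file readable by PDI-CE or Knime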

Keywords: model deployment, predictive model, pmml, decision tree, rapidminer 5.0.10, weka 3.7.2, knime 2.1.1, sipina 3.4
Tutorial: en_Tanagra_PDI_Model_Deployment.pdf
Dataset: heart-pmml.zip
References:
Data Mining Group, "PMML standard"
Pentaho, "Pentaho Kettle Project"
Pentaho, "Using the Weka Scoring Plugin"

Sunday, April 22, 2012

Pentaho Data Integration - Kettle

The Pentaho BI Suite is an open source Business Intelligence suite with integrated reporting, dashboard, data mining, workflow and ETL capabilities (http://en.wikipedia.org/wiki/Pentaho).

In this tutorial, we talk about the Pentaho BI Suite Community Edition (CE), which is freely downloadable. More precisely, we present Pentaho Data Integration (PDI-CE), also called Kettle. We show briefly how to load a dataset and perform a simple data analysis. The main goal of this tutorial is to set the stage for a subsequent one, focused on the deployment of models designed with Knime, Sipina or Weka by using PDI-CE.

This document is based on the 4.0.1 stable version of PDI-CE.

Keywords: ETL, pentaho data integration, community edition, kettle, BI, business intelligence, data importation, data transformation, data cleansing
Tutorial: PDI-CE
Dataset: titanic32x.csv.zip
References:
Pentaho, Pentaho Community

Monday, April 9, 2012

Mining frequent itemsets

Searching for regularities in datasets is the main goal of data mining. These regularities may take various forms. In market basket analysis, we search for co-occurrences of goods (items), i.e. goods which are often purchased together. These are called "frequent itemsets". For instance, one result may be "milk and bread appear together in 10% of baskets".

Frequent itemset mining is often presented as a preliminary step of the association rule learning algorithm. At the end of that process, we highlight the direction of the relation and obtain rules. For instance, a rule may be "90% of the customers who buy milk and bread also purchase butter". This kind of rule can be used in various ways. For instance, we can promote the sales of milk and bread in order to increase the sales of butter.

In fact, frequent itemsets also provide valuable information by themselves. Detecting the goods which are purchased together enables us to understand the relationships between them. It is a kind of variant of clustering analysis: we search for the items which come together. For instance, we can use this kind of information to reorganize the shelves of the store.

In this tutorial, we describe the use of the FREQUENT ITEMSETS component in Tanagra. It is based on Borgelt's "apriori.exe" program. We use a very small dataset, which enables everyone to reproduce the calculations manually. But first, we give some definitions related to the frequent itemset mining process.
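The same extraction can be reproduced with the arules package listed in the keywords (the transactions file name is illustrative):

library(arules)
trans <- read.transactions("caddies.txt", format = "basket", sep = ",")
fi <- apriori(trans, parameter = list(supp = 0.1, target = "frequent itemsets"))
inspect(sort(fi, by = "support"))   # e.g. {milk, bread} with support >= 0.10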

Keywords: frequent itemsets, closed itemsets, maximal itemsets, generator itemsets, association rules, R software, arules package
Components: FREQUENT ITEMSETS
Tutorial: en_Tanagra_Itemset_Mining.pdf
Dataset: itemset_mining.zip
References:
C. Borgelt, "A priori - Association Rule Induction / Frequent Item Set Mining"
R. Lovin, "Mining Frequent Patterns"

Sunday, April 1, 2012

Sipina add-on for OOCalc

Combining a spreadsheet with data mining tools is essential for the popularity of the latter. Indeed, when we deal with a moderately sized dataset (thousands of rows and tens of variables), the spreadsheet is a practical tool for data preparation. It is also a valuable tool for the preparation of reports. It is thus not surprising that Excel, and more generally the spreadsheet, is one of the tools most used by data miners.

Both Tanagra and Sipina provide an add-on for Excel. The add-on inserts a data mining menu into the spreadsheet. The user can select the dataset and send it to Tanagra (or Sipina), which is automatically launched. But only Tanagra provides an add-on for Open Office Calc and Libre Office Calc. None was available for Sipina.

This omission has been corrected in the new version of Sipina (Sipina 3.9). In this tutorial, we show how to install and use the "SipinaLibrary.oxt" add-on for Open Office Calc 3.3.0 (OOCalc). The process is the same for Libre Office 3.5.1.

Keywords: calc, open office, libre office, oocalc, add-on, add-in, sipina
Tutorial: en_sipina_calc_addon.pdf
Dataset: heart.xls
References:
Tutoriel Tanagra - Sipina add-in for Excel
Tutoriel Tanagra - Tanagra add-on for Open Office Calc 3.3
Open Office - http://www.openoffice.org
Libre Office - http://www.libreoffice.org/

Thursday, March 29, 2012

Tanagra - Version 1.4.43

A few bugs have been fixed and some new features added.

The computation of the contributions of the individuals in PCA (PRINCIPAL COMPONENT ANALYSIS) has been corrected. It was not valid when working on a subsample of the data file. This error was reported by Mr. Gilbert Laffond.

The standardization of the factors after VARIMAX (FACTOR ROTATION) has been corrected so that their variance coincides with the sum of the squares of the correlations with the axes, and thus with the eigenvalue associated with the axis. This modification was suggested by Mr. Gilbert Laffond.

During the calculation of the confidence intervals of the PLS regression coefficients (PLS CONF. INTERVAL), an error could occur when the requested number of axes was greater than the number of predictor variables. It is now corrected. This error was reported by Mr. Alain Morineau.

In some circumstances, an error could occur in FISHER FILTERING, especially when Tanagra is run under Wine for Linux. We have introduced some additional checks. This error was reported by Mr. Bastien Barchiési.

The checking of missing values is now optional. Disabling it can be preferred for the treatment of very large files: we then recover the performance of 1.4.41 and previous versions.

The "COMPONENT / COPY RESULTS" menu sends information in HTML format. It is now compatible with the spreadsheet Calc of Libre Office 3.5.1. It was operating with the Excel spreadsheet only before. Curiously, the copy to the OOCalc (Open Office spreadsheet) is not possible at the present time (Open Office 3.3.0).

Download page: setup

Friday, March 23, 2012

Sipina - Version 3.9

The add-on "SipinaLibrary.oxt" has been added to the distribution. An additional menu is incorporated into the OOCalc spreadsheet. It enables the launch of SIPINA from a dataset (range of cells). The add-on works with Open Office (tested with version 3.3.0) and Libre Office (version 3.5.1).

Note that a similar add-on exists for Excel (sipina.xla). It provides a connection between Sipina and Excel.

Keywords: sipina, OOCalc, open office, libre office, add-on, add-in
Sipina website: Sipina
Download: Setup file
References:
Tanagra - SIPINA add-in for Excel
Tanagra - Tanagra add-in for Excel 2007 and 2010
Open Office - http://www.openoffice.org/
Libre Office - http://www.libreoffice.org/

Wednesday, March 21, 2012

RExcel, a bridge between Excel and R

Combining a specialized data mining tool with a spreadsheet is a very interesting idea. Most people know how to handle a spreadsheet such as Excel (but also LibreOffice Calc, Open Office Calc, Gnumeric, etc.). Spreadsheets are popular because they are very easy-to-use tools for data manipulation.

Many data mining tools can read the XLS or XLSX file formats. But it is even more interesting to implement a bridge between the data mining tool and Excel in a bidirectional way. Then we can easily conduct the whole analysis by navigating between the tools: transforming the variables in Excel, performing the analysis in the data mining tool, and post-processing the results in Excel.

In this tutorial, we describe the RExcel library for R. It adds a new menu to Excel. Thus, we can send a dataset to R on the one hand, and retrieve a dataset, or more generally a vector or a matrix, from R on the other hand. The tool is really easy to use.

Keywords: data importation, excel file format, xls, xlsx, addin, add-in, addon, add-on, multiple linear regression
Components: lm, stepAIC, predict
Tutorial: en_Tanagra_RExcel.pdf
Dataset: ventes_regression_rexcel.zip
References:
T. Baier, E. Neuwirth, "Powerful data analysis from inside your favorite application"

Sunday, March 4, 2012

PSPP, an alternative to SPSS

I spend a lot of time analyzing the available free statistical and data mining tools. There is no bad software, but some tools are more appropriate than others for certain tasks. Thus, we must identify the one which is best suited to our configuration. For that, we must know a large number of tools.

In this tutorial, we describe PSPP. It is presented as an alternative to the well-known SPSS: "PSPP is a program for statistical analysis of sampled data. It is a free replacement for the proprietary program SPSS, and appears very similar to it with a few exceptions". Instead of describing each feature in detail (the documentation is available on the website), we present some statistical techniques. We compare the results with those of Tanagra, R 2.13.2 and OpenStat (build 24/02/2012). This is also a way to validate them: if they provide different results, it means that there is a problem.

Keywords: pspp, R software, openstat, spss, descriptive statistics, t-test, welch test, comparison of means, comparison of variances, levene's test, chi-squared test, contingency table, cross tabs, analysis of variance, anova, multiple regression, roc curve, auc, area under curve
Components: MORE UNIVARIATE CONT STAT, GROUP CHARACTERIZATION, CONTINGENCY CHI-SQUARE, LEVENE'S TEST, T-TEST, T-TEST UNEQUAL VARIANCE, PAIRED T-TEST, ONE-WAY ANOVA, MULTIPLE LINEAR REGRESSION, ROC CURVE
Tutorial: en_Tanagra_PSPP.pdf
Dataset: autos_pspp.zip
References:
GNU PSPP, http://www.gnu.org/software/pspp/
R Project for Statistical Computing, http://www.r-project.org/
OpenStat, http://www.statprograms4u.com/

Friday, March 2, 2012

Regression analysis with LazStats (OpenStat)

LazStats is a statistical program developed by Bill Miller, the father of OpenStat, a tool well known to statisticians for many years. These are tools of the highest quality. OpenStat is one of the tools I use when I want to validate my own implementations.

Several variants of OpenStat are available. In this tutorial, we study LazStats. It is a version programmed in Lazarus, a development environment very similar to Delphi and based on the Pascal language. Projects developed in Lazarus benefit from the "write once, compile anywhere" principle: we write our program on one OS (e.g. Windows), but we can compile it on any OS for which Lazarus and the compiler are available (e.g. Linux). This idea was proposed by Borland with Kylix some years ago: we could build a project for both Windows and Linux. Unfortunately, Kylix was canceled. Lazarus seems more mature. In addition, it also enables us to compile the same project for the 32-bit and 64-bit versions of an OS.

In this tutorial, we present some of the functionality of LazStats for regression analysis.

Keywords: linear regression, multiple regression, variable selection, forward, backward, stepwise, simultaneous regression
Tutorial: en_Tanagra_Regression_LazStats.pdf
Dataset: conso_vehicules_lazstats.txt
References:
LazStats - http://www.statprograms4u.com/
Lazarus - http://www.lazarus.freepascal.org/

Sunday, February 19, 2012

Checking missing values in Tanagra

Up to version 1.4.41, Tanagra did not handle missing values, because it seemed interesting to force the students, who are the main users of Tanagra, to think about the problem and propose the solution most appropriate to the characteristics of their dataset and the goal of their analysis. Thus, Tanagra simply truncated the file to import at the first obstacle. This treatment often disconcerted users, especially since no error message was shown. They wondered why, whereas the conditions looked right, the data were not properly loaded.

From version 1.4.42, the importation of the text file format (tab separator) and of the XLS file format (Excel 97-2003), as well as the data transfer using the add-in for Excel (up to Excel 2010) and for LibreOffice 3.5 / OpenOffice 3.3, have been modified. Tanagra reads all the rows of the base, but skips the incomplete and/or inconsistent rows (e.g. a numeric value in a column corresponding to a discrete attribute). Above all, an explicit error message reports the number of deleted rows. Thus, users are better informed.

In this tutorial, we show the management of missing data when we send the data from Excel to Tanagra using the Tanagra.xla add-in. Some cells are empty in the Excel data range. This example illustrates the new behavior of Tanagra. We would get the same behavior if we imported the XLS file directly, or if we imported the corresponding file in TXT format.

Keywords: missing values, missing data, inconsistent values, text file format importation, excel file format importation, add-on, add-in, tanagra.xla
Components: DATASET, VIEW DATASET
Tutorial: en_Tanagra_Missing_Data_Checking.pdf
Dataset: ronflement_with_missing_empty.zip
References:
Wikipedia, "Listwise deletion".
D.C. Howell, "Treatment of missing data".

Friday, February 10, 2012

Logistic regression on large dataset

The programming of fast and reliable tools is a constant challenge for a computer scientist. In the data mining context, this translates into a better capacity to handle large datasets. When we build the final model that we want to deploy, speed is not really important. But in the exploratory phase, where we search for the best model, it is decisive. It improves our chances of obtaining the best model simply because we can try more configurations.

I have tried many solutions to improve the calculation time of logistic regression. In fact, I think the performance rests heavily on the optimization algorithm used. The source code of Tanagra shows that I hesitated a lot. Some studies helped me make the right choice.

Several tools offer logistic regression. It is interesting to compare their calculation times and memory occupation. I have already made this kind of comparison in the past. The novelty here is that I use a new operating system (the 64-bit version of Windows 7), and some tools are especially intended for this system. Their calculating capabilities are greatly improved. For this reason, I have increased the dataset size. Moreover, to make the variable selection process more difficult, I added predictive attributes that are correlated with the original descriptors, but not with the class attribute. They should not be selected in the final model.
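The added attributes can be sketched as follows in R (X_orig, a numeric matrix of the original descriptors, and y are illustrative names): each new attribute is a noisy copy of an original descriptor, hence correlated with it but carrying no extra information about the class.

X_noise <- X_orig + matrix(rnorm(nrow(X_orig) * ncol(X_orig), sd = 0.5),
                           nrow = nrow(X_orig))      # correlated, uninformative copies
colnames(X_noise) <- paste0(colnames(X_orig), "_noise")
dat <- data.frame(y = y, X_orig, X_noise)
system.time(m <- glm(y ~ ., data = dat, family = binomial))  # timing the regression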

In this paper, in addition to Tanagra 1.4.14 (32 bit), we use R 2.13.2 (64 bit), Knime 2.4.2 (64 bit), Orange 2.0b (build 15 oct2011, 32 bit) and Weka 3.7.5 (64 bit).

Keywords: logistic regression, software comparison, glm, stepAIC, R software, knime, orange, weka
Components: BINARY LOGISTIC REGRESSION, FORWARD LOGIT
Tutorial: en_Tanagra_Perfs_Bis_Logistic_Reg.pdf
Dataset: perfs_bis_logistic_reg.zip
References:
Tanagra, "Logistic regression - Software comparison", December 2008.
T.P. Minka, "A comparison of numerical optimizers for logistic regression", 2007.

Saturday, February 4, 2012

Tanagra - Version 1.4.42

The Tanagra.xla add-in for Excel now works with both the 32-bit and 64-bit versions of Excel.

With the FastMM memory manager, Tanagra can address up to 3 GB under 32-bit Windows and 4 GB under 64-bit Windows. The processing capabilities, especially about the handling of large datasets, are improved.

The importation of the tab-delimited text file format and of the XLS file format (Excel 97-2003) has been made safer. Previously, the importation was interrupted and the dataset truncated when an invalid line (with missing or inconsistent values) was read. Now, Tanagra skips the line and continues with the next rows. The number of skipped lines is given in the importation report.

Download page: setup

Wednesday, January 18, 2012

ARS into the SIPINA package

Association Rule Software (ARS) is a basic tool which extracts association rules from attribute-value datasets (categorical or binary attributes). It is distributed with the SIPINA package, which includes: a tool for the supervised learning framework, especially decision tree induction (SIPINA RESEARCH); a tool for linear regression (REGRESS); and ARS for association rule mining.

ARS automatically encodes the categorical attributes as dummy variables. If you want to use continuous attributes, you must discretize them beforehand.

This tutorial briefly describes the use of the Association Rule Software (ARS). Compared with the previous version, the GUI of the one incorporated into the SIPINA 3.8 package has been simplified.

Keywords: association rule mining, support, confidence, lift, conviction
Download: Sipina setup file
Tutorial: How to use ARS
References:
Wikipedia - Association rule learning

Sipina - Version 3.8

The tools (SIPINA RESEARCH, REGRESS and ASSOCIATION RULE SOFTWARE) included in the SIPINA distribution have been updated with some improvements.

SIPINA.XLA. The add-in for Excel now works with both the 32-bit and 64-bit versions of Excel.

Importation of text data files. Processing time has been improved. This also reduces the transfer time when we use the SIPINA.XLA add-in for Excel (which uses a temporary file in text format).

Association Rule Software. The GUI has been simplified; the display of the rules has been made more readable.

Because they are internally based on the FastMM memory manager, these tools can address up to 3 GB under 32-bit Windows and 4 GB under 64-bit Windows. The processing capabilities are improved.

Keywords: sipina, decision tree induction, association rule, multiple linear regression
Sipina website: Sipina
Download: Setup file
References:
Tanagra - SIPINA add-in for Excel
Tanagra - Tanagra add-in for Excel 2007 and 2010
Delphi Programming Resource - FastMM, a Fast Memory Manager

Monday, January 2, 2012

Tanagra website statistics for 2011

The year 2011 ends, 2012 begins. I wish you all a very happy year 2012.

A small report on the website statistics for the past year. All the sites (Tanagra, course materials, e-books, tutorials) have been visited 281,352 times this year, i.e. 770 visits per day. For comparison, we had 662 daily visits in 2010, 520 in 2009, and 349 in 2008.

Who are you? The majority of visits come from France and the Maghreb, followed by a large share from other French-speaking countries. Among the non-francophone countries, we observe mainly the United States, India, the UK, Italy, Brazil, Germany, ...

Which pages are visited? The most successful pages are those related to documentation about data mining: course materials, tutorials, links to other documents available online, etc. This is not really surprising. I myself spend more time writing booklets and tutorials, and studying the behavior of various software packages, including Tanagra.

Happy New Year 2012 to all.

Ricco.
Slideshow: Website statistics for 2011