Tanagra - Data Mining and Data Science Tutorials: 2011

Saturday, December 31, 2011

Tanagra add-in for Excel 2010 - 64-bit version

The current Tanagra.xla add-in is valid to the 32-bit version of Excel (up to Excel 2010), even if we are working under 64-bit version of Windows. It does not operate on the other hand if we want to connect the 64-bit version of Excel to Tanagra. We must modify the add-in source code. These modifications are needed up to 1.4.41 version of Tanagra. They will be automatically introduced for the upcoming versions.

In this tutorial, we show the procedure to be followed for this upgrade. The screenshots have been achieved under a French version of Excel 2007 here, but I think (I hope) that the adaptation to other versions (Excel 2010 and/or other languages) is easy.

Thank you very much to Mrs. Nathalie Jourdan-Salloum which has pointed out this problem and has suggested to me the right solution.

Keywords: data importation, xls, xlsx, excel file format, macro-complémentaire, add-in, addin, add-on
Tutorial: en_Tanagra_Addin_Excel_64_bit.pdf
References:
Tanagra, "Tanagra add-in for Office 2007 and Office 2010".

Sunday, December 11, 2011

Dealing with very large dataset (continuation)

Because I have recently updated my operating system (OS), I am wondering how the 64-bit versions of Knime 2.4.2 and RapidMiner 5.1.011 could handle a very large dataset, which cannot be loaded into main memory on a 32-bit OS. This article completes a previous study where we deal with a moderate sized dataset with 500,000 instances and 22 variables. Here, we handle a dataset with 9,634,198 instances and 41 variables. We have already used this dataset in another tutorial. We showed that we cannot perform a decision tree induction on this kind of database without a swapping system, which is implemented into the SIPINA, on a 32-bit OS. We note that Tanagra can handle the dataset, but this is because it encodes the values of the categorical attributes with a byte. The memory occupation remains moderate.

In this tutorial, I analyze the behavior of the 64-bit Knime and RapidMiner on this database. I use 64-bit OS and tools, but I have "only" 4 GB of available memory on my personal computer.

Keywords: very large dataset, decision tree, sampling, sipina, knime, rapidminer
Components: ID3
Tutorial: en_Tanagra_Tree_Very_Large_Dataset.pdf
Dataset: twice-kdd-cup-discretized-descriptors.zip
References:
Tanagra, "Dealing with very large dataset in Sipina".
Tanagra, "Decision tree and large dataset (continuation)".
Tanagra, "Decision tree and large dataset".
Tanagra, "Local sampling for decision tree learning".

Saturday, October 29, 2011

Decision tree and large dataset (continuation)

One of the exciting aspects of computing is that things are changing very quickly. The machines are ever more efficient, the operating systems are improved, the software also. Since writing an old tutorial about the induction of decision tree on a large dataset, I have a new computer and I use a 64 bit OS (Windows 7). Some of the tools studied propose a 64 bit version (Knime, RapidMiner, R). I wonder how behave the various tools in this new context. To do that, I renewed the same experiment.

We note that a more efficient computer allows to improve the computation time (about 20%). The specific gain for a 64 bit version is relatively low, but it is real (about 10%). And some tools are clearly improved their programming of the decision tree induction (Knime, RapidMiner). On the other hand, we observe that the memory occupation remains stable for the most of the tools in the new context.

Keywords: c4.5, decision tree, large dataset, wave dataset, knime2.4.2, orange 2.0b, r 2.13.2, rapidminer 5.1.011, sipina 3.7, tanagra 1.4.41, weka 3.7.4, windows 7 - 64 bits
Components: SUPERVISED LEARNING, C4.5
Tutorial: en_Tanagra_Perfs_Comp_Decision_Tree_Suite.pdf
Screenshots : Experiment screenshots.
Dataset: wave500k.zip
References:
Tanagra, "Decision tree and large dataset".
R. Quinlan, « C4.5 : Programs for Machine Learning », Morgan Kaufman, 1993.

Sunday, September 25, 2011

A PRIORI PT updated

A PRIORI PT is a tool dedicated for the extraction of association rules. This is one of the few components of Tanagra based on external library. We use the Borgelt's "apriori.exe" program. Until the version 1.4.40 of Tanagra, we used the 4.31 version of "apriori.exe". From the Tanagra 1.4.41, we introduce the latest update 5.57 (2011/09/02). Even if the settings of the tool are slightly modified, we observe that the extracted rules and the readings of the results are identical.

We take again a former tutorial to describe the behavior of this component (Association Rule Learning using A PRIORI PT). Thus, we do not detail the construction of the diagram here. We try above all to highlight the improvement of the library, especially about the computation time. We observe that this improvement is really impressive.

Keywords: association rule, large dataset
Components: A priori PT
Tutorial: en_Tanagra_AprioriPT_Updated.pdf
Dataset: assoc_census.zip
Reference: C. Borgelt, "A priori - Association Rule Induction / Frequent Item Set Mining"

Thursday, September 22, 2011

Tanagra - Version 1.4.41

A PRIORI PT. This component generates association rules. It is based on the Borgelt's apriori.exe program which has been recently updated (2011/09/02 - 5.57 version). The improvement of this new version, in terms of calculation time, is impressive.

FREQUENT ITEMSETS. Also based on the Borgelt's apriori.exe program (version 5.57), this component generates frequent (or closed, maximum, generators) itemsets.

Some tutorials are coming soon to describe the use of these new tools.

Donwload page : setup

Tuesday, September 20, 2011

New GUI for RapidMiner 5.0

RapidMiner is a very popular data mining tool. It is (one of) the most used by the data miners according to the annual Kdnuggets polls (2011, 2010, 2009, 2008, 2007). There are two versions. We describe here the Community Edition which freely downloadable from the editor's website.

The new RapidMiner 5.0 has a new graphical user interface which is very similar to that of Knime. The organization of the workspace is the same. The sequence of data processing (using operators) is described with a diagram called "process" into the RapidMiner documentation. In fact, this version 5.0 joined the presentation adopted by the vast majority of data mining software. Some features are shared with many tools, among others: the connection to the R software; the meta-nodes which implements a loop or a standard succession of operations; the description of the methods underlying operators which is continuously in the right part of the main window.

RapidMiner 5.0 having evolved substantially (compared with previous versions e.g. the version 4.6 described in one of our tutorials). I thought it was appropriate to study this in detail, evaluating its behavior in the context of a standard data mining analysis. We want to implement the following process: (1) creating a decision tree from a labeled dataset; (2) exporting the model (the classification tree) into a external file (PMML format) in order to a deployment thereafter; (3) assessing the model performance using a cross-validation resampling scheme; (4) applying the model on a set of unlabeled instances, the results, i.e. the values of the descriptors and the assigned class, must be exported into a CSV file. These are standard data mining tasks. We have described them in many tutorials. We want to check if it is easy to implement them with this new version of RapidMiner. Indeed, with the previous version, defining some sequences of operations was complicated. Implementing a cross-validation for instance was not really intuitive.

Keywords: rapidminer, knime, cross-validation, decision tree, classification tree, deployment
Tutorial: en_Tanagra_RapidMiner_5.pdf
Dataset: adult_rapidminer.zip
References:
Rapid-I, "RapidMiner"
Knime, "Knime Desktop"

Monday, September 19, 2011

Regression model deployment

Model deployment is one of the main objectives of the data mining process. We want to apply a model learned on a training set on unseen cases i.e. any people coming from the population. In the classification framework, the aim is to assign to the instance its class value from their description [e.g. Apply a classifier on a new dataset (Deployment)]. In the clustering framework, we try to detect the group which is as similar as possible to the instance according their characteristics (e.g. K-Means - Classification of a new instance).

We are concerned about the regression framework here . The aim is to predict the values of the dependent variable for unseen instances (or unlabeled instances) from the observed values on the independent variables. The process is rather basic if we handle a linear regression model. We apply the computed parameters on the unseen instances. But, it becomes difficult when we want to treat more complex models such as support vector regression with nonlinear kernels, or the models elaborated from a combination of techniques (e.g. regression from the factors of a principal component analysis). In this context, it is essential that the deployment process is directly ensured by the data mining tool.

With Tanagra, it is possible to easily deploy the regression models, even when they are the result of a combination of technique. Simply, we must prepare the data file in a particular way. In this tutorial, we describe below how to organize the data file in order to deploy various models in an unified framework: a linear regression model, a PLS regression model, a support vector regression model with a RBF (radial basis function) kernel, a regression tree model , a regression model from the factors of a principal component analysis. Then, we export the results (the predicted values for the dependent variable) in a new data file. Last, we check if the predicted values are similar according to the various models.

Keywords: model deployment, linear regression, pls regression, support vector regression, SVR, regression tree, cart, principal component analysis, pca, regression of factor scores
Components: MULTIPLE LINEAR REGRESSION, PLS REGRESSION, PLS SELECTION, C-RT REGRESSION TREE, EPSILON SVR, PRINCIPAL COMPONENT ANALYSIS, RECOVER EXAMPLES, EXPORT DATASET, LINEAR CORRELATION
Tutorial: en_Tanagra_Multiple_Regression_Deployment.pdf
Dataset: housing.xls
References :
R. Rakotomalala, Régression linéaire multiple - Diaporama (in French)

Saturday, August 27, 2011

Data Mining with R - The Rattle Package

R (http://www.r-project.org/) is one of the most exciting free data mining software projects of these last years. Its popularity is absolutely justified (see Kdnuggets Polls - Data Mining/ Analytic Tools Used - 2011). Among the reasons which explain this success, we distinguish two very interesting characteristics: (1) we can extend almost indefinitely the features of the tool with the packages; (2) we have a programming language which allows to perform easily sequences of complex operations.

But this second property can be also a drawback. Indeed, some users do not want to learn a new programming language before being able to realize projects. For this reason, tools which allow to define the sequence of commands with diagrams (such as Tanagra, Knime, RapidMiner, etc.) still remain a valuable alternative with the data miners.

In this tutorial, we present the "Rattle" package which allows to the data miners to use R without needing to know the associated programming language. All the operations are performed with simple clicks, such as for any software driven by menus. But, in addition, all the commands are stored. We can save them in a file. Then, in a new working session, we can easily repeat all the operations. Thus, we find one of the important properties which miss to the tools driven by menus.

To describe the use of the rattle package, we perform an analysis similar to the one suggested by the rattle's author in its presentation paper (G.J. Williams, " Rattle : A Data Mining GUI for R", in The R Journal, volume 1 / 2, pages 45-55, December 2009). We perform the following steps: loading the data file; partitioning the instances into learning and test samples; specifying the types of the variables (target or input); computing some descriptive statistics; learning the predictive models from the learning sample; assessing the models on the test sample (confusion matrix, error rate, some curves).

Keywords: R software, R project, rpart, random forest, glm, decision tree, classification tree, logistic regression
Tutorial: en_Tanagra_Rattle_Package_for_R.pdf
Dataset: heart_for_rattle.txt
References:
Togaware, "Rattle"
CRAN, "Package rattle - Graphical user interface for data mining in R"
G.J. Williams, "Rattle: A Data Mining GUI for R", in The R Journal, Vol. 1/2, pages 45--55, december 2009.

Monday, August 22, 2011

Predictive model deployment with R (filehash)

Model deployment is the last task of the data mining steps. It corresponds to several aspects e.g. generating a report about the data exploration process, highlighting the useful results; applying models within an organization's decision making process; etc .

In this tutorial, we look at the context of predictive data mining. We are concerned about the construction of the model from a labeled dataset; the storage of the model; the distribution of the model, without the dataset used for its construction; the application of the model on new instances in order to assign them a class label from their description (the values of the descriptors).

We describe the filehash package for R which allows to deploy a model easily. The main advantage of this solution is that R can be launched under various operating systems. Thus, we can create a model with R under Windows; and apply the model in another environment, for instance with R under Linux. The solution can be easily generalized on a large scale because it is possible to launch R in batch mode. The update of the system will concern only the model file in the future.

We will write three R programs to distinguish the steps of the deployment process. The first one constructs a model from the dataset and stores it into a binary file (filehash format). The second one loads the model in another R session and uses it to label new instances from a second data file. The predictions are stored in a data file (CSV file format). Last, the third program loads the predictions and another data file containing the observed labels for these instances, and calculates the confusion matrix and the generalization error rate.

We use various predictive models in order to check the flexibility of the solutions. We tried the following ones: decision tree (rpart); logistic regression (glm); linear discriminant analysis (lda); linear discriminant analysis from factors of principal component analysis (lda + pca). This last one allowed to check if the system remains operational when we manipulate a combination of models.

Keywords: R software, filehash package, deployment, predictive model, rpart, lda, pca, glm, decision tree, linear discriminant analysis, logistic regression, principal component analysis, linear discriminant analysis on latent variables
Tutorial: en_Tanagra_Deploying_Predictive_Models_with_R.pdf
Dataset: pima-model-deployment.zip
References:
R package, "Filehash : Simple key-value database"
Kdnuggets, "Data mining deployment Poll"

Thursday, August 18, 2011

REGRESS into the SIPINA package

Few people know it. In fact, several tools are installed when we launch the SETUP file of SIPINA (setup_stat_package.exe). This is the case of REGRESS which is intended to multiple linear regression.

Even if a multiple linear regression procedure is incorporated to Tanagra, REGRESS can be useful essentially because it is very easy to use. It has the advantage of being very easy to handle while being consistent with a degree course in Econometrics. As such, it may be useful for anyone wishing to learn about the regression without too much get involved in the learning of a new software.

Keywords: regress, econometrics, multiple linear regression, outliers, influential points, normality tests, residuals, Jarque-Bera test, normal probability plot, sipina.xla, add-in
Tutorial: en_sipina_regress.pdf
Dataset: ventes-regression.xls
References:
R. Rakotomalala, "Econométrie - Régression Linéaire Simple et Multiple".
D. Garson, "Multiple regression".

Sunday, August 14, 2011

PLS Regression - Software comparison

Comparing the behavior of tools is always a good way to improve them.

To check and validate the implementation of methods. The validation of the implemented algorithms is an essential point for data mining tools. Even if two programmers use the same references (books, articles), the programming choice can modify the behavior of the approach (behaviors according to the interpretation of the convergence conditions for instance). The analysis of the source code is possible solution. But, if it is often available for free software, this is not the case for commercial tools. Thus, the only way to check them is to compare the results provided by the tools on a benchmark dataset . If there are divergences, we must explain them by analyzing the formulas used.

To improve the presentation of results. There are certain standards to observe in the production of reports, consensus initiated by reference books and / or leader tools in the field. Some ratios should be presented in a certain way. Users need reference points.

Our programming of the PLS approach is based on the Tenenhaus book (1998) which, itself, make reference to the SIMCA-P tool. Using the access to a limited version of this software (version 11), we have check the results provided by Tanagra on various datasets. We show here the results of the study on the CARS dataset. We extend the comparison to other data mining tools.

Keywords: pls regression, software comparison, simca-p, spad, sas, r software, pls package
Components: PLSR, VIEW DATASET, CORRELATION SCATTERPLOT, SCATTERPLOT WITH LABEL
Tutorial: en_Tanagra_PLSR_Software_Comparison.pdf
Dataset: cars_pls_regression.xls
References :
M. Tenenhaus, « La régression PLS – Théorie et pratique », Technip, 1998.
D. Garson, « Partial Least Squares Regression », from Statnotes: Topics in Multivariate Analysis.
UMETRICS. SIMCA-P for Multivariate Data Analysis.

Saturday, August 6, 2011

The CART method under Tanagra and R (rpart)

CART (Breiman and al., 1984) is a very popular classification tree (says also decision tree) learning algorithm. Rightly. CART incorporates all the ingredients of a good learning control: the post-pruning process enables to make the trade-off between the bias and the variance; the cost complexity mechanism enables to "smooth" the exploration of the space of solutions; we can control the preference for simplicity with the standard error rule (SE-rule); etc. Thus, the data miner can adjust the settings according to the goal of the study and the data characteristics.

The Breiman's algorithm is provided under different designations in the free data mining tools. Tanagra uses the "C-RT" name. R, through a specific package , provides the "rpart" function.

In this tutorial, we describe these implementations of the CART approach according to the original book (Breiman and al., 1984; chapters 3, 10 and 11). The main difference between them is the implementation of the post-pruning process. Tanagra uses a specific sample says "pruning set" (section 11.4); when rpart is based on the cross-validation principle (section 11.5) .

Keywords: decision tree, classification tree, recursive partitioning, cart, R software, rpart package
Components: DISCRETE SELECT EXAMPLES, C-RT, SUPERVISED LEARNING, TEST
Tutorial: en_Tanagra_R_CART_algorithm.pdf
Dataset: wave5300.xls
References:
Breiman, J. Friedman, R. Olsen, C. Stone, Classification and Regression Trees, Chapman & Hall, 1984.
"The R project for Statistical Computing" - http://www.r-project.org/

Sunday, July 24, 2011

PLS Discriminant Analysis - A comparative study

PLS regression is a regression technique usually designed to predict the values taken by a group of Y variables (target variables, dependent variables) from a set of variables X (descriptors, independent variables). Initially defined for the prediction of continuous target variable, the PLS regression can be adapted to the prediction of one discrete variable - i.e. adapted to the supervised learning framework - in different ways . The approah is called "PLS Discriminant Analysis" in this context. It incorporates the valuable qualities that we know usually into this new framework: the ability to process a representation space with very high dimensionality, a large number of noisy and / or redundant descriptors.

This tutorial is the continuation of a precedent paper dedicated to the presentation of some variants of the PLS-DA. We describe the behavior of one of them (PLS-LDA - PLS Linear Discriminant Analysis) on a learning set where the number of descriptors is moderately high (278 descriptors) in relation to the number of instances (232 instances). Even if the number of descriptors is not really very high, we note in our experiment a valuable characteristic of the PLS approach: we can control the variance of the classifier by adjusting the number of latent variables.

To assess this idea, we compare the behavior of the PLS-LDA with state-of-the-art supervised learning methods such as K-nearest neighbors , SVM (Support Vector Machine from the LIBSVM library ), the Breiman's Random Forest approach , or the Fisher's Linear Discriminant Analysis .

Keywords: pls regression, linear discriminant analysis, supervised learning, support vector machine, SVM, random forest, nearest neighbor
Components: K-NN, PLS-LDA, BAGGING, RND TREE, C-SVC, TEST, DISCRETE SELECT EXAMPLES, REMOVE CONSTANT
Tutorial: en_Tanagra_PLS_DA_Comparaison.pdf
Dataset: arrhytmia.bdm
References :
S. Chevallier, D. Bertrand, A. Kohler, P. Courcoux, « Application of PLS-DA in multivariate image analysis », in J. Chemometrics, 20 : 221-229, 2006.
Garson, « Partial Least Squares Regression (PLS) », http://www2.chass.ncsu.edu/garson/PA765/pls.htm

Sunday, July 17, 2011

Tanagra add-on for OpenOffice Calc 3.3

Tanagra add-on for OpenOffice 3.3 and LibreOffice 3.4.

The connection with spreadsheet applications is certainly a factor of success for Tanagra. It is easy to manipulate a dataset into OpenOffice Calc (up to version 3.2) and send it to Tanagra using the TanagraLibrary.zip extension for further analysis .

Recently, users have reported to me that the mechanism did not work with recent versions of OpenOffice (version 3.3) and LibreOffice (version 3.4). I realized that, rather than a correction, it was more appropriate to elaborate a new module which meets the standard for managing extensions of these tools. The new library "TanagraModule.oxt" is now incorporated into the distribution.

This tutorial describes how to install and to use this add-on under OpenOffice Calc 3.0. The adaptation to LibreOffice 3.4 is very easy.

Keywords : data importation, spreadsheet application, openoffice, libreoffice, add-in, add-on, excel
Component : View Dataset
Tutorial : en_Tanagra_Addon_OpenOffice_LibreOffice.pdf
Dataset : breast.ods
Références :
Tutoriel Tanagra, "OOo Calc file handling using an add-in"
Tutoriel Tanagra, "Launching Tanagra from OOo Calc under Linux"

Tuesday, July 5, 2011

Tanagra - Version 1.4.40

Few improvements for this new version.

A new addon for the connection between Tanagra and the recent version of OpenOffice Calc spreadsheet has been created. The old one did not work for recent versions - OpenOffice 3.3 and LibreOffice 3.4. During the installation process, another library was added ("TanagraModule.oxt") to not interfere with the old, still functional for previous versions of Open Office (3.2 and earlier). A tutorial describing its installation and its utilization will be put online soon. I take this opportunity to highlight again how a privileged connection between a spreadsheet and a specialized tool for Data Mining is convenient. The annual poll organized by the kdnuggets.com website shows the interest of this connection (2011, 2010, 2009,...). We note that there is a similar addon for the R software (R4Calc). This change was suggested by Jérémy Roos (OpenOffice) and Franck Thomas (LibreOffice).

The non-standardized ACP is now available. It is possible to implement unchecking the option of standardization of the data in the Principal Component Analysis component. Change suggested by Elvire Antanjan.

Simultaneous regression was introduced. It is very similar to the method programmed into LazStats, which is unfortunately more accessible freely now. The approach is described in a free booklet online "Practice of linear regression analysis" (in French) (section 3.6).

The color codes according to the p-value have been introduced for the Linear Correlation component. Change suggested by Samuel KL.

Once again, thank you very much to all those who help me to improve this work by their comments or suggestions.

Donwload page : setup

Thursday, May 26, 2011

Tanagra - Version 1.4.39

Some minor corrections for the Tanagra 1.4.39 version.

For the PCA (principal component analysis) component, when we ask all the factors, none are generated. Reported by Jérémy Roos.

In the previous 1.4.38 version, the results of Multinomial Logistic Regression are not consistent with the tutorial on the website. The calculations are wrong. Reported by Nicole Jurado.

It is now possible to obtain the scores from the PLS-DA component (Partial Least Squares Regression - Discriminant Analysis). Reported by Carlos Serrano.

All these bugs are corrected in the 1.4.39 version. Once again, thank you very much to all those who help me to improve this work by their comments or suggestions.

Donwload page : setup

Tuesday, April 5, 2011

Mining Association Rule from Transactions File

Association rule learning is a popular method for discovering interesting relations between variables in large databases. It was often used in market basket analysis domain. But in fact, it can be implemented in various areas where we want to discover the associations between variables. The association is described by a "IF THEN" rule. The IF part is called "antecedent" of the rule; the THEN part correspond to the "consequent" e.g. IF onions AND potatoes THAN burger (http://en.wikipedia.org/wiki/Association_rule_learning) i.e. if a customer buys onions and potatoes then he buys also burger.

It is possible to find co-occurrences in the standard attribute - value tables that are handled with the most of the data mining tools. In this context, the rows correspond to the baskets (transactions); the columns correspond to the list of all possible products (items); at the intersection of the row and the column, we have an indicator (true/false or 1/0) which indicates if the item belongs to the transaction. But this kind of representation is too naive. A few products are incorporated in each basket. Each row of the table contains a few 1 and many 0. The size of the data file is unnecessarily excessive. Therefore, another data representation, says "transactions file", is often used to minimize the data file size. In this tutorial, we treat a special case of the transactions file. The principle is based on the enumeration of the items included in each transaction. But in our case, we have only two values for each row of the data file: the transaction identifier, and the item identifier. Thus, each transaction can be listed on several rows of the data file.

This data representation is quite natural considering the problem we want to treat. It also has the advantage of being more compact since only the items really present in each transaction are enumerated. However, it appears that many tools do not know manage directly this kind of data representation. We observe curiously a distinction between professional tools and the academic ones. The first ones can handle directly this kind of data file without special data preparation. This is the case of SPAD 7.3 and SAS Enterprise Miner 4.3 that we study in this tutorial. On the other hand, the academic tools need a data transformation, prior the importation of the dataset. We use a small program written in VBA (Visual Basic for Applications) under Excel to prepare the dataset. Thereafter, we perform the analysis with Tanagra 1.4.37 and Knime 2.2.2 (Note: a reader told me that we can transform the dataset with Knime without the utilization of external program. This is true. I will describe this approach in a separate section at the end of this tutorial).

Attention, we must respect the original specifications i.e. focus only on rules indicating the simultaneous presence of items in transactions. We must not, consecutively to a bad "presence - absence" coding scheme, to generate rules outlining the simultaneous absence of some items. This may be interesting in some cases may be, but this is not the purpose of our analysis.

Keywords: association rules, a priori algorithm
Components: A priori
Tutorial: en_Tanagra_Assoc_Rule_Transactions.pdf
Dataset: assoc_rule_transactions.zip
References:
Tanagra Tutorials, "Association rule learning from transaction dataset"
P.N. Tan, M. Steinbach, V. Kumar, « Introduction to Data Mining », Addison Wesley, 2006 ; chapitre 6, « Association analysis : Basic Concepts and Algorithms ».
Wikipedia - "Association rule learning"

Sunday, February 20, 2011

Multiple Regression - Reading the results

The aim of the multiple regression is to predict the values of a continuous dependent variable Y from a set of continuous or binary independent variables (X1,..., Xp).

In this tutorial, we want to model the relationship between the cars consumption and their weight, engine-size and horsepower. We describe the outputs of Tanagra by associating them with the used formulas. We highlight the importance of the unscaled covariance matrix of the estimated coefficients [(X'X)-1] (Tanagra 1.4.38 and later). It is used for the subsequent analysis: individual significance of coefficients, simultaneous significance of several coefficients, testing linear combinations of coefficients, computation of the standard error for the prediction interval. These analyses are performed into the Excel spreadsheet.

Thereafter, we perform the same analyses with the R software. We identify the objects provided by the lm(.) procedure that we can use in the same context.

Keywords: linear regression, multiple regression, R software, lm, summary.lm, testing significance, prediction interval
Components: MULTIPLE LINEAR REGRESSION
Tutorial: en_Tanagra_Multiple_Regression_Results.pdf
Dataset: cars_consumption.zip
References :
D. Garson, "Multiple Regression"

Friday, February 4, 2011

Tanagra - Version 1.4.38

Some minor corrections for the Tanagra 1.4.38 version.

The color codes for the normality tests have been harmonized (Normality Test). In some configurations, the colors associated with p-values were not consistent, it could misleading the users. This problem has been reported by Lawrence M. Garmendia.

Following indications from Mr. Oanh Chau, I realized that the standardization of variables to the HAC (hierarchical agglomerative clustering) was based on the sample standard deviation. This is not an error in itself. But the sum of index of level into the dendrogram does not consistent with the TSS (total sum of squares). This is unwelcome. The difference is especially noticeable on small dataset, it disappears when the dataset size increases. The correction has been introduced. Now the BSS ratio is equal to 1 when we have the trivial partition i.e. one individual per group.

Multiple linear regression (MULTIPLE LINEAR REGRESSION) displays the matrix (X'X) ^ (-1). It allows to deduce the variance covariance matrix of coefficients (by multiplying the matrix by the estimated variance of the error). It can be also used in the generalized tests for the model coefficients.

Last, the outputs of the descriptive discriminant analysis (CANONICAL DISCRIMINANT ANALYSIS) were improved. The group centroids (Group centroids) on the factorial axes are directly provided.

Thank you very much to all those who help me to improve this work by their comments or suggestions.

Download page: setup

Tuesday, January 4, 2011

Tanagra website statistics for 2010

The year 2010 ends, 2011 begins. I wish you all a very happy year 2011.

A small statistical report on the website statistics for the past year. All sites (Tanagra, course materials, e-books, tutorials) has been visited 241,765 times this year, 662 visits per day. For comparison, we had 520 daily visits in 2009 and 349 in 2008.

Who are you? The majority of visits come from France and Maghreb (62%). Then there are a large part of French speaking countries. In terms of non-francophone countries, we observe mainly the United States, India, UK, Germany, Brazil,...

Which pages are visited? The pages that are most successful are those that relate to documentation about the Data Mining: course materials, tutorials, links to other documents available on line, etc.. This is hardly surprising. I take more time myself to write booklets and tutorials, to study the behavior of different software, of which Tanagra.

Happy New Year 2011 to all.

Ricco.
Slideshow: Website statistics for 2010