Tanagra - Data Mining and Data Science Tutorials: February 2011

Sunday, February 20, 2011

Multiple Regression - Reading the results

The aim of multiple regression is to predict the values of a continuous dependent variable Y from a set of continuous or binary independent variables (X1, ..., Xp).

In this tutorial, we model the relationship between the fuel consumption of cars and their weight, engine size and horsepower. We describe the outputs of Tanagra and relate them to the formulas used to compute them. We highlight the importance of the unscaled covariance matrix of the estimated coefficients, (X'X)^(-1) (available in Tanagra 1.4.38 and later). It is used in the subsequent analyses: individual significance of coefficients, simultaneous significance of several coefficients, tests of linear combinations of coefficients, and computation of the standard error for the prediction interval. These analyses are performed in an Excel spreadsheet.

Thereafter, we perform the same analyses with the R software. We identify the objects provided by the lm(.) procedure that we can use in the same context.
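
As a rough illustration of how (X'X)^(-1) feeds the quantities listed above, here is a minimal sketch in plain Python. It is neither the Excel workbook nor the R session from the tutorial. The data are made up for the example (they are not the cars_consumption dataset), and the helper names (transpose, matmul, invert, ols) are hypothetical.

```python
from math import sqrt

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def invert(m):
    """Gauss-Jordan inversion with partial pivoting (adequate for small matrices)."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ols(x_rows, y):
    """Return the coefficients, (X'X)^-1 and the estimated error variance."""
    X = [[1.0] + list(r) for r in x_rows]      # prepend the intercept column
    n, k = len(X), len(X[0])                   # k = p + 1 parameters
    Xt = transpose(X)
    xtx_inv = invert(matmul(Xt, X))            # unscaled covariance matrix
    coef = [row[0] for row in matmul(xtx_inv, matmul(Xt, [[v] for v in y]))]
    resid = [yi - sum(c * xi for c, xi in zip(coef, row)) for yi, row in zip(y, X)]
    sigma2 = sum(e * e for e in resid) / (n - k)
    return coef, xtx_inv, sigma2

# Made-up illustrative data: two predictors, six observations.
x_rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0), (6.0, 5.0)]
y = [6.1, 6.9, 13.2, 13.8, 20.1, 21.0]

coef, xtx_inv, sigma2 = ols(x_rows, y)

# Individual significance: standard error and t statistic of each coefficient.
std_err = [sqrt(sigma2 * xtx_inv[j][j]) for j in range(len(coef))]
t_stats = [c / s for c, s in zip(coef, std_err)]

# Standard error for the prediction interval at a new point x0 (intercept included).
x0 = [1.0, 3.5, 3.5]
leverage = sum(x0[i] * xtx_inv[i][j] * x0[j] for i in range(3) for j in range(3))
se_pred = sqrt(sigma2 * (1.0 + leverage))

print("coefficients:", coef)
print("standard errors:", std_err)
print("t statistics:", t_stats)
print("prediction standard error:", se_pred)
```

The t statistics would be compared with a Student distribution with n - p - 1 degrees of freedom, and the prediction interval is obtained as the point prediction plus or minus the corresponding quantile times se_pred.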

Keywords: linear regression, multiple regression, R software, lm, summary.lm, testing significance, prediction interval
Components: MULTIPLE LINEAR REGRESSION
Tutorial: en_Tanagra_Multiple_Regression_Results.pdf
Dataset: cars_consumption.zip
References:
D. Garson, "Multiple Regression"

Friday, February 4, 2011

Tanagra - Version 1.4.38

Some minor corrections for the Tanagra 1.4.38 version.

The color codes for the normality tests have been harmonized (Normality Test). In some configurations, the colors associated with the p-values were not consistent, which could mislead users. This problem was reported by Lawrence M. Garmendia.

Following indications from Mr. Oanh Chau, I realized that the standardization of variables for HAC (hierarchical agglomerative clustering) was based on the sample standard deviation. This is not an error in itself, but the sum of the level indices in the dendrogram was then not consistent with the TSS (total sum of squares), which is undesirable. The difference is especially noticeable on small datasets and vanishes as the dataset size increases. This has been corrected: the BSS ratio is now equal to 1 for the trivial partition, i.e. one individual per group.
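
The effect of the denominator can be sketched with a few lines of Python. This is only an illustration on made-up values, not Tanagra's code, and the helper name standardize is hypothetical. After standardization, the sum of squares of a variable is n when the population standard deviation (denominator n) is used, and n - 1 when the sample standard deviation (denominator n - 1) is used. If the reference TSS is taken to be n per variable, which is an assumption here since the post does not spell out how it is defined, the second choice falls short by a factor (n - 1)/n, a gap that shrinks as n grows.

```python
from math import sqrt

def standardize(values, ddof):
    """Center and scale; ddof=0 uses the population SD, ddof=1 the sample SD."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - ddof))
    return [(v - mean) / sd for v in values]

# Made-up values for one variable.
values = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
n = len(values)

ss_population = sum(z * z for z in standardize(values, ddof=0))  # equals n
ss_sample = sum(z * z for z in standardize(values, ddof=1))      # equals n - 1

print("n =", n)
print("sum of squares, population SD:", ss_population)
print("sum of squares, sample SD:", ss_sample)
print("ratio:", ss_sample / ss_population)  # (n - 1) / n, close to 1 for large n
```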

Multiple linear regression (MULTIPLE LINEAR REGRESSION) now displays the matrix (X'X)^(-1). From it, one can deduce the variance-covariance matrix of the coefficients (by multiplying it by the estimated variance of the error). It can also be used in generalized tests on the model coefficients.
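
For reference, the standard textbook formulas behind these uses can be written as follows. The notation is chosen for this note and is not taken from Tanagra's output: \hat{\beta} is the vector of estimated coefficients (intercept included), n the number of observations, p the number of independent variables, and R\beta = r a set of q linear constraints to test. The distributions hold under the classical assumptions (normal, homoscedastic, independent errors) and, for the tests, under the null hypothesis.

```latex
\hat{\beta} = (X'X)^{-1} X'Y, \qquad
\hat{\sigma}_{\varepsilon}^{2} = \frac{\sum_{i=1}^{n} \hat{\varepsilon}_i^{2}}{n - p - 1}

\widehat{V}(\hat{\beta}) = \hat{\sigma}_{\varepsilon}^{2} \, (X'X)^{-1}, \qquad
t_j = \frac{\hat{\beta}_j}{\sqrt{\hat{\sigma}_{\varepsilon}^{2} \, [(X'X)^{-1}]_{jj}}} \sim t(n - p - 1)

F = \frac{(R\hat{\beta} - r)' \, [R (X'X)^{-1} R']^{-1} \, (R\hat{\beta} - r)}{q \, \hat{\sigma}_{\varepsilon}^{2}} \sim F(q,\, n - p - 1)
```

The first line gives the estimates, the second the variance-covariance matrix and the individual test of a coefficient, and the third the generalized test, which covers both the simultaneous significance of several coefficients and tests of linear combinations.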

Finally, the outputs of the descriptive discriminant analysis (CANONICAL DISCRIMINANT ANALYSIS) have been improved. The group centroids (Group centroids) on the factorial axes are now provided directly.

Many thanks to all those who help me improve this work with their comments and suggestions.

Download page: setup