Tanagra - Data Mining and Data Science Tutorials: December 2009

Thursday, December 24, 2009

VARIMAX rotation in Principal Component Analysis

A VARIMAX rotation is a change of coordinates used in principal component analysis (PCA) that maximizes the sum of the variances of the squared loadings. As a result, the coefficients (squared correlations between variables and factors) tend to be either large or near zero, with few intermediate values.

The goal is to associate each variable with at most one factor, which simplifies the interpretation of the PCA results. As far as possible, the variables are split into disjoint sets, each one tied to a single factor.

In this tutorial, we show how to perform this kind of rotation from the results of a standard PCA in Tanagra.
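
To make the criterion concrete, here is a minimal sketch of a raw varimax rotation written with the Python standard library only. It is not Tanagra's implementation: the function names `varimax_criterion` and `varimax` are hypothetical, the rotation is done by successive planar rotations of factor pairs, and no Kaiser normalization of the rows is applied, which is an assumption of this sketch.

```python
import math

def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    p = len(loadings)
    total = 0.0
    for j in range(len(loadings[0])):
        squared = [row[j] ** 2 for row in loadings]
        mean = sum(squared) / p
        total += sum((s - mean) ** 2 for s in squared) / p
    return total

def varimax(loadings, max_sweeps=50, tol=1e-10):
    """Raw varimax: rotate each pair of factors by the angle that
    maximizes the criterion for that pair, until the angles vanish."""
    rotated = [list(row) for row in loadings]
    p, k = len(rotated), len(rotated[0])
    for _ in range(max_sweeps):
        largest_angle = 0.0
        for a in range(k - 1):
            for b in range(a + 1, k):
                u = [r[a] ** 2 - r[b] ** 2 for r in rotated]
                v = [2.0 * r[a] * r[b] for r in rotated]
                sum_u, sum_v = sum(u), sum(v)
                num = 2.0 * (sum(ui * vi for ui, vi in zip(u, v)) - sum_u * sum_v / p)
                den = (sum(ui * ui - vi * vi for ui, vi in zip(u, v))
                       - (sum_u ** 2 - sum_v ** 2) / p)
                phi = 0.25 * math.atan2(num, den)  # optimal planar angle
                c, s = math.cos(phi), math.sin(phi)
                for r in rotated:
                    x, y = r[a], r[b]
                    r[a], r[b] = x * c + y * s, -x * s + y * c
                largest_angle = max(largest_angle, abs(phi))
        if largest_angle < tol:
            break
    return rotated

# Illustrative loadings (made up): a simple structure mixed by a 30 degree rotation.
theta = math.radians(30)
simple = [[0.9, 0.0], [0.8, 0.0], [0.0, 0.7], [0.0, 0.9]]
mixed = [[x * math.cos(theta) + y * math.sin(theta),
          -x * math.sin(theta) + y * math.cos(theta)] for x, y in simple]
recovered = varimax(mixed)
print(varimax_criterion(mixed), varimax_criterion(recovered))
```

The rotation is orthogonal, so the communality of each variable (the row sum of squared loadings) is unchanged; only the way it is shared among the factors changes.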

Keywords: PCA, principal component analysis, VARIMAX, QUARTIMAX
Components: Principal Component Analysis, Factor Rotation
Tutorial: en_Tanagra_Pca_Varimax.pdf
Dataset: crime_dataset_from_DASL.xls
References:
Tanagra, "New features for PCA in Tanagra"
Tanagra, "Principal Component Analysis (PCA)"
Wikipedia, "Varimax rotation"
H. Abdi, "Factor rotations in Factor Analyses"

Sunday, December 20, 2009

Kruskal–Wallis one-way analysis of variance

The tests for comparison of populations try to determine whether K (K ≥ 2) samples come from the same underlying population with respect to a dependent variable (X). In other words, we try to determine whether the underlying distribution of X is the same whatever the group.

We speak of non parametric tests when we make no assumption about the shape of the distribution of the dependent variable. They are regarded as "distribution free" methods, as opposed to the parametric approaches.

In this tutorial, we implement various tests for differences in location. The Kruskal-Wallis test is certainly the most widely used when we try to determine whether the scores among groups are stochastically the same, but other tests exist, and we compare the results they give. We complete the analysis with multiple comparisons in order to identify the groups that differ significantly from each other.
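
As an illustration of the Kruskal-Wallis test, the sketch below computes the H statistic from the pooled ranks, with midranks and the usual correction for ties. It uses only the Python standard library; the function names `midranks` and `kruskal_wallis` and the example scores are hypothetical and are not taken from the tutorial or from Tanagra.

```python
def midranks(values):
    """Ranks from 1 to N; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = average
        i = j + 1
    return ranks

def kruskal_wallis(groups):
    """Return (H corrected for ties, degrees of freedom)."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = midranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)
    counts = {}
    for x in pooled:
        counts[x] = counts.get(x, 0) + 1
    correction = 1.0 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction, len(groups) - 1

# Made-up scores for three groups.
h, df = kruskal_wallis([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(h, df)
```

Under the null hypothesis, and for samples that are not too small, H is compared with a chi-square distribution with K - 1 degrees of freedom.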

Keywords: non parametric test, independent samples, Kruskal-Wallis, Van der Waerden, Fisher-Yates-Terry-Hoeffding, median test, tests for differences in location
Components: KRUSKAL-WALLIS 1-WAY ANOVA, MEDIAN TEST, VAN DER WAERDEN 1-WAY ANOVA, FYTH 1-WAY ANOVA
Tutorial: en_Tanagra_Nonparametric_Test_KW_and_related.pdf
Dataset: wine_evaluation_nonparametric.xls
References:
R. Lowry, « Concepts and Applications of Inferential Statistics », SubChapter 14a. The Kruskal-Wallis Test for 3 or More Independent Samples.
Wikipedia. Kruskal–Wallis one-way analysis of variance.

Thursday, December 17, 2009

Tests for differences in scale

Parametric and non parametric tests for differences in scale.

The tests of equal variability (or dispersion, or scale, or simply variance) are often presented as a preliminary step before the comparison of means, in order to verify the homoscedasticity assumption. But this is not their only purpose. Comparing dispersions can be an end in itself. For example, suppose we wish to compare the performance of two heating systems. The average temperature at the center of the room is the same; however, we may want to compare how the heat spreads to the different parts of the room.

The parametric tests are based primarily on the Gaussian distribution. The test then becomes a test for homogeneity of variance. We highlight the Levene test in this tutorial; other tests exist (the Bartlett test for instance), and we mention them as well.

When the normality assumption is questionable, when the sample size is small, or when the variable is ordinal rather than continuous, it is more appropriate to use non parametric tests. These are called tests for equality of scales or dispersions, because the procedures are not based on estimated variances. We use well-known techniques such as the Ansari-Bradley test, the Mood test and the Klotz test. Being non parametric, they have a broader scope. Some of these tests have a drawback: they are not applicable when the conditional distributions do not share the same parameter of central tendency (the median in general, although we can adjust the values by centering them on the median).
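
To illustrate the idea behind the Levene test, here is a minimal sketch, assuming the classical definition of the statistic: a one-way ANOVA F ratio computed on the absolute deviations of each value from the center of its group. The function name `levene_statistic` and the data are hypothetical. Passing `center=median` gives the Brown-Forsythe variant.

```python
from statistics import mean, median

def levene_statistic(groups, center=mean):
    """Levene's W: ANOVA F ratio on absolute deviations from the group center."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    deviations = []
    for g in groups:
        c = center(g)
        deviations.append([abs(x - c) for x in g])
    group_means = [mean(z) for z in deviations]
    grand_mean = sum(sum(z) for z in deviations) / n
    between = sum(len(z) * (m - grand_mean) ** 2
                  for z, m in zip(deviations, group_means))
    within = sum((v - m) ** 2
                 for z, m in zip(deviations, group_means) for v in z)
    # A zero within-group sum of squares would make the ratio undefined.
    return (n - k) / (k - 1) * between / within

# Made-up measurements: the second group is twice as spread out as the first.
w = levene_statistic([[1, 2, 3], [2, 4, 6]])
w_bf = levene_statistic([[1, 2, 3], [2, 4, 6]], center=median)
print(w, w_bf)
```

Under the null hypothesis of equal variances, W is compared with a Fisher distribution with (K - 1, N - K) degrees of freedom.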
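
A rank-based scale test can be sketched in a few lines. The example below computes the Ansari-Bradley statistic under two simplifying assumptions that belong to this sketch and not to the tutorial: there are no ties in the pooled sample, and the two samples share the same median. The function name `ansari_bradley_statistic` is hypothetical.

```python
def ansari_bradley_statistic(x, y):
    """Return (AB, expected AB under H0) for the first sample.

    Each pooled observation gets the score min(rank, N + 1 - rank), so the
    extreme values receive small scores. AB is the sum of the scores of x.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    values = [v for v, _ in pooled]
    if len(set(values)) != len(values):
        raise ValueError("this sketch does not handle ties")
    n = len(pooled)
    scores = [min(rank, n + 1 - rank) for rank in range(1, n + 1)]
    ab = sum(s for s, (_, label) in zip(scores, pooled) if label == 0)
    expected = len(x) * sum(scores) / n
    return ab, expected

# Made-up data: x lies at both extremes, so it looks more dispersed than y.
print(ansari_bradley_statistic([1, 6], [2, 3, 4, 5]))
```

A value of AB well below its expectation suggests that the first sample is more dispersed than the second, and a value well above suggests the opposite.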

In this tutorial, we show how to implement these various tests with Tanagra.

Keywords: parametric test, non parametric test, independent samples, Levene test, Bartlett test, Brown-Forsythe test, Mood test, Klotz test, Ansari-Bradley test
Components: LEVENE’S TEST, ANSARI-BRADLEY SCALE TEST, MOOD SCALE TEST, KLOTZ SCALE TEST
Tutorial: en_Tanagra_Nonparametric_Test_for_Scale_Differences.pdf
Dataset: tests_for_scale_differences.xls
References:
NIST, "Quantitative techniques", section 1.3.5 - http://www.itl.nist.gov/div898/handbook/eda/section3/eda35.htm

Wednesday, December 9, 2009

Outliers and influential points in regression

The analysis of outliers and influential points is an important step of regression diagnostics. The goal is to detect (1) the points that are very different from the others (outliers), i.e. that do not seem to belong to the analyzed population; and (2) the points whose removal leads to a different model (influential points). The distinction between these two kinds of points is not always obvious.

In this tutorial, we implement several indicators for the analysis of outliers and influential points. To avoid confusion about the definitions of the indicators (some of them are calculated differently from one tool to another), we compare our results with those of state-of-the-art tools such as SAS and R. In a first step, we give the results described in the SAS documentation. In a second step, we describe the process and the results under Tanagra and R. In conclusion, we note that these tools give the same results.
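
Because the definitions vary between tools, it helps to write the indicators out explicitly. The sketch below uses the textbook formulas for a simple regression with one predictor and an intercept (p = 2 estimated coefficients), which is a simplification chosen for this example. The function name `influence_measures`, the dictionary keys and the data are hypothetical; the sketch does not reproduce the output of SAS, R or Tanagra.

```python
import math

def influence_measures(x, y):
    """Leverage, studentized residuals, Cook's distance and DFFITS
    for the least squares line y = a + b * x."""
    n, p = len(x), 2  # p counts the intercept and the slope
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance
    rows = []
    for xi, e in zip(x, residuals):
        h = 1.0 / n + (xi - x_bar) ** 2 / sxx  # leverage
        internal = e / math.sqrt(s2 * (1.0 - h))  # internally studentized
        # Externally studentized: the variance is estimated without the point.
        external = internal * math.sqrt((n - p - 1) / (n - p - internal ** 2))
        rows.append({
            "leverage": h,
            "internal": internal,
            "external": external,
            "cook": internal ** 2 * h / (p * (1.0 - h)),
            "dffits": external * math.sqrt(h / (1.0 - h)),
        })
    return rows

# Made-up data: the last point is far from the others on the x axis.
x = [1, 2, 3, 4, 5, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 25.0]
for row in influence_measures(x, y):
    print({key: round(value, 3) for key, value in row.items()})
```

The leverages always sum to p, so a common rule of thumb flags the points whose leverage exceeds 2p/n.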

Keywords: linear regression, outliers, influential points, standardized residuals, studentized residuals, leverage, dffits, cook's distance, covratio, dfbetas, R software
Components: Multiple linear regression, Outlier detection, DfBetas
Tutorial: en_Tanagra_Outlier_Influential_Points_for_Regression.pdf
Dataset: USPopulation.xls
References:
SAS STAT User’s Guide, « The REG Procedure – Predicted and Residual Values »

Monday, December 7, 2009

Tests for comparing two related samples

Dependent samples, also called related samples or correlated samples, occur when the response of the nth person in the second sample is partly a function of the response of the nth person in the first sample. There are several common forms of sample dependency. (1) Before-after and other studies in which the same people are surveyed at different points in time, including panel studies. (2) Matched-pairs studies in which each subject of the study is paired with a subject of a comparison group on the basis of matching factors (e.g. age, sex, income, etc.). (3) The pairs can simply be inherent in the situation we are trying to analyze. For instance, we may want to compare the time spent watching television by the man and the woman within a couple. The blocks are naturally the households, and the men and women should not be considered as independent observations.

The aim of tests for related samples is to exclude the within-group variation from the analysis. The differences are computed within each pair of subjects. In this tutorial, we show how to implement three tests for two related samples. Two of them are non-parametric (the sign test and the Wilcoxon matched-pairs signed-ranks test); the last one is the parametric t-test for related samples.
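
The three tests all start from the within-pair differences. The sketch below, written with the Python standard library only, computes an exact two-sided sign test p-value, the Wilcoxon sum of positive ranks and the paired t statistic. The function name `paired_tests`, the returned keys and the measurements are hypothetical. Zero differences are dropped for the two non-parametric tests, which is one common convention and an assumption here.

```python
import math

def paired_tests(first, second):
    """Sign test p-value, Wilcoxon W+ and paired t statistic."""
    d = [b - a for a, b in zip(first, second)]
    nonzero = [v for v in d if v != 0]
    m = len(nonzero)

    # Sign test: the number of positive signs follows Binomial(m, 1/2) under H0.
    n_pos = sum(1 for v in nonzero if v > 0)
    k = min(n_pos, m - n_pos)
    sign_p = min(1.0, 2.0 * sum(math.comb(m, i) for i in range(k + 1)) / 2 ** m)

    # Wilcoxon: rank the absolute differences (midranks for ties), sum the positive ones.
    ordered = sorted(nonzero, key=abs)
    w_plus, i = 0.0, 0
    while i < m:
        j = i
        while j + 1 < m and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        w_plus += average_rank * sum(1 for v in ordered[i:j + 1] if v > 0)
        i = j + 1

    # Paired t-test: one-sample t statistic on all the differences.
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((v - mean_d) ** 2 for v in d) / (n - 1)
    t = mean_d / math.sqrt(var_d / n)
    return {"sign_p": sign_p, "w_plus": w_plus, "t": t, "df": n - 1}

# Made-up before/after measurements on five subjects.
print(paired_tests([10, 12, 9, 14, 11], [12, 15, 9, 17, 15]))
```

The t statistic is compared with a Student distribution with n - 1 degrees of freedom; W+ is compared with its exact null distribution, or with a normal approximation when the number of pairs is large enough.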

Keywords: parametric test, non-parametric test, paired samples, sign test, wilcoxon signed rank test, paired samples t-test, normality test
Components: SIGN TEST, WILCOXON SIGNED RANK TEST, PAIRED T-TEST, FORMULA, NORMALITY TEST
Tutorial: en_Tanagra_Nonparametric_Test_for_Two_Related_Samples.pdf
Dataset: comparison_2_related_samples.xls
References:
R. Lowry, « Concepts and Applications of Inferential Statistics », SubChapter 12a. The Wilcoxon Signed-Rank Test.

Wednesday, December 2, 2009

Multivariate tests for comparing populations

Multivariate parametric hypothesis testing for comparing populations.

A multivariate test for comparison of populations tries to determine whether K (K ≥ 2) samples come from the same underlying population with respect to a set of variables of interest (X1,…,Xp).

We speak of a parametric test when we assume that the data come from a given family of probability distributions. The inference then relies on the parameters of the distribution. For instance, if we assume that the data are drawn from a multivariate Gaussian distribution, the hypothesis testing relies on the mean vector or on the covariance matrix.
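
For two groups, the comparison of mean vectors can be illustrated with the two-sample Hotelling's T2 statistic. The sketch below assumes equal covariance matrices in both groups (the homoscedastic version) and uses only the Python standard library. The function names `solve_linear` and `hotelling_t2` and the data are hypothetical and are not taken from the tutorial.

```python
def solve_linear(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))) / m[i][i]
    return w

def hotelling_t2(sample1, sample2):
    """Return (T2, F, (df1, df2)) using the pooled covariance matrix."""
    n1, n2, p = len(sample1), len(sample2), len(sample1[0])
    mean1 = [sum(row[j] for row in sample1) / n1 for j in range(p)]
    mean2 = [sum(row[j] for row in sample2) / n2 for j in range(p)]
    pooled = [[0.0] * p for _ in range(p)]
    for sample, mean in ((sample1, mean1), (sample2, mean2)):
        for row in sample:
            for i in range(p):
                for j in range(p):
                    pooled[i][j] += (row[i] - mean[i]) * (row[j] - mean[j])
    pooled = [[v / (n1 + n2 - 2) for v in row] for row in pooled]
    diff = [a - b for a, b in zip(mean1, mean2)]
    w = solve_linear(pooled, diff)  # pooled^-1 * diff without explicit inversion
    t2 = n1 * n2 / (n1 + n2) * sum(d * wi for d, wi in zip(diff, w))
    f = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return t2, f, (p, n1 + n2 - p - 1)

# Made-up data: two groups of four observations on two variables.
group1 = [(0, 0), (2, 0), (0, 2), (2, 2)]
group2 = [(2, 0), (4, 0), (2, 2), (4, 2)]
print(hotelling_t2(group1, group2))
```

Under the null hypothesis of equal mean vectors, and under multivariate normality, the transformed statistic F follows a Fisher distribution with (p, n1 + n2 - p - 1) degrees of freedom.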

Keywords: Hotelling's T2, Wilks' Lambda, Box’s M test, Bartlett's test, mean vector, covariance matrix, MANOVA
Components: UNIVARIATE CONTINUOUS STAT, HOTELLING’S T2, HOTELLING’S T2 HETEROSCEDASTIC, BOX’S M TEST, ONE-WAY MANOVA
Tutorial: en_Tanagra_Multivariate_Parametric_Tests.pdf
Dataset: credit_approval.xls
References:
S. Rathburn, A. Wiesner, "STAT 505: Applied Multivariate Statistical Analysis", The Pennsylvania State University.