Friday, January 18, 2013

New features for PCA in Tanagra

Principal Component Analysis (PCA) is a very popular dimension reduction technique. The aim is to produce a small number of factors which summarize as well as possible the information in the data. The factors are linear combinations of the original variables. From a certain point of view, PCA can be seen as a compression technique.

Determining the appropriate number of factors is a difficult problem in PCA. Various approaches are possible; there is no real state-of-the-art method. The only way to proceed is to try different approaches in order to obtain a clear indication about the right solution. We showed how to program them under R in a recent paper. These techniques are now incorporated into Tanagra 1.4.45. We have also added the KMO index (Measure of Sampling Adequacy, MSA) and Bartlett's test of sphericity to the Principal Component Analysis tool.
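
To give a flavour of one of these stopping rules, the broken-stick approach keeps the k-th factor as long as its share of the total variance exceeds the expected size of the k-th largest piece of a stick broken at random into p parts. Here is a minimal sketch of that rule, in Python for illustration rather than the R used in the tutorials, and independent of Tanagra's actual implementation (the function name is ours):

```python
import numpy as np

def broken_stick(eigenvalues):
    """Keep component k while its share of the total variance exceeds
    the expected share of the k-th largest piece of a randomly broken stick."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    p = len(eigenvalues)
    shares = eigenvalues / eigenvalues.sum()
    # expected share of the k-th largest piece: b_k = (1/p) * sum_{i=k}^{p} 1/i
    expected = np.array([sum(1.0 / i for i in range(k, p + 1)) / p
                         for k in range(1, p + 1)])
    keep = 0
    for observed, threshold in zip(shares, expected):
        if observed > threshold:
            keep += 1
        else:
            break
    return keep
```

For eigenvalues (3.0, 1.0, 0.5, 0.3, 0.2), a single component is retained: the second factor's share (20%) falls below the expected share of the second piece (about 25.7%).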

In this tutorial, we present these new Tanagra features on a realistic example. To check our implementation, we compare our results with those of SAS PROC FACTOR when an equivalent is available.

Keywords: principal component analysis, pca, sas, proc princomp, proc factor, bartlett's test of sphericity, R software, scree plot, cattell, kaiser-guttman, karlis saporta spinaki, broken stick approach, parallel analysis, randomization, bootstrap, correlation, partial correlation, varimax, factor rotation, variable clustering, msa, kmo index, correlation circle
Components: PRINCIPAL COMPONENT ANALYSIS, CORRELATION SCATTERPLOT, PARALLEL ANALYSIS, BOOTSTRAP EIGENVALUES, FACTOR ROTATION, SCATTERPLOT, VARHCA
Tutorial: en_Tanagra_PCA_New_Tools.pdf
Dataset: beer_pca.xls
References:
Tanagra - "Principal Component Analysis (PCA)"
Tanagra - "VARIMAX rotation in Principal Component Analysis"
Tanagra - "PCA using R - KMO index and Bartlett's test"
Tanagra - "Choosing the number of components in PCA"

Saturday, January 12, 2013

Choosing the number of components in PCA

Principal Component Analysis (PCA) is a dimension reduction technique. We obtain a set of factors which summarize, as well as possible, the information available in the data. The factors (or components) are linear combinations of the original variables.

Choosing the right number of factors is a crucial problem in PCA. If we select too many factors, we include noise from the sampling fluctuations in the analysis. If we choose too few factors, we lose relevant information and the analysis is incomplete. Unfortunately, there is no indisputable approach for determining the number of factors. As a rule of thumb, we should select only the interpretable factors, knowing that this choice depends heavily on domain expertise. And yet, such expertise is not always available; indeed, we often rely on the data analysis itself to gain a better knowledge of the studied domain.

In this tutorial, we present various approaches for determining the right number of factors for a PCA based on the correlation matrix. Some of them, such as the Kaiser-Guttman rule or the scree plot method, are very popular even if they are not really statistically sound; others seem more rigorous, but are seldom if ever used because they are not available in popular statistical software suites.

First, we use Tanagra and an Excel spreadsheet to implement some of the methods; then, especially for the resampling-based approaches, we write R programs based on the results of the princomp() procedure.
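
One of the resampling-based approaches, parallel analysis, compares each observed eigenvalue with the distribution of eigenvalues obtained from random data of the same dimensions, and keeps a component only while its eigenvalue exceeds the chosen quantile of that distribution. Here is a minimal sketch of the idea, in Python rather than the R code developed in the tutorial (the function name and the 95% cut-off are our choices):

```python
import numpy as np

def parallel_analysis(X, n_iter=200, quantile=0.95, seed=0):
    """Keep components whose correlation-matrix eigenvalue exceeds the
    given quantile of eigenvalues computed from random normal data of
    the same size (Horn's parallel analysis)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    rand = np.empty((n_iter, p))
    for b in range(n_iter):
        Z = rng.standard_normal((n, p))
        rand[b] = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
    thresholds = np.quantile(rand, quantile, axis=0)
    keep = 0
    for obs, thr in zip(observed, thresholds):
        if obs > thr:
            keep += 1
        else:
            break
    return keep
```

On simulated data where the five variables share a single common factor, the procedure retains one component: only the first observed eigenvalue rises above what pure sampling fluctuation can produce.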

Keywords: principal component analysis, factor analysis, pca, princomp, R software, bartlett's test of sphericity, xlsx package, scree plot, kaiser-guttman rule, broken-stick method, parallel analysis, randomization, bootstrap, correlation, partial correlation
Components: PRINCIPAL COMPONENT ANALYSIS, LINEAR CORRELATION, PARTIAL CORRELATION
Tutorial: en_Tanagra_Nb_Components_PCA.pdf
Dataset: crime_dataset_pca.zip
References:
D. Jackson, “Stopping Rules in Principal Components Analysis: A Comparison of Heuristical and Statistical Approaches”, in Ecology, 74(8), pp. 2204-2214, 1993.
P. Peres-Neto, D. Jackson, K. Somers, “How Many Principal Components? Stopping Rules for Determining the Number of Non-trivial Axes Revisited”, in Computational Statistics & Data Analysis, 49(4), pp. 974-997, 2005.
Tanagra - "Principal Component Analysis (PCA)"
Tanagra - "VARIMAX rotation in Principal Component Analysis"
Tanagra - "PCA using R - KMO index and Bartlett's test"

Monday, January 7, 2013

PCA using R - KMO index and Bartlett's test

Principal Component Analysis (PCA) is a dimension reduction technique. We obtain a set of factors which summarize, as well as possible, the information available in the data. The factors are linear combinations of the original variables. The approach can handle only quantitative variables.

We have presented PCA in previous tutorials. In this paper, we describe in detail two indicators used to check whether running a PCA on a dataset is worthwhile: Bartlett's sphericity test and the KMO index. They are directly available in some commercial tools (e.g. SAS or SPSS). Here, we describe the formulas and show how to program them under R. We compare the results obtained with those of SAS on a dataset.
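
Both indicators can be sketched in a few lines. Bartlett's statistic is -(n - 1 - (2p + 5)/6) ln det(R), compared with a chi-square distribution with p(p - 1)/2 degrees of freedom; the overall KMO index opposes squared correlations and squared partial correlations. This is an illustrative Python version of these standard formulas (using scipy for the p-value), not the R code developed in the tutorial, and the function names are ours:

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """Bartlett's test of sphericity: H0 says the correlation matrix is
    the identity, i.e. the variables are uncorrelated and PCA is useless."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    stat = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    df = p * (p - 1) // 2
    return stat, chi2.sf(stat, df)

def kmo_index(X):
    """Overall KMO (MSA): sum of squared correlations against sum of
    squared correlations plus squared partial correlations (off-diagonal)."""
    R = np.corrcoef(X, rowvar=False)
    A = np.linalg.inv(R)
    # partial correlations derived from the inverse correlation matrix
    P = -A / np.sqrt(np.outer(np.diag(A), np.diag(A)))
    off = ~np.eye(R.shape[0], dtype=bool)
    r2 = np.sum(R[off] ** 2)
    p2 = np.sum(P[off] ** 2)
    return r2 / (r2 + p2)
```

On strongly correlated data the test rejects sphericity with a very small p-value and the KMO index is well above the usual 0.5 acceptability threshold; on independent variables both indicators signal that a PCA would be pointless.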

Keywords: principal component analysis, pca, spss, sas, proc factor, princomp, kmo index, msa, measure of sampling adequacy, bartlett's sphericity test, xlsx package, psych package, R software
Components: VARHCA, PRINCIPAL COMPONENT ANALYSIS
Tutorial: en_Tanagra_KMO_Bartlett.pdf
Dataset: socioeconomics.zip
References:
Tanagra - "Principal Component Analysis (PCA)"
Tanagra - "VARIMAX rotation in Principal Component Analysis"
SPSS - "Factor algorithms"
SAS - "The Factor procedure"