As a factor analysis approach, we usually use principal component analysis (PCA) when the active variables are quantitative, and multiple correspondence analysis (MCA) when they are all categorical. But what should we do when we have a mix of these two types of variables?
A possible strategy is to discretize the quantitative variables and use MCA. But this procedure is not recommended if we have a small dataset (few instances), or if the number of qualitative variables is low in comparison with the number of quantitative ones. In addition, discretization implies a loss of information, and neither the choice of the number of intervals nor the calculation of the cut points is obvious.
Another possible strategy is to replace each qualitative variable by a set of dummy variables (a 0/1 indicator for each category of the variable to recode), and then use PCA. This strategy has a drawback: because the dispersions of the quantitative variables and of the indicator variables are not comparable (the variance of a 0/1 indicator depends on the frequency of its category), the results are biased.
Jérôme Pagès' "Factor Analysis for Mixed Data" (2004) [AFDM in French] relies on this second idea, but it introduces an additional refinement. It uses dummy variables, but instead of the 0/1 coding, it uses 0/x values, where 'x' is computed from the frequency of the corresponding category of the qualitative variable. We can therefore use a standard PCA program to carry out the analysis (Pagès, 2004; page 102). The calculation process is thus well controlled. But the interpretation of the results requires a little extra effort, since it differs depending on whether we study the role of a quantitative or a qualitative variable.
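The 0/x coding can be sketched in a few lines of Python. This is only an illustration under stated assumptions: we take x = 1/sqrt(p_k), where p_k is the relative frequency of category k, which is our understanding of the coding in Pagès (2004); the data and the function names are hypothetical and are not taken from AUTOS2005AFDM.txt.

```python
from collections import Counter
from math import sqrt
from statistics import fmean, pvariance

def standardize(values):
    """Center a quantitative variable and scale it to unit (population) variance."""
    mean = fmean(values)
    sd = sqrt(pvariance(values))
    return [(v - mean) / sd for v in values]

def code_qualitative(labels):
    """One column per category, coded 0 / x with x = 1/sqrt(p_k) (assumed), then centered."""
    n = len(labels)
    columns = {}
    for category, count in Counter(labels).items():
        p_k = count / n                      # relative frequency of the category
        x = 1.0 / sqrt(p_k)                  # value used instead of 1
        raw = [x if lab == category else 0.0 for lab in labels]
        mean = fmean(raw)
        columns[category] = [v - mean for v in raw]
    return columns

# Hypothetical toy data
power = [60.0, 75.0, 90.0, 110.0, 150.0, 200.0]
fuel = ["petrol", "diesel", "petrol", "diesel", "petrol", "petrol"]

z_power = standardize(power)
z_fuel = code_qualitative(fuel)

var_power = pvariance(z_power)                               # variance of the quantitative variable
var_fuel = {cat: pvariance(col) for cat, col in z_fuel.items()}
total_fuel = sum(var_fuel.values())                          # total variance of the qualitative variable
```

With this coding, each indicator column has variance 1 - p_k, so a qualitative variable with K categories contributes K - 1 in total, which is comparable to the unit variance of a standardized quantitative variable. The recoded columns can then be passed to an ordinary (unstandardized) PCA.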
In this tutorial, we show how to perform an AFDM with Tanagra 1.4.46 and R 2.15.1 (FactoMineR package). We emphasize the reading of the results: to interpret the factors, we must study the influence of the quantitative and the qualitative variables simultaneously.
Keywords: PCA, principal component analysis, MCA, multiple correspondence analysis, AFDM, correlation, correlation ratio, FactoMineR package, R software
Components: AFDM, SCATTERPLOT WITH LABEL, CORRELATION SCATTERPLOT, VIEW MULTIPLE SCATTERPLOT
Tutorial: en_Tanagra_AFDM.pdf
Dataset: AUTOS2005AFDM.txt
References :
Jérôme Pagès, « Analyse Factorielle de Données Mixtes », Revue de Statistique Appliquée, tome 52, n°4, 2004 ; pages 93-111.
Sunday, March 31, 2013
Saturday, March 2, 2013
Correspondence Analysis - Tools comparison
Correspondence analysis (or factorial correspondence analysis) is an exploratory technique which makes it possible to detect the salient associations in a two-way contingency table. It proposes an attractive graphical display where the rows and the columns of the table are depicted as points. Thus, we can visually identify the similarities and the differences between the row profiles (or between the column profiles). We can also detect the associations between rows and columns.
Correspondence analysis (CA) can be viewed as an approach that decomposes the chi-squared statistic associated with a two-way contingency table into orthogonal factors. In fact, because CA is a descriptive technique, it can be applied to tables even if the chi-square test of independence is not appropriate. The only restrictions are that the table must contain positive or zero values, that the row and column sums can be meaningfully computed, and that the row and column profiles can be interpreted.
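The decomposition of the chi-squared statistic can be checked numerically. The following Python sketch uses NumPy and a hypothetical 3 x 3 contingency table (not the tutorial's dataset): the squared singular values of the matrix of standardized residuals are the eigenvalues of the CA, and their sum (the total inertia) equals the chi-squared statistic divided by the grand total.

```python
import numpy as np

# Hypothetical contingency table (rows x columns)
N = np.array([[20, 10, 5],
              [10, 15, 10],
              [5, 10, 25]], dtype=float)

n = N.sum()                    # grand total
P = N / n                      # correspondence matrix
r = P.sum(axis=1)              # row masses
c = P.sum(axis=0)              # column masses

# Pearson chi-squared statistic of the table
expected = n * np.outer(r, c)
chi2 = ((N - expected) ** 2 / expected).sum()

# Standardized residuals, then singular value decomposition
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
singular_values = np.linalg.svd(S, compute_uv=False)

eigenvalues = singular_values ** 2      # inertia carried by each factor
total_inertia = eigenvalues.sum()       # should equal chi2 / n
```

At most min(rows, columns) - 1 eigenvalues are non-zero, so this 3 x 3 table yields two factors.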
Correspondence analysis can also be viewed as a factorial technique. Factors are latent variables defined as linear combinations of the row profiles (or column profiles). We can use the factor score coefficients to calculate the coordinates of supplementary rows or columns.
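As an illustration of the projection of a supplementary row, here is a minimal Python sketch using NumPy. It assumes the usual transition formula: the coordinates of a row are obtained by multiplying its profile by the standard coordinates of the columns. The table, the extra row, and the function name project_row are hypothetical.

```python
import numpy as np

# Hypothetical active table
N = np.array([[20, 10, 5],
              [10, 15, 10],
              [5, 10, 25]], dtype=float)
P = N / N.sum()
r = P.sum(axis=1)              # row masses
c = P.sum(axis=0)              # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

k = min(N.shape) - 1           # number of non-trivial factors
row_principal = (U[:, :k] * sv[:k]) / np.sqrt(r)[:, None]   # active row coordinates
col_standard = Vt[:k].T / np.sqrt(c)[:, None]               # factor score coefficients

def project_row(counts):
    """Coordinates of a row (active or supplementary) from its profile."""
    profile = np.asarray(counts, dtype=float)
    profile = profile / profile.sum()
    return profile @ col_standard

supplementary = project_row([8, 12, 20])   # hypothetical extra row
check_active = project_row(N[0])           # an active row must fall on its own coordinates
```

Projecting an active row reproduces its coordinates from the analysis, which is a convenient sanity check before placing supplementary rows on the factor map.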
In this tutorial, we show how to implement CA on a realistic dataset with various tools: Tanagra 1.4.48, which incorporates new features for a better reading of the results; R software, using the "ca" and "ade4" packages; OpenStat; and SAS (PROC CORRESP). We will see - as always - that all these tools produce exactly the same numerical results (fortunately!). The differences lie mainly in the organization of the outputs.
Keywords: correspondence analysis, symmetric graph, R software, package ca, package ade4, openstat, sas
Components: CORRESPONDENCE ANALYSIS
Tutorial: en_Tanagra_Correspondence_Analysis.pdf
Dataset: statements_foods.zip
References :
M. Bendixen, « A practical guide to the use of the correspondence analysis in marketing research », Marketing Research On-Line, 1 (1), pp. 16-38, 1996.
Tanagra Tutorial, "Correspondence Analysis".
Tuesday, February 5, 2013
Exploratory Factor Analysis
PCA (Principal Component Analysis) is a dimension reduction technique that provides a synthetic description of a set of quantitative variables. It produces latent variables called principal components (or factors), which are linear combinations of the original variables. The number of useful components is much lower than the number of original variables because the latter are (more or less) correlated. PCA also reveals the internal structure of the data, because the components are constructed so as to explain the variance of the data optimally.
PFA (Principal Factor Analysis) is often confused with PCA. There has been significant controversy about whether the two techniques are equivalent. One way to distinguish them is to consider that the factors from PCA account for the maximal amount of variance of the available variables, while those from PFA account only for the common variance in the data. The latter seems more appropriate if the goal of the analysis is to produce latent variables which highlight the underlying relations between the original variables: the influence of variables that are not related to the others should be excluded.
They thus differ in the nature of the information they use. But the nuance is not obvious, especially as they are often grouped in the same tool in popular software (e.g. “PROC FACTOR” in SAS; “ANALYZE / DATA REDUCTION / FACTOR” in SPSS; etc.). In addition, their outputs and their interpretation are very similar.
In this tutorial, we present three approaches: Principal Component Analysis (PCA); non-iterative Principal Factor Analysis (PFA); and non-iterative Harris Component Analysis (Harris). We highlight the differences by comparing the matrix used for the diagonalization process (the correlation matrix for PCA). We detail the steps of the calculations using a program for R. We check our results by comparing them to those of SAS (PROC FACTOR). Thereafter, we implement these methods with Tanagra, with R using the psych package, and with SPSS.
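The difference between the diagonalized matrices can be sketched in Python with NumPy. This is an illustration under assumptions, not the tutorial's R program: the data are simulated, and for PFA we assume that the diagonal of the correlation matrix is replaced by the squared multiple correlations (SMC) as prior communality estimates, which is a common choice. The Harris variant is left out of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations, 5 variables, the first three share a common factor
n = 200
common = rng.normal(size=n)
X = rng.normal(size=(n, 5))
X[:, :3] += 1.5 * common[:, None]

R = np.corrcoef(X, rowvar=False)       # correlation matrix

# PCA: diagonalize R itself (diagonal of ones = total variance of each variable)
pca_eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]

# Non-iterative PFA (assumed variant): diagonal replaced by the SMC,
# i.e. the share of each variable explained by all the others
smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
R_reduced = R.copy()
np.fill_diagonal(R_reduced, smc)
pfa_eigenvalues = np.sort(np.linalg.eigvalsh(R_reduced))[::-1]
```

The PCA eigenvalues sum to the number of variables, whereas the PFA eigenvalues sum only to the total of the communality estimates: variables weakly related to the others weigh less in the analysis.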
Keywords: PCA, principal component analysis, correlation matrix, principal factor analysis, harris, reproduced correlation, residual correlation, partial correlation, varimax rotation, R software, psych package, principal( ), fa( ), proc factor, SAS, SPSS
Components: PRINCIPAL COMPONENT ANALYSIS, PRINCIPAL FACTOR ANALYSIS, HARRIS COMPONENT ANALYSIS, FACTOR ROTATION
Tutorial: en_Tanagra_Principal_Factor_Analysis.pdf
Datasets: beer_rnd.zip
References:
D. Suhr, "Principal Component Analysis vs. Exploratory Factor Analysis".
Wikipedia, "Factor Analysis".
Friday, January 18, 2013
New features for PCA in Tanagra
Principal Component Analysis (PCA) is a very popular dimension reduction technique. The aim is to produce a small number of factors which summarize the information in the data as well as possible. The factors are linear combinations of the original variables. From a certain point of view, PCA can be seen as a compression technique.
Determining the appropriate number of factors is a difficult problem in PCA. Various approaches are possible, but there is no real state-of-the-art method. The only way to proceed is to try different approaches in order to obtain a clear indication of a good solution. We showed how to program them in R in a recent paper. These techniques are now incorporated into Tanagra 1.4.45. We have also added the KMO index (Measure of Sampling Adequacy, MSA) and Bartlett's test of sphericity to the Principal Component Analysis tool.
In this tutorial, we present these new features incorporated into Tanagra on a realistic example. To check our implementation, we compare our results with those of SAS PROC FACTOR when the equivalent is available.
Keywords: principal component analysis, pca, sas, proc princomp, proc factor, bartlett's test of sphericity, R software, scree plot, cattell, kaiser-guttman, karlis saporta spinaki, broken stick approach, parallel analysis, randomization, bootstrap, correlation, partial correlation, varimax, factor rotation, variable clustering, msa, kmo index, correlation circle
Components: PRINCIPAL COMPONENT ANALYSIS, CORRELATION SCATTERPLOT, PARALLEL ANALYSIS, BOOTSTRAP EIGENVALUES, FACTOR ROTATION, SCATTERPLOT, VARHCA
Tutorial: en_Tanagra_PCA_New_Tools.pdf
Dataset : beer_pca.xls
References:
Tanagra - "Principal Component Analysis (PCA)"
Tanagra - "VARIMAX rotation in Principal Component Analysis"
Tanagra - "PCA using R - KMO index and Bartlett's test"
Tanagra - "Choosing the number of components in PCA"
Saturday, January 12, 2013
Choosing the number of components in PCA
Principal Component Analysis (PCA) is a dimension reduction technique. We obtain a set of factors which summarize, as well as possible, the information available in the data. The factors (or components) are linear combinations of the original variables.
Choosing the right number of factors is a crucial problem in PCA. If we select too many factors, we include noise from the sampling fluctuations in the analysis. If we choose too few factors, we lose relevant information and the analysis is incomplete. Unfortunately, there is no indisputable approach for determining the number of factors. As a rule of thumb, we should select only the interpretable factors, knowing that the choice depends heavily on domain expertise. And yet, this expertise is not always available: we often rely precisely on the data analysis to gain a better knowledge of the studied domain.
In this tutorial, we present various approaches for determining the right number of factors for a PCA based on the correlation matrix. Some of them, such as the Kaiser-Guttman rule or the scree plot method, are very popular even if they are not really statistically sound; others seem more rigorous, but are seldom if ever used because they are not available in popular statistical software suites.
First, we use Tanagra and an Excel spreadsheet to implement some of the methods; then, especially for the resampling-based approaches, we write programs for R from the results of the princomp() procedure.
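Two of the simplest rules mentioned above can be sketched in plain Python. The eigenvalues below are hypothetical (they are not those of crime_dataset_pca.zip); they sum to 6, as expected for a PCA on the correlation matrix of 6 variables. We assume the usual formulations: the Kaiser-Guttman rule keeps eigenvalues greater than 1, and the broken-stick method keeps the leading eigenvalues that exceed the thresholds b_k = 1/k + 1/(k+1) + ... + 1/p.

```python
def kaiser_guttman(eigenvalues):
    """Number of eigenvalues strictly greater than 1 (the mean eigenvalue)."""
    return sum(1 for ev in eigenvalues if ev > 1.0)

def broken_stick_thresholds(p):
    """Thresholds b_k = sum_{i=k}^{p} 1/i, expressed in eigenvalue units."""
    return [sum(1.0 / i for i in range(k, p + 1)) for k in range(1, p + 1)]

def broken_stick(eigenvalues):
    """Number of leading eigenvalues that exceed their broken-stick threshold."""
    thresholds = broken_stick_thresholds(len(eigenvalues))
    kept = 0
    for ev, b in zip(eigenvalues, thresholds):
        if ev <= b:
            break              # stop at the first eigenvalue below its threshold
        kept += 1
    return kept

# Hypothetical eigenvalues, sorted in decreasing order
eigenvalues = [3.1, 1.2, 0.8, 0.5, 0.3, 0.1]

n_kaiser = kaiser_guttman(eigenvalues)
n_broken_stick = broken_stick(eigenvalues)
thresholds = broken_stick_thresholds(len(eigenvalues))
```

On this example the two rules disagree (two factors for Kaiser-Guttman, one for the broken stick), which illustrates why several approaches are usually compared before settling on a solution.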
Keywords: principal component analysis, factor analysis, pca, princomp, R software, bartlett's test of sphericity, xlsx package, scree plot, kaiser-guttman rule, broken-stick method, parallel analysis, randomization, bootstrap, correlation, partial correlation
Components: PRINCIPAL COMPONENT ANALYSIS, LINEAR CORRELATION, PARTIAL CORRELATION
Tutorial: en_Tanagra_Nb_Components_PCA.pdf
Dataset: crime_dataset_pca.zip
References :
D. Jackson, “Stopping Rules in Principal Components Analysis: A Comparison of Heuristical and Statistical Approaches”, in Ecology, 74(8), pp. 2204-2214, 1993.
P. Neto, D. Jackson, K. Somers, “How Many Principal Components? Stopping Rules for Determining the Number of non-trivial Axes Revisited”, in Computational Statistics & Data Analysis, 49(2005), pp. 974-997, 2004.
Tanagra - "Principal Component Analysis (PCA)"
Tanagra - "VARIMAX rotation in Principal Component Analysis"
Tanagra - "PCA using R - KMO index and Bartlett's test"
Monday, January 7, 2013
PCA using R - KMO index and Bartlett's test
Principal Component Analysis (PCA) is a dimension reduction technique. We obtain a set of factors which summarize, as well as possible, the information available in the data. The factors are linear combinations of the original variables. The approach can handle only quantitative variables.
We have presented PCA in previous tutorials. In this paper, we describe in detail two indicators used to check whether applying PCA to a dataset is worthwhile: Bartlett's sphericity test and the KMO index. They are directly available in some commercial tools (e.g. SAS or SPSS). Here, we describe the formulas and we show how to program them in R. We compare the results with those of SAS on a dataset.
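For readers who prefer Python, the two indicators can be sketched with NumPy as follows. This is not the tutorial's R program; it assumes the standard formulas: the Bartlett statistic is -(n - 1 - (2p + 5)/6) ln|R| with p(p - 1)/2 degrees of freedom, and the overall KMO index compares the squared correlations with the squared partial correlations derived from the inverse of R. The correlation matrix and the sample size are hypothetical.

```python
import numpy as np

def bartlett_sphericity(R, n):
    """Bartlett's test statistic and degrees of freedom for a correlation matrix R (n observations)."""
    p = R.shape[0]
    _, logdet = np.linalg.slogdet(R)               # ln|R|, computed stably
    statistic = -(n - 1 - (2 * p + 5) / 6.0) * logdet
    dof = p * (p - 1) // 2
    return statistic, dof

def kmo_index(R):
    """Overall KMO (MSA) index from the correlations and the partial correlations."""
    inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)                # partial correlations given all other variables
    off = ~np.eye(R.shape[0], dtype=bool)          # off-diagonal mask
    r2 = (R[off] ** 2).sum()
    a2 = (partial[off] ** 2).sum()
    return r2 / (r2 + a2)

# Hypothetical correlation matrix and sample size
R_example = np.array([[1.0, 0.6, 0.5],
                      [0.6, 1.0, 0.4],
                      [0.5, 0.4, 1.0]])
n_example = 100

statistic, dof = bartlett_sphericity(R_example, n_example)
kmo = kmo_index(R_example)
```

The statistic is compared to a chi-squared distribution with dof degrees of freedom; a KMO close to 1 suggests that the variables share enough common information for a PCA to be useful.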
Keywords: principal component analysis, pca, spss, sas, proc factor, princomp, kmo index, msa, measure of sampling adequacy, bartlett's sphericity test, xlsx package, psych package, R software
Components: VARHCA, PRINCIPAL COMPONENT ANALYSIS
Tutorial: en_Tanagra_KMO_Bartlett.pdf
Dataset: socioeconomics.zip
References:
Tanagra Tutorial - "Principal Component Analysis (PCA)"
Tanagra Tutorial - "VARIMAX rotation in Principal Component Analysis"
SPSS - "Factor algorithms"
SAS - "The Factor procedure"
Sunday, December 30, 2012
Discriminant Correspondence Analysis
The aim of canonical discriminant analysis is to explain the membership of the instances of a dataset in predefined groups. The groups are specified by a dependent categorical variable (class attribute, response variable); the explanatory variables (descriptors, predictors, independent variables) are all continuous. We obtain a small number of latent variables which distinguish the groups as well as possible. These new features, called factors, are linear combinations of the initial descriptors. The process is a valuable dimensionality reduction technique. But its main drawback is that it cannot be directly applied when the descriptors are discrete. Even if the calculations are possible when we recode the variables, using dummy variables for instance, the interpretation of the results - which is one of the main goals of canonical discriminant analysis - is not really obvious.
In this tutorial, we present a variant of discriminant analysis, due to Hervé Abdi (2007), which is applicable to discrete descriptors. The approach is based on a transformation of the raw dataset into a kind of contingency table. The rows of the table correspond to the values of the target attribute; the columns are the indicators associated with the predictors’ values. The author then suggests using a correspondence analysis, on the one hand to distinguish the groups, and on the other hand to detect the relevant relationships between the values of the target attribute and those of the explanatory variables. The author called his approach "discriminant correspondence analysis" because it uses a correspondence analysis framework to solve a discriminant analysis problem.
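The transformation of the raw dataset into a groups-by-indicators table can be sketched in plain Python. The data below are hypothetical (they are not the wines of Abdi's paper), and the function name dca_table is ours. Each cell counts the instances of a group that take a given value of a given descriptor; a correspondence analysis is then applied to this table.

```python
from collections import defaultdict

def dca_table(targets, descriptors):
    """Cross-tabulate groups (rows) against (variable, value) indicators (columns)."""
    columns = sorted({(var, val) for row in descriptors for var, val in row.items()})
    groups = sorted(set(targets))
    counts = defaultdict(int)
    for group, row in zip(targets, descriptors):
        for var, val in row.items():
            counts[(group, (var, val))] += 1       # one hit per indicator set to 1
    table = [[counts[(g, col)] for col in columns] for g in groups]
    return groups, columns, table

# Hypothetical toy data: 6 instances, 3 groups, 2 discrete descriptors
targets = ["r1", "r1", "r2", "r2", "r2", "r3"]
descriptors = [
    {"woody": "low", "fruity": "high"},
    {"woody": "low", "fruity": "mid"},
    {"woody": "high", "fruity": "low"},
    {"woody": "high", "fruity": "mid"},
    {"woody": "mid", "fruity": "low"},
    {"woody": "low", "fruity": "high"},
]

groups, columns, table = dca_table(targets, descriptors)
row_sums = [sum(row) for row in table]
```

Each row total equals the size of the group multiplied by the number of descriptors, since every instance sets exactly one indicator per descriptor.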
In what follows, we detail the use of discriminant correspondence analysis with Tanagra 1.4.48. We use the example described in Hervé Abdi's paper. The goal is to explain the origin of 12 wines (3 possible regions) using 5 descriptors related to characteristics assessed by professional tasters. In a second part (section 3), we reproduce all the calculations with a program written for R.
Keywords: canonical discriminant analysis, descriptive discriminant analysis, correspondence analysis, R software, xlsx package, ca package
Components: DISCRIMINANT CORRESPONDENCE ANALYSIS
Tutorial : Tutorial DCA
Dataset: french_wine_dca.zip
References:
H. Abdi, « Discriminant correspondence analysis », In N.J. Salkind (Ed.): Encyclopedia of Measurement and Statistics. Thousand Oaks (CA): Sage. pp. 270-275, 2007.