Wednesday, December 18, 2013

Tanagra - Version 1.4.50

Improvements have been introduced and a new component has been added.

HAC. Hierarchical agglomerative clustering. The computing time has been dramatically improved. The new procedure will be detailed in a forthcoming tutorial.

CATVARHAC. Clustering of the levels of nominal variables. The calculations are based on the work of Abdallah and Saporta (1998). The component performs an agglomerative hierarchical clustering of the levels of the qualitative variables. Dice's index is used as the distance measure. Several linkage criteria are available: single linkage, complete linkage, average linkage, and Ward's method. A tutorial describing the method will follow soon.
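
To fix ideas, here is a minimal R sketch of the underlying principle (not the Tanagra source code): each level is represented by its 0/1 indicator column, the Dice index is turned into a distance, and hclust() performs the agglomeration. The toy data and the dice_dist() helper are purely illustrative.

# toy data: two qualitative variables
set.seed(1)
df <- data.frame(
  x1 = factor(sample(c("A", "B", "C"), 100, replace = TRUE)),
  x2 = factor(sample(c("low", "high"), 100, replace = TRUE))
)

# 0/1 indicator matrix: one column per level of each variable
Z <- model.matrix(~ . - 1, data = df,
                  contrasts.arg = lapply(df, contrasts, contrasts = FALSE))

# Dice similarity between two levels (binary vectors a, b):
# s = 2 * |a AND b| / (|a| + |b|) ; distance = 1 - s
dice_dist <- function(Z) {
  cross <- crossprod(Z)               # co-occurrence counts between levels
  n     <- diag(cross)                # level frequencies
  s     <- 2 * cross / outer(n, n, "+")
  as.dist(1 - s)
}

# agglomerative clustering of the levels (average linkage here)
tree <- hclust(dice_dist(Z), method = "average")
plot(tree, main = "Clustering of the levels (Dice distance)")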

Download page : setup

Sunday, October 27, 2013

Parallel programming in R

Personal computers are becoming more and more powerful, and most of them are now equipped with multi-core processors. At the same time, most data mining tools, free or commercial, still rely on single-threaded calculations: only one core is used during the computation while the others remain idle.

Previously, we introduced two multithreaded variants of linear discriminant analysis in Sipina 3.10 and 3.11. While developing those solutions, I studied the parallelization mechanisms available in other data mining tools. They are rather scarce. I noted, however, that highly sophisticated strategies are available for the R software, often in the form of environments for developing programs that target multi-core processors, multiprocessor machines, and even computer clusters. I studied in particular the "parallel" package, which is itself derived from the 'snow' and 'multicore' packages. Let us be clear: the library cannot miraculously accelerate an existing procedure. It gives us the opportunity to use the machine's resources effectively by reorganizing the calculations properly. Basically, the idea is to break the process down into tasks that can be run in parallel; when these tasks are completed, we consolidate their results.

In this tutorial, we detail the parallelization of the computation of the within-class covariance matrix under R 3.0.0. In a first step, we describe a single-threaded approach that is easy to convert, i.e. whose elementary tasks are easily identifiable. In a second step, we use the tools of the "parallel" and "doParallel" packages to run these elementary tasks on the available cores. We then compare the processing times. We note that, unlike the toy examples available on the web, the results are mixed: the bottleneck is the handling of the data when we process a large dataset.
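
As an illustration of the principle (a simplified sketch, not the tutorial's program): the per-class covariance matrices are the elementary tasks; they are computed in parallel with parLapply() and then pooled. The within_cov_parallel() helper and the iris example are illustrative only.

library(parallel)

within_cov_parallel <- function(X, y, ncores = 2) {
  cl <- makeCluster(ncores)
  on.exit(stopCluster(cl))
  groups <- split(as.data.frame(X), y)                    # one block of rows per class
  # elementary task: (n_k - 1) * covariance matrix of class k
  parts <- parLapply(cl, groups, function(Xk) (nrow(Xk) - 1) * cov(Xk))
  Reduce(`+`, parts) / (nrow(X) - length(groups))         # pooled within-class covariance
}

# usage on a toy dataset
W <- within_cov_parallel(iris[, 1:4], iris$Species, ncores = 2)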

Keywords:  linear discriminant analysis, within-class covariance matrix, R software, parallel package, doparallel package, parLapply, mclapply, foreach
Tutorial: en_Tanagra_Parallel_Programming_R.pdf
Files: parallel.R, multithreaded_lda.zip
References:
R-core, Package 'parallel', April 18, 2013.

Sunday, September 29, 2013

Load balanced multithreading for LDA

In a previous paper, we described a multithreading strategy for linear discriminant analysis. The aim was to take advantage of the multi-core processors of recent computers. We noted that, for the same memory occupation as the standard implementation, the computation time can be decreased dramatically, depending on the characteristics of the dataset. The solution, however, had two drawbacks: the number of cores used depended on the number of classes K of the dataset, and the load of the cores depended on the class distribution. For instance, for a dataset with K = 2 highly unbalanced classes, the gain was negligible compared with the single-threaded version.

In this paper, we present a new approach for the multithreaded implementation of linear discriminant analysis, available in Sipina 3.11. It overcomes the two bottlenecks of the previous version: the capacity of the machine is fully used and, more interestingly, the number of threads (cores) becomes customizable, allowing the user to adapt the machine resources devoted to processing the database. But this does not come for free: the memory occupation increases, depending on both the characteristics of the data and the number of cores we want to use.

To evaluate the improvement introduced in this new version, we use various benchmark datasets to compare its computation time with those of the previous multithreaded approach, the single-threaded version, and the state-of-the-art proc discrim of SAS 9.3.

Keywords: sipina, multithreading, thread, multithreaded data mining, multithread processing, linear discriminant analysis, sas, proc discrim, R software, lda, MASS package, load balancing
Components:  LINEAR DISCRIMINANT ANALYSIS
Tutorial: en_Tanagra_Sipina_LDA_Threads_Bis.pdf
Dataset: multithreaded_lda.zip
References:
S. Rathburn, A. Wiesner, S. Basu, "STAT 505: Applied Multivariate Statistical Analysis", Lesson 10: Discriminant Analysis,  PennState, Online Learning: Department of Statistics.

Sunday, September 15, 2013

Tanagra - Version 1.4.49

Some enhancements regarding the factor analysis methods (PCA - principal component analysis, MCA - multiple correspondence analysis, CA - correspondence analysis, AFDM - factor analysis for mixed data) have been incorporated. In particular, the outputs have been enriched.

The VARIMAX rotation has been improved. Thanks to Frédéric Glausinger for optimized source code.

The Benzécri correction has been added to the MCA outputs. Thanks to Bernard Choffat for this suggestion.

Download page : setup

Thursday, May 30, 2013

Sipina - Version 3.11

A new multithreaded version of linear discriminant analysis is added to Sipina 3.11. Compared with the previous one, it has two assets: (1) it is able to use all the resources available on machines with multi-core processors or multiple processors; (2) the load balancing is better. It requires however more memory, since the internal calculation structures are duplicated M times (where M is the number of threads).

A forthcoming tutorial will compare the behavior of this approach with the previous version and with the single-threaded implementation.

Keywords: linear discriminant analysis, multithreaded implementation, multithread
Sipina website: Sipina
Download: Setup file

Wednesday, May 29, 2013

Multithreading for linear discriminant analysis

Most modern personal computers have multi-core CPUs, which considerably increases their processing capabilities. Unfortunately, the popular free data mining tools do not really incorporate multithreaded processing in the data mining algorithms they provide, aside from particular cases such as ensemble methods or the cross-validation process. The main reason for this scarcity is that it is impossible to define a generic framework valid for any mining method. We must study the sequential algorithm carefully, detect the opportunities for multithreading, and reorganize the calculations. Several constraints apply: we must not increase the memory occupation excessively, we must use all the available cores, and we must balance the load on the threads. Of course, the solution must be simple and operational on standard personal computers.

Previously, we implemented a solution for decision tree induction in Sipina 3.5. We also studied the solutions incorporated in Knime and RapidMiner. We showed that multithreaded programs outperform the single-threaded version, which is wholly natural. But we also observed that there is no unique solution: the internal organization of the multithreaded calculations influences the behavior and the performance of the program. In this tutorial, we present a multithreaded implementation of linear discriminant analysis in Sipina 3.10. The main property of the solution is that the calculation structure requires the same amount of memory as the sequential program. We note that in some situations the execution time can be decreased significantly.

Linear discriminant analysis is interesting in our context. We obtain a linear classifier whose classification performance is similar to that of the other linear methods on most real databases, especially compared with logistic regression, which is really popular (Saporta, 2006, page 480; Hastie et al., 2013, page 128). But the computation of discriminant analysis is considerably faster. We will see that this characteristic can be further enhanced when we take advantage of the multi-core architecture.

To better evaluate the improvements induced by our strategy, we compare our execution times with those of tools such as SAS 9.3 (proc discrim), R (lda from the MASS package) and Revolution R Community (an "optimized" version of R).
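
For reference, the R side of such a timing comparison can be sketched as follows (the dataset dimensions are illustrative, not those of the benchmark actually used in the tutorial):

library(MASS)

# synthetic data, for illustration only
n <- 500000; p <- 20
X <- matrix(rnorm(n * p), nrow = n)
y <- factor(sample(c("pos", "neg"), n, replace = TRUE))

system.time(model <- lda(X, grouping = y))   # single-threaded timing of lda() from MASS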

Keywords: sipina, multithreading, thread, multithreaded data mining, multithread processing, linear discriminant analysis, sas, proc discrim, R software, lda, MASS package
Components:  LINEAR DISCRIMINANT ANALYSIS
Tutorial: en_Tanagra_Sipina_LDA_Threads.pdf
Dataset: multithreaded_lda.zip
References:
Tanagra, "Multithreading for decision tree induction".
S. Rathburn, A. Wiesner, S. Basu, "STAT 505: Applied Multivariate Statistical Analysis", Lesson 10: Discriminant Analysis,  PennState, Online Learning: Department of Statistics.

Thursday, May 23, 2013

Sipina - Version 3.10

Linear discriminant analysis has been enhanced: all the operations are performed in a single pass over the data.

A multithreaded version of linear discriminant analysis has been added. It improves the execution speed by distributing the calculations over the cores (computer with a multi-core processor) or processors (multi-processor computer) available on the machine.

A tutorial will describe the behavior of these new implementations on some large databases.

Keywords: linear discriminant analysis, multithreaded implementation
Sipina website: Sipina
Download: Setup file

Sunday, March 31, 2013

Factor Analysis for Mixed Data

As a factor analysis approach, we usually use principal component analysis (PCA) when the active variables are all quantitative, and multiple correspondence analysis (MCA) when they are all categorical. But what should we do when we have a mix of these two types of variables?

A possible strategy is to discretize the quantitative variables and use MCA. But this procedure is not recommended if we have a small dataset (few instances), or if the number of qualitative variables is low compared with the number of quantitative ones. In addition, discretization implies a loss of information, and the choice of the number of intervals and of the cut points is not obvious.

Another possible strategy is to replace each qualitative variable by a set of dummy variables (a 0/1 indicator for each category of the variable to recode) and then use PCA. This strategy has a drawback: because the dispersions of the variables (the quantitative variables and the indicator variables) are not comparable, we obtain biased results.

The factor analysis for mixed data proposed by Jérôme Pagès (2004) [AFDM in French] relies on this second idea, but introduces an additional refinement. It uses dummy variables, but instead of 0/1 values it uses 0/x values, where 'x' is computed from the frequency of the corresponding category of the qualitative variable. We can therefore use a standard PCA program to carry out the analysis (Pagès, 2004, page 102), and the calculation process is well controlled. But the interpretation of the results requires a little extra effort, since it differs depending on whether we study the role of a quantitative or a qualitative variable.

In this tutorial, we show how to perform an AFDM with Tanagra 1.4.46 and R 2.15.1 (FactoMineR package). We emphasize the reading of the results: we must study simultaneously the influence of the quantitative and the qualitative variables to interpret the factors.
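
For readers who want to reproduce the R part, a minimal sketch is given below; depending on the FactoMineR version, the function is named AFDM() or, in more recent releases, FAMD(). The data frame name 'autos' is illustrative.

library(FactoMineR)

# 'autos' is a data frame mixing quantitative and qualitative columns (illustrative name)
res <- FAMD(autos, ncp = 5, graph = FALSE)   # use AFDM(autos, ncp = 5) on older versions
res$eig        # eigenvalues of the analysis
summary(res)   # coordinates, contributions, correlations and correlation ratios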

Keywords: PCA, principal component analysis, MCA, multiple correspondence analysis, AFDM, correlation, correlation ratio, FactoMineR package, R software
Components: AFDM, SCATTERPLOT WITH LABEL, CORRELATION SCATTERPLOT, VIEW MULTIPLE SCATTERPLOT
Tutorial: en_Tanagra_AFDM.pdf
Dataset: AUTOS2005AFDM.txt
References:
Jérôme Pagès, "Analyse Factorielle de Données Mixtes", Revue de Statistique Appliquée, tome 52, n° 4, 2004, pages 93-111.

Saturday, March 2, 2013

Correspondence Analysis - Tools comparison

Correspondence analysis (or factorial correspondence analysis) is an exploratory technique which detects the salient associations in a two-way contingency table. It provides an attractive graphical display in which the rows and the columns of the table are depicted as points. Thus, we can visually identify the similarities and differences between the row profiles (and between the column profiles). We can also detect the associations between rows and columns.

Correspondence analysis (CA) can be viewed as an approach to decompose the chi-squared statistic associated with a two-way contingency table into orthogonal factors. In fact, because CA is a descriptive technique, it can be applied to tables even when the chi-squared test of independence is not appropriate. The only restrictions are that the table must contain non-negative values, that summing over the rows and the columns must make sense, and that the row and column profiles can be interpreted.

Correspondence analysis can also be viewed as a factorial technique. The factors are latent variables defined as linear combinations of the row profiles (or column profiles). We can use the factor score coefficients to calculate the coordinates of supplementary rows or columns.

In this tutorial, we show how to carry out a CA on a realistic dataset with various tools: Tanagra 1.4.48, which incorporates new features for an easier reading of the results; R, using the "ca" and "ade4" packages; OpenStat; and SAS (PROC CORRESP). We will see, as always, that all these tools produce exactly the same numerical results (fortunately!). The differences lie mainly in the organization of the outputs.
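
As a pointer for the R part, here is a minimal sketch with the 'ca' package (the contingency table used here is illustrative, not the tutorial's dataset):

library(ca)

tab <- as.matrix(HairEyeColor[, , "Female"])   # a two-way contingency table
res <- ca(tab)
summary(res)    # eigenvalues, coordinates and contributions of rows and columns
plot(res)       # symmetric map: rows and columns depicted as points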

Keywords: correspondence analysis, symmetric graph, R software, package ca, package ade4, openstat, sas
Components: CORRESPONDENCE ANALYSIS
Tutorial: en_Tanagra_Correspondence_Analysis.pdf
Dataset: statements_foods.zip
References:
M. Bendixen, "A practical guide to the use of correspondence analysis in marketing research", Marketing Research On-Line, 1 (1), pp. 16-38, 1996.
Tanagra Tutorial, "Correspondence Analysis".

Tuesday, February 5, 2013

Exploratory Factor Analysis

PCA (Principal Component Analysis) is a dimension reduction technique which provides a synthetic description of a set of quantitative variables. It produces latent variables, called principal components (or factors), which are linear combinations of the original variables. The number of useful components is much lower than the number of original variables because the latter are (more or less) correlated. PCA also reveals the internal structure of the data, because the components are constructed so as to optimally explain the variance of the data.

PFA (Principal Factor Analysis) is often confused with PCA. There has been significant controversy about whether the two techniques are equivalent. One point of view that distinguishes them is to consider that the factors from PCA account for the maximal amount of variance of the available variables, while those from PFA account only for the common variance in the data. The latter seems more appropriate if the goal of the analysis is to produce latent variables that highlight the underlying relations between the original variables; the influence of the variables which are unrelated to the others should be excluded.

The two approaches thus differ in the nature of the information they use, but the nuance is not obvious, especially as they are often grouped into the same tool in popular software (e.g. PROC FACTOR in SAS, ANALYZE / DATA REDUCTION / FACTOR in SPSS). In addition, their outputs and their interpretation are very similar.

In this tutorial, we present three approaches: Principal Component Analysis (PCA), non-iterative Principal Factor Analysis (PFA), and non-iterative Harris Component Analysis (Harris). We highlight the differences by comparing the matrix (the correlation matrix for PCA) used in the diagonalization process. We detail the steps of the calculations using a program for R, and we check our results by comparing them with those of SAS (PROC FACTOR). Thereafter, we implement these methods with Tanagra, with R using the psych package, and with SPSS.
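
As a pointer for the R part, a minimal sketch with the psych package is shown below: principal() gives the PCA solution and fa(..., fm = "pa") the principal factor solution (setting max.iter = 1 approximates the non-iterative variant). The 'beer' data frame name, the number of factors and the number of observations are illustrative.

library(psych)

R <- cor(beer)   # 'beer' stands for the tutorial's quantitative dataset (illustrative name)

# PCA: diagonalization of the correlation matrix itself
pca <- principal(R, nfactors = 2, rotate = "none", n.obs = 99)

# Principal factor analysis: prior communality estimates replace the diagonal;
# max.iter = 1 approximates the non-iterative solution
pfa <- fa(R, nfactors = 2, fm = "pa", rotate = "none", n.obs = 99, max.iter = 1)

pca$loadings
pfa$loadings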

Keywords: PCA, principal component analysis, correlation matrix, principal factor analysis, harris, reproduced correlation, residual correlation, partial correlation, varimax rotation, R software, psych package, principal( ), fa( ), proc factor, SAS, SPSS
Components: PRINCIPAL COMPONENT ANALYSIS, PRINCIPAL FACTOR ANALYSIS, HARRIS COMPONENT ANALYSIS, FACTOR ROTATION
Tutorial: en_Tanagra_Principal_Factor_Analysis.pdf
Datasets: beer_rnd.zip
References:
D. Suhr, "Principal Component Analysis vs. Exploratory Factor Analysis".
Wikipedia, "Factor Analysis".

Friday, January 18, 2013

New features for PCA in Tanagra

Principal Component Analysis (PCA) is a very popular dimension reduction technique. The aim is to produce a small number of factors which summarize as well as possible the information in the data. The factors are linear combinations of the original variables. From a certain point of view, PCA can be seen as a compression technique.

Determining the appropriate number of factors is a difficult problem in PCA. Various approaches are possible, and there is no real state-of-the-art method; the only way to proceed is to try different approaches in order to obtain a clear indication about a good solution. We showed how to program them under R in a recent paper. These techniques are now incorporated into Tanagra 1.4.45. We have also added the KMO index (Measure of Sampling Adequacy, MSA) and Bartlett's test of sphericity to the Principal Component Analysis tool.

In this tutorial, we present these new features of Tanagra on a realistic example. To check our implementation, we compare our results with those of SAS PROC FACTOR whenever the equivalent output is available.

Keywords: principal component analysis, pca, sas, proc princomp, proc factor, bartlett's test of sphericity, R software, scree plot, cattell, kaiser-guttman, karlis saporta spinaki, broken stick approach, parallel analysis, randomization, bootstrap, correlation, partial correlation, varimax, factor rotation, variable clustering, msa, kmo index, correlation circle
Components: PRINCIPAL COMPONENT ANALYSIS, CORRELATION SCATTERPLOT, PARALLEL ANALYSIS, BOOTSTRAP EIGENVALUES, FACTOR ROTATION, SCATTERPLOT, VARHCA
Tutorial: en_Tanagra_PCA_New_Tools.pdf
Dataset : beer_pca.xls
References:
Tanagra - "Principal Component Analysis (PCA)"
Tanagra - "VARIMAX rotation in Principal Component Analysis"
Tanagra - "PCA using R - KMO index and Bartlett's test"
Tanagra - "Choosing the number of components in PCA"

Saturday, January 12, 2013

Choosing the number of components in PCA

Principal Component Analysis (PCA)  is a dimension reduction technique. We obtain a set of factors which summarize, as well as possible, the information available in the data. The factors (or components) are linear combinations of the original variables.

Choosing the right number of factors is a crucial problem in PCA. If we select too many factors, we include noise from the sampling fluctuations in the analysis; if we choose too few, we lose relevant information and the analysis is incomplete. Unfortunately, there is no indisputable approach for determining the number of factors. As a rule of thumb, we should retain only the interpretable factors, knowing that this choice depends heavily on domain expertise. And yet, such expertise is not always available; indeed, we often rely on the data analysis precisely to gain a better knowledge of the studied domain.

In this tutorial, we present various approaches for determining the right number of factors for a PCA based on the correlation matrix. Some of them, such as the Kaiser-Guttman rule or the scree plot method, are very popular even though they are not really statistically sound; others seem more rigorous, but are seldom if ever used because they are not available in the popular statistical software suites.

First, we use Tanagra and an Excel spreadsheet to implement some of the methods; then, especially for the resampling-based approaches, we write R programs based on the results of the princomp() procedure.
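
To fix ideas, here is a minimal sketch of two of these stopping rules computed from the output of princomp(); the data frame name 'crime' is illustrative.

pca <- princomp(crime, cor = TRUE)   # PCA on the correlation matrix
ev  <- pca$sdev^2                    # eigenvalues
p   <- length(ev)

# Kaiser-Guttman rule: retain the components whose eigenvalue exceeds 1
sum(ev > 1)

# Broken-stick rule: b_k = sum_{i=k..p} 1/i ; retain the leading components with ev > b_k
bs <- rev(cumsum(1 / (p:1)))
cbind(eigenvalue = ev, broken.stick = bs)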

Keywords: principal component analysis, factor analysis, pca, princomp, R software, bartlett's test of sphericity, xlsx package, scree plot, kaiser-guttman rule, broken-stick method, parallel analysis, randomization, bootstrap, correlation, partial correlation
Components: PRINCIPAL COMPONENT ANALYSIS, LINEAR CORRELATION, PARTIAL CORRELATION
Tutorial: en_Tanagra_Nb_Components_PCA.pdf
Dataset: crime_dataset_pca.zip
References:
D. Jackson, “Stopping Rules in Principal Components Analysis: A Comparison of Heuristical and Statistical Approaches”, in Ecology, 74(8), pp. 2204-2214, 1993.
P. Neto, D. Jackson, K. Somers, “How Many Principal Components? Stopping Rules for Determining the Number of non-trivial Axes Revisited”, in Computational Statistics & Data Analysis, 49(2005), pp. 974-997, 2004.
Tanagra - "Principal Component Analysis (PCA)"
Tanagra - "VARIMAX rotation in Principal Component Analysis"
Tanagra - "PCA using R - KMO index and Bartlett's test"

Monday, January 7, 2013

PCA using R - KMO index and Bartlett's test

Principal Component Analysis (PCA) is a dimension reduction technique. We obtain a set of factors which summarize, as well as possible, the information available in the data. The factors are linear combinations of the original variables. The approach can handle only quantitative variables.

We have presented PCA in previous tutorials. In this paper, we describe in detail two indicators used to assess whether it is worth performing a PCA on a dataset: Bartlett's sphericity test and the KMO index. They are directly available in some commercial tools (e.g. SAS or SPSS). Here, we describe the formulas and show how to program them under R, and we compare the results obtained with those of SAS on a dataset.
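
As a preview, here is a minimal R sketch of the two indicators; the helper names are ours, and the tutorial details and checks each step against SAS.

# Bartlett's test of sphericity: H0 is that the correlation matrix is the identity
bartlett_sphericity <- function(X) {
  n <- nrow(X); p <- ncol(X)
  R <- cor(X)
  chi2 <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df   <- p * (p - 1) / 2
  c(chi2 = chi2, df = df, p.value = pchisq(chi2, df, lower.tail = FALSE))
}

# Overall KMO index (MSA): squared correlations vs. squared partial correlations
kmo_index <- function(X) {
  R  <- cor(X)
  iR <- solve(R)
  A  <- -iR / sqrt(outer(diag(iR), diag(iR)))   # partial (anti-image) correlations
  diag(R) <- 0; diag(A) <- 0
  sum(R^2) / (sum(R^2) + sum(A^2))
}

# usage on any data frame of quantitative variables X:
# bartlett_sphericity(X); kmo_index(X)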

Keywords: principal component analysis, pca, spss, sas, proc factor, princomp, kmo index, msa, measure of sampling adequacy, bartlett's sphericity test, xlsx package, psych package, R software
Components: VARHCA, PRINCIPAL COMPONENT ANALYSIS
Tutorial: en_Tanagra_KMO_Bartlett.pdf
Dataset: socioeconomics.zip
References:
Tanagra - "Principal Component Analysis (PCA)"
Tanagra - "VARIMAX rotation in Principal Component Analysis"
SPSS - "Factor algorithms"
SAS - "The Factor procedure"