Tanagra - Data Mining and Data Science Tutorials: September 2013

Sunday, September 29, 2013

Load balanced multithreading for LDA

In a previous paper, we described a multithreading strategy for linear discriminant analysis. The aim was to take advantage of the multicore processors of recent computers. We noted that, with the same memory footprint as the standard implementation, the computation time can decrease dramatically, depending on the dataset characteristics. The solution had two drawbacks, however: the number of cores used depended on the number of classes K in the dataset, and the load on each core depended on the class distribution. For instance, for one dataset with K = 2 highly unbalanced classes, the gain was negligible compared to the single-threaded version.
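The effect of class imbalance can be illustrated with a little arithmetic. The sketch below is not taken from Sipina: it assumes, hypothetically, one thread per class and a cost proportional to the number of rows in each class. Under those assumptions the slowest thread is the one handling the largest class, which bounds the achievable speedup. The function name `per_class_speedup_bound` is ours.

```python
def per_class_speedup_bound(class_counts):
    """Upper bound on the speedup of a one-thread-per-class scheme.

    Assumption (ours, not from the source): the work of a thread is
    proportional to the number of rows of its class, so the total run time
    is driven by the largest class.
    """
    total = sum(class_counts)
    return total / max(class_counts)


# Two balanced classes: both threads finish together.
print(per_class_speedup_bound([500, 500]))           # 2.0

# Two highly unbalanced classes: one thread does almost all the work.
print(round(per_class_speedup_bound([950, 50]), 3))  # 1.053
```

With a 95% / 5% split, the bound is barely above 1, which is consistent with the negligible gain reported for the unbalanced K = 2 dataset.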

In this paper, we present a new approach for the multithreaded implementation of linear discriminant analysis, available in Sipina 3.11. It overcomes the two bottlenecks of the previous version: the capacity of the machine is fully used and, more interestingly, the number of threads (cores) used becomes customizable, so the user can adapt the machine resources devoted to processing the database. This comes at a cost, however: the memory occupation increases, and it depends on both the characteristics of the data and the number of cores that we want to use.
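One common way to obtain this kind of load balancing is sketched below. It is an assumption on our part, not a description of the Sipina code: the rows are split into equal-sized chunks regardless of class, each thread accumulates its own private statistics (class counts, class sums, cross-products), and the partial results are merged at the end. All names (`partial_stats`, `pooled_within_scatter`, `n_threads`) are hypothetical. The example is pure Python to stay self-contained, so it shows the partitioning logic rather than a real speedup, since CPython threads do not run such loops in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(rows, labels, n_classes, p):
    """Accumulate the statistics of one chunk in thread-private buffers."""
    counts = [0] * n_classes
    sums = [[0.0] * p for _ in range(n_classes)]
    cross = [[0.0] * p for _ in range(p)]  # raw sum of x * x^T over the chunk
    for x, k in zip(rows, labels):
        counts[k] += 1
        for i in range(p):
            sums[k][i] += x[i]
            for j in range(p):
                cross[i][j] += x[i] * x[j]
    return counts, sums, cross


def pooled_within_scatter(X, y, n_classes, n_threads):
    """Return class counts, class sums and the pooled within-class scatter W."""
    n, p = len(X), len(X[0])
    # Equal-sized contiguous chunks: the load no longer depends on the classes.
    bounds = [(t * n // n_threads, (t + 1) * n // n_threads)
              for t in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = list(pool.map(
            lambda b: partial_stats(X[b[0]:b[1]], y[b[0]:b[1]], n_classes, p),
            bounds))
    # Merge step: add up the per-thread buffers.
    counts = [sum(part[0][k] for part in parts) for k in range(n_classes)]
    sums = [[sum(part[1][k][i] for part in parts) for i in range(p)]
            for k in range(n_classes)]
    cross = [[sum(part[2][i][j] for part in parts) for j in range(p)]
             for i in range(p)]
    # W = sum(x x^T) - sum_k (s_k s_k^T) / n_k, with s_k the sum of class k.
    W = [[cross[i][j] - sum(sums[k][i] * sums[k][j] / counts[k]
                            for k in range(n_classes) if counts[k])
          for j in range(p)]
         for i in range(p)]
    return counts, sums, W


X = [[0.0], [2.0], [10.0], [14.0]]
y = [0, 0, 1, 1]
print(pooled_within_scatter(X, y, n_classes=2, n_threads=2)[2])  # [[10.0]]
```

In this sketch each thread holds K counts, K x p sums and a p x p matrix, so the extra memory grows with the number of threads, the number of classes and the number of variables. That matches the trade-off described above, although the actual buffers used by Sipina may differ.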

To evaluate the improvement introduced in this new version, we use various benchmark datasets to compare its computation time with those of the previous multithreaded approach, the single-threaded version, and the state-of-the-art proc discrim of SAS 9.3.

Keywords: sipina, multithreading, thread, multithreaded data mining, multithread processing, linear discriminant analysis, sas, proc discrim, R software, lda, MASS package, load balancing
Components: LINEAR DISCRIMINANT ANALYSIS
Tutorial: en_Tanagra_Sipina_LDA_Threads_Bis.pdf
Dataset: multithreaded_lda.zip
References:
S. Rathburn, A. Wiesner, S. Basu, "STAT 505: Applied Multivariate Statistical Analysis", Lesson 10: Discriminant Analysis, PennState, Online Learning: Department of Statistics.

Sunday, September 15, 2013

Tanagra - Version 1.4.49

Some enhancements to the factor analysis approaches (PCA - principal component analysis, MCA - multiple correspondence analysis, CA - correspondence analysis, FAMD - factor analysis of mixed data) have been incorporated. In particular, the outputs have been made more complete.

The VARIMAX rotation has been improved. Thanks to Frédéric Glausinger for the optimized source code.

The Benzecri correction has been added to the MCA outputs. Thanks to Bernard Choffat for this suggestion.

Download page: setup