Wednesday, May 29, 2013

Multithreading for linear discriminant analysis

Most modern personal computers have multicore CPUs, which considerably increases their processing capabilities. Unfortunately, the popular free data mining tools do not really incorporate multithreaded processing into the data mining algorithms they provide, aside from particular cases such as ensemble methods or cross-validation. The main reason for this scarcity is that it is impossible to define a generic framework that works for every mining method. We must carefully study the sequential algorithm, detect the opportunities for multithreading, and reorganize the calculations. We deal with several constraints: we must not excessively increase memory usage, we must use all the available cores, and we must balance the load across the threads. Of course, the solution must be simple and operational on ordinary personal computers.

Previously, we implemented a solution for decision tree induction in Sipina 3.5. We also studied the solutions incorporated in Knime and RapidMiner. We showed that the multithreaded programs outperform the single-threaded versions, which is entirely expected. But we also observed that there is no unique solution: the internal organization of the multithreaded calculations influences the behavior and the performance of the program. In this tutorial, we present a multithreaded implementation of linear discriminant analysis in SIPINA 3.10. The main property of the solution is that its calculation structure requires the same amount of memory as the sequential program. We note that in some situations, the execution time can be decreased significantly.
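To make the idea of reorganizing the calculations concrete, here is a minimal Python sketch of one possible decomposition: the pooled within-class covariance matrix, the costly part of discriminant analysis, is computed with one task per class. This split is an assumption made for illustration and is not necessarily how SIPINA organizes its threads; the function names `class_scatter` and `pooled_covariance` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def class_scatter(rows):
    """Return (count, mean vector, scatter matrix) for the rows of one class."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    scatter = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[j] - mean[j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                scatter[a][b] += d[a] * d[b]
    return n, mean, scatter


def pooled_covariance(X, y, n_threads=4):
    """Pooled within-class covariance, one task per class (illustrative split)."""
    # Group row references by class label; the rows themselves are not copied.
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    labels = sorted(groups)

    # Each task works on its own class and writes to its own p x p matrix,
    # so the threads need no locking.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        stats = list(pool.map(class_scatter, (groups[k] for k in labels)))

    # Sum the per-class scatter matrices and divide by n - K.
    p = len(X[0])
    dof = len(X) - len(labels)
    pooled = [[sum(s[2][a][b] for s in stats) / dof for b in range(p)]
              for a in range(p)]
    means = {k: s[1] for k, s in zip(labels, stats)}
    counts = {k: s[0] for k, s in zip(labels, stats)}
    return pooled, means, counts
```

Two caveats apply to this sketch. With one task per class, the load is balanced only when the classes have similar sizes, and no more than K cores are used for K classes. Also, in CPython the global interpreter lock prevents pure-Python arithmetic from running in parallel, so the code shows the structure of the computation rather than an actual speedup.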

Linear discriminant analysis is interesting in our context. It yields a linear classifier whose classification performance is similar to that of other linear methods on most real databases, in particular logistic regression, which is very popular (Saporta, 2006, page 480; Hastie et al., 2013, page 128). But discriminant analysis is comparatively much faster to compute. We will see that this advantage can be enhanced when we take advantage of the multicore architecture.
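As a reminder of why the computation is fast, the standard form of the linear classification functions is given below. The notation is ours and does not come from the tutorial: $K$ classes, $n$ instances, class means $\hat{\mu}_k$, class proportions $\hat{\pi}_k$, and pooled within-class covariance matrix $\hat{\Sigma}$.

```latex
% Pooled within-class covariance matrix
\hat{\Sigma} = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i :\, y_i = k}
  (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^{\top}

% Linear classification function for class k
d_k(x) = x^{\top} \hat{\Sigma}^{-1} \hat{\mu}_k
  - \tfrac{1}{2}\, \hat{\mu}_k^{\top} \hat{\Sigma}^{-1} \hat{\mu}_k
  + \ln \hat{\pi}_k
```

An instance $x$ is assigned to the class with the largest $d_k(x)$. Every quantity is obtained in closed form from means and cross-products of the data, whereas logistic regression is usually fitted by an iterative optimization procedure.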

To better evaluate the improvements induced by our strategy, we compare our execution time with that of tools such as SAS 9.3 (proc discrim), R (lda of the MASS package) and Revolution R Community (an "optimized" version of R).

Keywords: sipina, multithreading, thread, multithreaded data mining, multithread processing, linear discriminant analysis, sas, proc discrim, R software, lda, MASS package
Tutorial: en_Tanagra_Sipina_LDA_Threads.pdf
Tanagra, "Multithreading for decision tree induction".
S. Rathburn, A. Wiesner, S. Basu, "STAT 505: Applied Multivariate Statistical Analysis", Lesson 10: Discriminant Analysis, PennState, Online Learning: Department of Statistics.