Sunday, October 27, 2013

Parallel programming in R

Personal computers become more and more efficient. They are mostly equipped with multi-core processors. At the same time, most of the data mining tools, free or not, are often based on single-threaded calculations. Only one core is used during calculations, while others remain inactive.

Previously, we have introduced two multithreaded variants of linear discriminant analysis in Sipina 3.10 and 3.11. During the analysis that allowed me to develop the solutions introduced in Sipina, I had much studied parallelization mechanisms available in other Data Mining Tools. They are rather scarce. I noted that highly sophisticated strategies are proposed for the R software. These are often environments that enable to develop programs for multi-core processors machines, multiprocessor machines, and even for computer cluster. I studied in particular the "parallel" package which is itself derived from 'snow' and 'multicore' packages. Let us be quite clear. The library cannot miraculously accelerate an existing procedure. It gives us the opportunity to effectively use the machines resources by rearranging properly the calculations. Basically, the idea is to break down the process into tasks that can be run in parallel. When these tasks are completed, we perform the consolidation.

In this tutorial, we detail the parallelization of the calculation of the within-class covariance matrix under R 3.0.0. In a first step, we describe single-threaded approach, but easily convertible i.e. the basic tasks are easily identifiable. As a second step, we use the tools of “parallel” and “doParallel” packages to run elementary tasks on the available cores. We will then compare the processing time. We note that, unlike the toy examples available on the web, the results are mixed. The bottleneck is the managing of the data when we handle a large dataset.

Keywords:  linear discriminant analysis, within-class covariance matrix, R software, parallel package, doparallel package, parLapply, mclapply, foreach
Didacticiel : en_Tanagra_Parallel_Programming_R.pdf
Fichiersparallel.R, multithreaded_lda.zip
Références :
R-core, Package 'parallel', April 18, 2013.