Wednesday, June 9, 2010

Handling large datasets in R - The "filehash" package

The processing of very large datasets is a crucial problem in data mining. To handle them, we must avoid loading the whole dataset into main memory. The idea is quite simple: (1) we write all or part of the dataset to the disk in a binary file format that allows direct access; (2) the machine learning algorithms are modified so that they can efficiently access the values stored on the disk. The characteristics of the computer are thus no longer a bottleneck for handling a large dataset.
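To make this idea concrete, here is a minimal sketch of the disk-based, key-value access pattern using the "filehash" package presented below (the database name "mydb" and the stored vectors are arbitrary examples, not taken from the tutorial):

library(filehash)

# create a database on the disk and open a connection to it
dbCreate("mydb")
db <- dbInit("mydb")

# store each column of a (possibly huge) dataset under its own key
dbInsert(db, "x", rnorm(1e6))
dbInsert(db, "y", rnorm(1e6))

# only the requested key is read back from the disk,
# the rest of the database stays out of main memory
mean(dbFetch(db, "x"))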

In this tutorial, we describe the excellent "filehash" package for R. It allows us to copy (to dump) any kind of R object into a file on the disk. We can then handle these objects without loading them into main memory. This is especially useful for the data frame object: we can perform a statistical analysis with the usual functions, working directly from a database stored on the disk. The processing capacity is vastly improved and, at the same time, we will see that the increase in computation time remains moderate.
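In practice, the workflow is very compact. The sketch below, inspired by the references listed at the end of this post, dumps a data frame into a filehash database and accesses it through an environment; the read options and the column name "V1" are assumptions, the exact commands are detailed in the tutorial:

library(filehash)

# dump the columns of the data frame into a database on the disk;
# adjust the read.table() options to the actual file format
dumpDF(read.table("wave2M.txt", header = TRUE), dbName = "wave.db")

# expose the database as an environment: each column is fetched
# from the disk only when it is actually used
env <- db2env(db = "wave.db")

# the usual functions work directly on this environment
summary(env$V1)   # "V1" is a hypothetical column name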

To evaluate the "filehash" solution, we analyze the memory occupation and the computation time, with and without the package, while learning a decision tree with rpart (rpart package) and a linear discriminant analysis with lda (MASS package). We perform the same experiments with SIPINA, which also provides a swapping system (the data is dumped from main memory to temporary files) for handling very large datasets. We can then compare the performances of the various solutions.
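As a rough sketch of this experiment, the two models can be fitted directly on the environment returned by db2env(); the target "wave" and the predictor names V1, V2, V3 are hypothetical here, the actual formulas are given in the tutorial:

library(filehash)
library(rpart)
library(MASS)

# database created beforehand with dumpDF(), see the sketch above
env <- db2env(db = "wave.db")

# decision tree and linear discriminant analysis learned
# directly from the disk-backed data
tree.model <- rpart(wave ~ V1 + V2 + V3, data = env)
lda.model  <- lda(wave ~ V1 + V2 + V3, data = env)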

Keywords: very large dataset, filehash, decision tree, linear discriminant analysis, sipina, C4.5, rpart, lda
Tutorial: en_Tanagra_Dealing_Very_Large_Dataset_With_R.pdf
Dataset: wave2M.txt.zip
References:
R package, "filehash: Simple key-value database"
Yu-Sung Su's Blog, "Dealing with large dataset in R"
Tanagra Tutorial, "MapReduce with R", February 2015.
Tanagra Tutorial, "R programming under Hadoop", April 2015.