Sunday, February 22, 2015

MapReduce with R

Big Data has been a very popular topic in recent years. Big data analytics refers to the process of discovering useful information or knowledge from big data, which is an important issue for organizations. In concrete terms, the aim is to extend, adapt or even create novel exploratory data analysis or data mining approaches for new data sources whose main characteristics are “volume”, “variety” and “velocity”.

Distributed computing is essential in the big data context. It is unrealistic to keep increasing the power of individual servers indefinitely in order to follow the exponential growth of the information to process. The solution relies on the efficient cooperation of a myriad of networked computers, which together provide both the storage capacity and the computing power. Hadoop is a commonly cited solution for this requirement. It is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets (Big Data) on computer clusters built from commodity hardware. For the implementation of distributed programs, the MapReduce programming model plays an important role: the processing of a large dataset is expressed as parallel algorithms that run on a cluster of connected computers (nodes).
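To make the MapReduce idea concrete, here is a minimal sketch in base R only. It is not Hadoop and it does not use any package: the "nodes" are simply pieces of a small data frame, and the names `records`, `chunks`, `map_fn` and `reduce_fn` are hypothetical, chosen for this illustration. It only shows the three logical steps (map, shuffle, reduce) on a sum per group.

```r
# Toy data (made up for this sketch): one record per row
records <- data.frame(
  group = c("a", "b", "a", "c", "b", "a"),
  value = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

# Pretend the records are stored on three nodes (two rows each)
chunks <- split(records, rep(1:3, each = 2))

# Map: each node turns its own records into (key, value) pairs,
# without looking at the other nodes
map_fn <- function(chunk) {
  data.frame(key = chunk$group, val = chunk$value, stringsAsFactors = FALSE)
}
mapped <- lapply(chunks, map_fn)

# Shuffle: gather all the values that share the same key
all_pairs <- do.call(rbind, mapped)
grouped <- split(all_pairs$val, all_pairs$key)

# Reduce: combine the values of each key into a single result
reduce_fn <- function(vals) sum(vals)
group_sums <- vapply(grouped, reduce_fn, numeric(1))

print(group_sums)
```

On a real cluster, the map calls run in parallel on different machines and the framework takes care of the shuffle step; here `lapply()` and `split()` only imitate that behaviour on a single machine.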

In this tutorial, we are interested in MapReduce programming in R. We use the RHadoop technology from Revolution Analytics. The "rmr2" package in particular makes it possible to learn MapReduce programming without having to install the Hadoop environment, which is already complicated enough. There are some tutorials on this subject on the web. The one by Hugh Devlin (January 2014) is undoubtedly one of the most interesting. But it is perhaps too sophisticated for students who are not very familiar with programming in R. So I decided to start afresh, first with very simple examples. Then we progress to programming a simple data mining algorithm such as multiple linear regression.
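As an indication of how a multiple linear regression can fit the MapReduce pattern, here is a hedged sketch in base R. It is one common approach (summing the cross-products X'X and X'y computed on each block of rows), not necessarily the implementation used in the tutorial, and it does not use rmr2. The simulated data and the names `dat`, `map_fn` and `reduce_fn` are assumptions made for this illustration.

```r
# Simulated data (assumption for this sketch): y depends on x1 and x2
set.seed(1)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.1)
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# Pretend the rows are spread over three nodes
chunks <- split(dat, rep(1:3, length.out = n))

# Map: each node computes its local cross-products X'X and X'y
map_fn <- function(chunk) {
  X <- cbind(1, chunk$x1, chunk$x2)  # first column is the intercept
  list(xtx = crossprod(X), xty = crossprod(X, chunk$y))
}
partials <- lapply(chunks, map_fn)

# Reduce: the global cross-products are the sums of the local ones
reduce_fn <- function(a, b) {
  list(xtx = a$xtx + b$xtx, xty = a$xty + b$xty)
}
total <- Reduce(reduce_fn, partials)

# Solve the normal equations (X'X) beta = X'y
beta <- solve(total$xtx, total$xty)
print(beta)

# Check against lm() fitted on the whole dataset
print(coef(lm(y ~ x1 + x2, data = dat)))
```

The point of this decomposition is that each node only returns a small matrix and a small vector, whatever the number of rows it holds, so only little information has to travel over the network before the final computation.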

Keywords: big data, big data analytics, mapreduce, rmr2 package, hadoop, rhadoop, one-way anova, linear regression
Tutorial: en_Tanagra_MapReduce.pdf
Dataset: en_mapreduce_with_r.zip
References:
Hugh Devlin, "Mapreduce in R", Jan. 2014.
Tanagra Tutorial, "Parallel programming in R", October 2013.