Saturday, April 11, 2015

R programming under Hadoop

The aim of this tutorial is to show how to program the famous "word count" algorithm on a set of files stored in the HDFS file system.

The "word count" is the classic introductory example for programming under Hadoop. It is described everywhere on the web. Unfortunately, the tutorials that describe the task are often not reproducible. The datasets are not available. The whole process, including the installation of the Hadoop framework, is not described. We do not know how to access the files stored in the HDFS file system. In short, we cannot run the programs and understand in detail how they work.

In this tutorial, we describe the whole process. We first detail the installation of a virtual machine which contains a single-node Hadoop cluster. Then we show how to install R and RStudio Server, which allow us to write and run a program. Last, we write some programs based on the MapReduce scheme.
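To make the MapReduce scheme behind the word count concrete, here is a minimal sketch in base R only. It does not use Hadoop, HDFS, or the rmr2 package, and it is not the code from the tutorial: the input lines and the names map_fn, reduce_fn, mapped, grouped, and counts are assumptions made up for illustration. It only mimics the three stages (map, shuffle, reduce) on a small in-memory vector.

```r
# Toy input: each element stands in for one line of a text file.
# (Made-up lines for illustration; this is not the tutorial's dataset.)
lines <- c("the quick brown fox", "the lazy dog", "The fox")

# Map step (hypothetical helper): split one line into lowercase words
# and emit one (word, 1) pair per word.
map_fn <- function(line) {
  words <- unlist(strsplit(tolower(line), "[^a-z]+"))
  words <- words[words != ""]
  data.frame(key = words, value = rep(1L, length(words)),
             stringsAsFactors = FALSE)
}

# Reduce step (hypothetical helper): sum all the values emitted for one key.
reduce_fn <- function(key, values) {
  data.frame(key = key, value = sum(values), stringsAsFactors = FALSE)
}

# Apply the map function to every line and stack the emitted pairs.
mapped <- do.call(rbind, lapply(lines, map_fn))

# Shuffle step: group the values by key, which a MapReduce framework
# does on our behalf between the map and reduce stages.
grouped <- split(mapped$value, mapped$key)

# Apply the reduce function to each group of values.
counts <- do.call(rbind, Map(reduce_fn, names(grouped), grouped))
rownames(counts) <- NULL
print(counts)
```

Under Hadoop, the framework distributes the map and reduce calls across the nodes and performs the grouping itself; the sketch above simply runs the same logic sequentially on one machine.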

The steps, and therefore the sources of error, are numerous. We use many screenshots so that each operation can be followed precisely. This is the reason for the unusual presentation format of this tutorial.

Keywords: big data, big data analytics, mapreduce, package rmr2, package rhdfs, hadoop, rhadoop, R software, rstudio, rstudio server, cloudera, R language
Tutorial: en_Tanagra_Hadoop_with_R.pdf
Files: hadoop_with_r.zip
References:
Tanagra Tutorial, "MapReduce with R", Feb. 2015.
Hugh Devlin, "Mapreduce in R", Jan. 2014.