Thursday, December 31, 2015

R online with R-Fiddle

R-Fiddle is an online programming environment for R. It allows us to write and run programs written in R.

Although R is free and good free programming environments for R already exist (e.g. RStudio Desktop, Tinn-R), this type of tool has several advantages. It is well suited to mobile users who frequently change machines: as long as we have an Internet connection, we can work on a project without having to worry about the R installation on each PC. Collaborative work is another context in which this tool can be particularly advantageous, since it spares us the transfer of files and the management of versions. Last, the solution allows us to work on a lightweight front-end, a laptop for example, and offload the computations to a powerful remote server (in the cloud, as we would say today).

In this tutorial, we will briefly review the features of R-Fiddle.

Keywords: R software, R programming, cloud computing, linear discriminant analysis, logistic regression, classification tree, klaR package, rpart package, feature selection
Tutorial: en_Tanagra_R_Fiddle.pdf
Files: en_r_fiddle.zip
References:
R-Fiddle - http://www.r-fiddle.org/#/

Wednesday, December 30, 2015

Random Forest - Boosting with R and Python

This tutorial follows the slideshow devoted to "Bagging, Random Forest and Boosting". We show the implementation of these methods on a data file. We follow the same steps as the slideshow, i.e. we first describe the construction of a decision tree, we measure its prediction performance, and then we see how the ensemble methods can improve the results. Various aspects of these methods are highlighted: the measure of variable importance, the influence of the parameters, the influence of the characteristics of the underlying classifier (e.g. controlling the tree size), etc.

As a first step, we focus on R (rpart, adabag and randomForest packages) and Python (scikit-learn package). Programming allows us to multiply the analyses; among others, we can evaluate the influence of the parameters on the performance. As a second step, we explore the capabilities of software (Tanagra and Knime) providing turnkey solutions, very simple to implement and more accessible for people who do not like programming.
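
To fix ideas, here is a minimal sketch of the Python side of the analysis with scikit-learn; the file name and the target column are placeholders to adapt to the dataset used in the tutorial.

```python
# Sketch: a single tree, then bagging, random forest and boosting on the
# same dataset; the file name and target column below are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("dataset.csv")
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "tree": DecisionTreeClassifier(random_state=1),
    "bagging": BaggingClassifier(n_estimators=100, random_state=1),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "boosting": AdaBoostClassifier(n_estimators=100, random_state=1),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))

# variable importance, one of the aspects highlighted in the tutorial
rf = models["random forest"]
print(pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False))
```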

Keywords: R software, R programming, decision tree, classification tree, adabag package, rpart package, randomForest package, Python, scikit-learn package, bagging, boosting, random forest
Components: BAGGING, RND TREE, BOOSTING, C4.5, DISCRETE SELECT EXAMPLES
Tutorial: Bagging, Random Forest and Boosting
Files: randomforest_boosting_en.zip
References:
R. Rakotomalala, "Bagging, Random Forest, Boosting (slides)", December 2015.

Wednesday, December 23, 2015

Bagging, Random Forest, Boosting (slides)

This course material presents ensemble methods: bagging, random forest and boosting. These approaches are based on the same guiding idea: a set of base classifiers, learned with the same learning algorithm, is fitted to different versions of the dataset.

For bagging and random forest, the models are fitted independently on bootstrap samples. Random forest incorporates an additional mechanism in order to “decorrelate” the models, which in this case are necessarily decision trees.

Boosting works in a sequential fashion: the model at step (t) is fitted to a reweighted version of the sample in order to correct the errors of the model learned at the preceding step (t-1).
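
To make the reweighting mechanism concrete, here is a schematic numpy sketch of the two-class discrete AdaBoost update (a simplified variant, not the exact algorithm of the slides; the class labels are assumed to be coded in {-1, +1}).

```python
# Schematic sketch of the sequential reweighting of boosting (discrete
# AdaBoost, two classes). The labels y are assumed to be coded in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                   # uniform weights at step t = 1
    models, alphas = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)      # model at step (t), weighted sample
        pred = stump.predict(X)
        err = max(np.sum(w * (pred != y)), 1e-16)   # weighted error rate
        alpha = 0.5 * np.log((1.0 - err) / err)     # confidence of the model
        w *= np.exp(-alpha * y * pred)        # increase the weight of the errors
        w /= w.sum()                          # renormalize for step (t+1)
        models.append(stump)
        alphas.append(alpha)
    return models, alphas
```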

Keywords: bagging, boosting, random forest, decision tree, rpart package, adabag package, randomForest package, R software
Slides: Bagging - Random Forest - Boosting
References:
Breiman L., "Bagging Predictors", Machine Learning, 26, p. 123-140, 1996.
Breiman L., "Random Forests", Machine Learning, 45, p. 5-32, 2001.
Freund Y., Schapire R., "Experiments with a new boosting algorithm", International Conference on Machine Learning, p. 148-156, 1996.
Zhu J., Zou H., Rosset S., Hastie T., "Multi-class AdaBoost", Statistics and Its Interface, 2, p. 349-360, 2009.

Sunday, December 20, 2015

Python - Machine Learning with scikit-learn (slides)

This course material presents some modules and classes of scikit-learn, a library for machine learning in Python.

As a first step, we focus on a typical classification process: the subdivision of the dataset into training and test sets; the learning of a logistic regression on the training sample; the application of the model to the test set in order to obtain the predicted class values; the evaluation of the classifier using the confusion matrix and the calculation of performance measures.
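
A minimal sketch of this process with scikit-learn might look as follows (the file name and the target column are hypothetical).

```python
# Sketch of the typical classification process described in the slides.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

df = pd.read_csv("dataset.csv")
X, y = df.drop(columns="target"), df["target"]

# subdivision into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# learning of the logistic regression on the training sample
clf = LogisticRegression()
clf.fit(X_train, y_train)

# applying the model to the test set, then evaluation
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("error rate:", 1.0 - accuracy_score(y_test, y_pred))
```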

As a second step, we study other important aspects of the classification task: cross-validation error estimation when we deal with a small dataset; the scoring process for direct marketing; grid search for detecting the optimal parameters of an algorithm on a given dataset; and the feature selection issue.
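
These themes can be sketched in a few lines as well, reusing the X and y of the previous snippet (the parameter grid and the number of selected features are arbitrary choices for illustration).

```python
# Sketch of the second-step themes: cross-validation, scoring, grid search,
# feature selection. Reuses X, y, X_train, ... from the previous snippet.
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

# cross-validation error estimation (useful on a small dataset)
print("cv accuracy:", cross_val_score(clf, X, y, cv=10).mean())

# scoring (e.g. for direct marketing): probability of the positive class
scores = clf.fit(X_train, y_train).predict_proba(X_test)[:, 1]

# grid search for the "optimal" parameters of the algorithm
grid = GridSearchCV(clf, param_grid={"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)

# feature selection by recursive feature elimination
selector = RFE(clf, n_features_to_select=3).fit(X, y)
print(selector.support_)
```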

Keywords: python, numpy, pandas, scikit-learn, logistic regression, predictive analytics
Slides: Machine Learning with scikit-learn
Dataset and programs: scikit-learn - Programs and dataset
References:
"scikit-learn -- Machine Learning in Python" on scikit-learn.org
Python - Official Site

Tuesday, December 8, 2015

Python - Statistics with SciPy (slides)

This course material presents the use of some modules of SciPy, a library for scientific computing in Python. We focus especially on the stats package, which allows us to perform statistical tests such as the comparison of means for independent and related samples, the comparison of variances, and the measurement of the association between two variables. We also study the cluster package, especially the k-means and the hierarchical agglomerative clustering algorithms.

SciPy operates on the NumPy vectors and matrices which were presented previously.
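
As an illustration, here is a small sketch of the two packages on synthetic data.

```python
# Sketch of scipy.stats and scipy.cluster on synthetic data.
import numpy as np
from scipy import stats
from scipy.cluster.vq import kmeans, vq
from scipy.cluster.hierarchy import linkage, fcluster

np.random.seed(0)
a = np.random.normal(0.0, 1.0, 30)
b = np.random.normal(0.5, 1.0, 30)

# comparison of means for independent samples, and a normality test
print(stats.ttest_ind(a, b))
print(stats.shapiro(a))

# association between two variables
print(stats.pearsonr(a, b))
print(stats.spearmanr(a, b))

# k-means (2 clusters) and hierarchical agglomerative clustering
X = np.column_stack((a, b))
centroids, _ = kmeans(X, 2)
labels, _ = vq(X, centroids)
Z = linkage(X, method="ward")        # dendrogram structure
groups = fcluster(Z, t=2, criterion="maxclust")
```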

Keywords: python, numpy, scipy, descriptive statistics, cumulative distribution functions, sampling, random number generator, normality test, test for comparing populations, pearson correlation, spearman correlation, cluster analysis, k-means, hac, dendrogram
Slides: scipy.stats and scipy.cluster
Dataset and programs: SciPy - Programs and dataset
References:
SciPy Reference Guide on SciPy.org
Python - Official Site

Wednesday, October 28, 2015

Python - Handling matrices with NumPy (slides)

This course material presents the manipulation of matrices using NumPy. The array type is common to vectors and matrices. The distinguishing feature is the addition of a second dimension, so that the values are organized within a rows x columns structure.

Matrices pave the way for operators which play a fundamental role in statistical modeling and exploratory data analysis (e.g. matrix inversion, solving systems of equations, calculation of eigenvalues and eigenvectors, singular value decomposition, etc.).
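
A short sketch of these operators, using the numpy.linalg module:

```python
# The linear-algebra operators mentioned above, via numpy.linalg.
import numpy as np

A = np.array([[4.0, 2.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

Ainv = np.linalg.inv(A)        # matrix inversion
x = np.linalg.solve(A, b)      # solving the system A x = b
w, V = np.linalg.eig(A)        # eigenvalues and eigenvectors
U, s, Vt = np.linalg.svd(A)    # singular value decomposition
```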

Keywords: python language, numpy, vector, matrix, array, creation, extraction
Slides: NumPy Matrices
Datasets and programs: Matrices
References:
NumPy Reference on SciPy.org
Haenel, Gouillart, Varoquaux, "Python Scientific Lecture Notes".
Python - Official Site

Thursday, October 8, 2015

Python - Handling vectors with NumPy (slides)

Python is becoming more and more popular among data scientists. I decided to introduce statistical programming in Python among my teachings at the University (reference page in French).

This first course material describes the handling of vectors with the NumPy library. Their structure and functionality show a certain similarity with vectors under R.
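
A small sketch of this similarity, with the R counterparts shown as comments:

```python
# Creation and extraction on NumPy vectors, compared with R vectors.
import numpy as np

v = np.array([1.2, 2.5, 3.2, 1.8])   # R: v <- c(1.2, 2.5, 3.2, 1.8)
print(v[0])        # first value; R indexing starts at 1, NumPy at 0
print(v[1:3])      # slice, upper bound excluded
print(v[v > 2.0])  # logical (boolean) extraction, as in R
print(v.mean(), v.sum())
```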

Keywords: python language, numpy, vector, array, creation, extraction
Slides: NumPy Vectors
Datasets and programs: Vectors
References:
NumPy Reference on SciPy.org
Haenel, Gouillart, Varoquaux, "Python Scientific Lecture Notes".
Python - Official Site

Tuesday, June 2, 2015

Cross-validation, leave-one-out, bootstrap (slides)

In supervised learning, it is commonly accepted that one should not use the same sample to build a predictive model and estimate its error rate. The error obtained under these conditions - called the resubstitution error rate - is (very often) too optimistic, leading one to believe that the model will show excellent performance in prediction.

A typical approach is to divide the data into 2 parts (holdout approach): a first sample, called the training sample, is used to construct the model; a second sample, called the test sample, is used to measure its performance. The measured error rate honestly reflects the behavior of the model in generalization. Unfortunately, on a small dataset, this approach is problematic. By reducing the amount of data presented to the learning algorithm, we cannot correctly learn the underlying relation between the descriptors and the class attribute. At the same time, since the part devoted to testing remains limited, the measured error has a high variance.

In this document, I present resampling techniques (cross-validation, leave-one-out and bootstrap) for estimating the error rate of a model constructed from the totality of the available data. A study on simulated data (the "waves" dataset; Breiman et al., 1984) is used to analyze the behavior of these approaches for various learning algorithms (decision trees, linear discriminant analysis, neural networks [perceptron]).
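
The document itself works with Tanagra components; purely as an illustration, the three estimators can be sketched in Python as follows (assuming X and y are a pandas DataFrame and Series loaded beforehand; the number of bootstrap replications is arbitrary).

```python
# Sketch of cross-validation, leave-one-out and bootstrap error estimation
# on a decision tree. X (DataFrame) and y (Series) are assumed loaded.
import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample
from sklearn.metrics import accuracy_score

clf = DecisionTreeClassifier(random_state=0)

# 10-fold cross-validation
cv_err = 1.0 - cross_val_score(clf, X, y, cv=10).mean()

# leave-one-out
loo_err = 1.0 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

# bootstrap: fit on a bootstrap sample, evaluate on the out-of-bag cases
n, errs = len(y), []
for _ in range(25):
    idx = resample(np.arange(n), n_samples=n)     # sampling with replacement
    oob = np.setdiff1d(np.arange(n), idx)
    clf.fit(X.iloc[idx], y.iloc[idx])
    errs.append(1.0 - accuracy_score(y.iloc[oob], clf.predict(X.iloc[oob])))
boot_err = np.mean(errs)
```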

Keywords: resampling, cross-validation, leave-one-out, bootstrap, error rate estimation, holdout, resubstitution, train, test, learning sample
Components (Tanagra): CROSS-VALIDATION, BOOTSTRAP, TEST, LEAVE-ONE-OUT
Slides: Error rate estimation
References:
A. Molinaro, R. Simon, R. Pfeiffer, "Prediction error estimation: a comparison of resampling methods", Bioinformatics, 21(15), pages 3301-3307, 2005.
Tanagra tutorial, "Resampling methods for error estimation", July 2009.

Saturday, April 11, 2015

R programming under Hadoop

The aim of this tutorial is to show how to program the famous "word count" algorithm on a set of files stored in the HDFS file system.

The "word count" is a state-of-the-art example for the programming under Hadoop. It is described everywhere on the web. But, unfortunately, the tutorials which describe the task are often not reproducible. The dataset are not available. The whole process, including the installation of the Hadoop framework, are not described. We do not know how to access to the files stored in the HDFS file system. In short, we cannot run programs and understand in details how they work.

In this tutorial, we describe the whole process. We detail first the installation of a virtual machine which contains a single-node Hadoop cluster. Then we show how to install R and RStudio Server, which allow us to write and run programs. Last, we write some programs based on the MapReduce scheme.
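
The tutorial writes the job in R with the rmr2 package; purely to fix ideas, here is a plain-Python simulation of the map / shuffle / reduce scheme for the word count, without any Hadoop machinery.

```python
# Plain-Python simulation of the MapReduce word count scheme (no Hadoop).
from collections import defaultdict

def map_phase(line):
    # map: emit a (word, 1) pair for each word of the line
    return [(word, 1) for word in line.lower().split()]

def reduce_phase(word, counts):
    # reduce: sum the counts collected for one key
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
shuffled = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):   # map step
        shuffled[word].append(one)      # shuffle: group values by key
result = [reduce_phase(w, c) for w, c in shuffled.items()]  # reduce step
print(result)   # [('the', 3), ('quick', 1), ...]
```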

The steps, and therefore the sources of error, are numerous. We use many screenshots so that each operation can actually be understood. This is the reason for the unusual presentation format of this tutorial.

Keywords: big data, big data analytics, mapreduce, rmr2 package, rhdfs package, hadoop, rhadoop, R software, rstudio, rstudio server, cloudera, R language
Tutorial: en_Tanagra_Hadoop_with_R.pdf
Files: hadoop_with_r.zip
References:
Tanagra Tutorial, "MapReduce with R", Feb. 2015. 
Hugh Devlin, "Mapreduce in R", Jan. 2014.

Sunday, February 22, 2015

MapReduce with R

Big data has been a very popular topic these last years. Big data analytics refers to the process of discovering useful information or knowledge from big data. This is an important issue for organizations. In concrete terms, the aim is to extend, adapt or even create novel exploratory data analysis or data mining approaches for new data sources whose main characteristics are “volume”, “variety” and “velocity”.

Distributed computing is essential in the big data context. It is illusory to try to increase the power of servers indefinitely in order to follow the exponential growth of the information to process. The solution relies on the efficient cooperation of a myriad of networked computers, ensuring both the volume management and the computing power. Hadoop is a solution commonly cited for this requirement. It is a set of algorithms (an open-source software framework written in Java) for distributed storage and distributed processing of very large datasets (big data) on computer clusters built from commodity hardware. For the implementation of distributed programs, the MapReduce programming model plays an important role: the processing of a large dataset can be implemented with parallel algorithms on a cluster of connected computers (nodes).

In this tutorial, we are interested in MapReduce programming in R. We use the RHadoop technology from Revolution Analytics. The "rmr2" package in particular allows us to learn MapReduce programming without having to install the Hadoop environment, which is already complicated enough. There are some tutorials about this subject on the web. The one by Hugh Devlin (January 2014) is undoubtedly one of the most interesting, but it is perhaps too sophisticated for students who are not very familiar with programming in R. So I decided to start afresh with very simple examples in a first step. Then, in a second step, we progress by programming a simple data mining algorithm such as multiple linear regression.
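
Again, the tutorial itself relies on rmr2 in R; the following plain-Python sketch only illustrates how linear regression fits the MapReduce scheme: each "mapper" computes the partial cross-products X'X and X'y on its chunk of the data, and the "reducer" sums them and solves the normal equations.

```python
# Plain-Python illustration of linear regression in the MapReduce scheme.
import numpy as np

def mapper(X_chunk, y_chunk):
    # each node computes the cross-products on its own chunk of data
    return X_chunk.T @ X_chunk, X_chunk.T @ y_chunk

def reducer(partials):
    # sum the partial cross-products, then solve the normal equations
    XtX = sum(p[0] for p in partials)
    Xty = sum(p[1] for p in partials)
    return np.linalg.solve(XtX, Xty)    # OLS coefficients

# synthetic data split into 3 chunks, as if distributed over 3 nodes
rng = np.random.RandomState(1)
X = np.column_stack((np.ones(90), rng.normal(size=(90, 2))))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=90)
partials = [mapper(X[i:i + 30], y[i:i + 30]) for i in range(0, 90, 30)]
print(reducer(partials))    # close to [1, 2, -1]
```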

Keywords: big data, big data analytics, mapreduce, rmr2 package, hadoop, rhadoop, one-way anova, linear regression
Tutorial: en_Tanagra_MapReduce.pdf
Dataset: en_mapreduce_with_r.zip
References:
Hugh Devlin, "Mapreduce in R", Jan. 2014.
Tanagra Tutorial, "Parallel programming in R", October 2013.