Tanagra - Data Mining and Data Science Tutorials: October 2017

Wednesday, October 25, 2017

CDF and PPF in Excel, R and Python

How to compute the cumulative distribution functions and the percent point functions of various commonly used distributions in Excel, R and Python.

I use Excel (in conjunction with Tanagra or Sipina), R and Python for the practical classes of my courses about data mining and statistics at the University. Often, I ask students to perform hypothesis tests or to calculate confidence intervals, etc.

We work on computers, it is obviously out of the question to use the statistical tables to obtain the quantile or p-value of the commonly used distribution functions. In this tutorial, I present the main functions for normal distribution, Student's t-distribution, chi-squared distribution and Fisher-Snedecor distribution. I realized that students sometimes find it difficult to match the reading of statistical tables with the functions they have difficulty identifying in software. It is also an opportunity for us to verify the equivalences between the functions proposed by Excel, R (stats package) and Python (scipy package). Whew! At least on the few illustrative examples given in our document, the results are consistent.

Keywords: excel, r, stats package, python, scipy package, p-value, quantile, cdf, cumulative distribution function, ppf, percent point function, quantile function
Tutorial: CDF and PPF

Wednesday, October 18, 2017

The "compiler" package for R

It is widely agreed that R is not a fast language. Notably, because it is an interpreted language. To overcome this issue, some solutions exists which allow to compile functions written in R. The gains in computation time can be considerable. But it depends on our ability to write code that can benefit from these tools.

In this tutorial, we study the efficiency of the Luke Tierney's “compiler” package which is provided in the base distribution of R. We program two standard data analysis treatments, (1) with and (2) without using loops: the scaling of variables in a data frame; the calculation of a correlation matrix by matrix product. We compare the efficiency of non-compiled and compiled versions of these functions.

We observe that the gain for the compiled version is dramatic for the version with loops, but negligible for the second variant. We note also that, in the R 3.4.2 version used, it is not needed to compile explicitly the functions containing loops because it exists a JIT (just in time compilation) mechanism which ensure to our code the maximal performance.

Keywords: package compiler, cmpfun, byte code, package rbenchmark, benchmark, JIT, just in time
Tutorial: en_Tanagra_R_compiler_package.pdf
Program: compilation_r.zip
References :
Luke Tierney, "A Byte Code Compiler for R", Department of Statistics and Actuarial Science, University of Iowa, March 30, 2012.
Package 'compiler' - "Byte Code Compiler"

Monday, October 9, 2017

Regression analysis in Python

Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

In this tutorial, we will try to identify the potentialities of StatsModels by conducting a case study in multiple linear regression. We will discuss about: the estimation of model parameters using the ordinary least squares method, the implementation of some statistical tests, the checking of the model assumptions by analyzing the residuals, the detection of outliers and influential points, the analysis of multicollinearity, the calculation of the prediction interval for a new instance.

Keywords: regression, statsmodels, pandas, matplotlib
Tutorial: en_Tanagra_Python_StatsModels.pdf
Dataset and program: en_python_statsmodels.zip
References:
StatsModels: Statistics in Python

Thursday, October 5, 2017

Document classification in Python

The aim of text categorization is to assign documents to predefined categories as accurately as possible. We are within the supervised learning framework, with a categorical target attribute, often binary. The originality lies in the nature of the input attribute, which is a textual document. It is not possible to implement predictive methods directly, it is necessary to go through a data preparation phase.

In this tutorial, we will describe a text categorization process in Python using mainly the text mining capabilities of the scikit-learn package, which will also provide data mining methods (logistics regression). We want to classify SMS as "spam" (spam, malicious) or "ham" (legitimate). We use the “SMS Spam Collection v.1” dataset.

Keywords: text mining, document categorization, corpus, bag of words, f1-score, recall, precision, dimensionality reduction, variable selection, logistic regression, scikit learn, python
Tutorial: Spam identification
Dataset: Corpus and Python program
References:
Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, "A. Contributions to the Study of SMS Spam Filtering: New Collection and Results", in Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.