Tanagra - Data Mining and Data Science Tutorials

Automatic translation of tutorials

2019-05-07T10:04:00.004+02:00

For nearly 10 years, I constantly translated my tutorials into English because I found that machine translation tools were not very efficient. This work was very tedious, but I convinced myself that I had to do it. For a year now, I have realized that these tools provide good translations, so much so that I used them directly for the latest English documents I have produced. Under these conditions, it seems more appropriate to me to direct you to my tutorials in French and advise you to use the online machine translation software.

Thank you very much for all these years of following the publications I have made on this blog. The adventure is not over yet because I still continue, and for a long time I hope, to produce course and tutorial materials for researchers and students.

Ricco Rakotomalala

Lyon, May 5th, 2019.

Tanagra website statistics for 2017

2018-01-03T18:22:00.001+01:00

The year 2017 ends, 2018 begins. I wish you all a very happy year 2018.

A small statistical report on the website statistics for 2017. All sites (Tanagra, course materials, e-books, tutorials) has been visited 222,293 times this year, 609 visits per day.

Since February, the 1st, 2008, the date from which I installed the Google Analytics counter, there was 2,33,371 visits (644 daily visits).

Who are you? The majority of visits come from France and Maghreb. Then there are a large part of French speaking countries, notably because some pages are exclusively in French. In terms of non-francophone countries, we observe mainly the United States, India, UK, Germany, ...

39 new course materials and tutorials were posted online this year: 18 in French language, 21 in English.

The pages containing course materials about Data Science and Programming (R and Python) are the most popular ones. This is not really surprising.

Happy New Year 2018 to all.

Ricco.
Slideshow: Website statistics for 2017

Sparse data file format

2018-01-02T21:47:00.001+01:00

The data to be processed with machine learning algorithms are increasing in size. Especially when we need to process unstructured data. The data preparation (e. g. the use of a bag of words representation in text mining) leads to the creation of large data tables where, often, the number of columns (descriptors) is higher than the number of rows (observations). With the singularity that the table contains many zero values. In this context, storing all these zero values into the data file is not opportune. A data compression strategy without loss of information must be implemented, which must remain simple so that the file is readable with a text editor.

In this tutorial, we describe the use of the sparse data file format handled by Tanagra (from the version 1.4.4). It is based on the file format processed by famous libraries for machine learning (svmlight, libsvm, libcvm). We show its use in a text categorization process applied to the Reuters database, well known in data mining. We will observe that the use of this kind of sparse format enables to reduce dramatically the data file size.

Keywords: sparse dataset, dense dataset, attribute-value table, support vector machine, svm, libsvm, c-svc, logistic regression, tr-irls, scoring, roc curve, auc, area under curve
Componets: VIEW DATASET, CONT TO DISC, UNIVARIATE DUISCRETE STAT, SELECT FIRST EXAMPLES, C-SVC, SCORING, ROC CURVE
Tutorial: en_Tanagra_Sparse_File_Format.pdf
Dataset: reuters.data.zip
References:
T. Joachims, "SVMlight: Support Vector Machine".
UCI Repository, "Reuters-21578 Text Categorization Collection".

Configuration of a multilayer perceptron

2017-12-29T22:10:00.001+01:00

The multilayer perceptron is one of the most popular neural network approach for supervised learning, and that it was very effective if we know to determine the number of neurons in the hidden layers.

In this tutorial, we will try to explain the role of neurons in the hidden layer of the multilayer perceptron (when we have one hidden layer). Using an artificial toy dataset, we show the behavior of the classifier when we modify the number of neurons.

We work with Tanagra in a first step. Then, we use R (nnet package) to create a program to determine automatically the right number of neurons into the hidden layer.

Keywords: neural network, perceptron, multilayer perceptron, MLP
Components: MULTILAYER PERCEPTRON, FORMULA
Tutorial: Configuration of a MLP
Dataset: artificial2d.zip
References:
Tanagra Tutorials, "Single layer and multilayer perceptron (slides)", September 2014.
Tanagra Tutorials, "Multilayer perceptron - Software comparison", November 2008.

CDF and PPF in Excel, R and Python

2017-10-25T11:00:00.001+02:00

How to compute the cumulative distribution functions and the percent point functions of various commonly used distributions in Excel, R and Python.

I use Excel (in conjunction with Tanagra or Sipina), R and Python for the practical classes of my courses about data mining and statistics at the University. Often, I ask students to perform hypothesis tests or to calculate confidence intervals, etc.

We work on computers, it is obviously out of the question to use the statistical tables to obtain the quantile or p-value of the commonly used distribution functions. In this tutorial, I present the main functions for normal distribution, Student's t-distribution, chi-squared distribution and Fisher-Snedecor distribution. I realized that students sometimes find it difficult to match the reading of statistical tables with the functions they have difficulty identifying in software. It is also an opportunity for us to verify the equivalences between the functions proposed by Excel, R (stats package) and Python (scipy package). Whew! At least on the few illustrative examples given in our document, the results are consistent.

Keywords: excel, r, stats package, python, scipy package, p-value, quantile, cdf, cumulative distribution function, ppf, percent point function, quantile function
Tutorial: CDF and PPF

The "compiler" package for R

2017-10-18T15:23:00.002+02:00

It is widely agreed that R is not a fast language. Notably, because it is an interpreted language. To overcome this issue, some solutions exists which allow to compile functions written in R. The gains in computation time can be considerable. But it depends on our ability to write code that can benefit from these tools.

In this tutorial, we study the efficiency of the Luke Tierney's “compiler” package which is provided in the base distribution of R. We program two standard data analysis treatments, (1) with and (2) without using loops: the scaling of variables in a data frame; the calculation of a correlation matrix by matrix product. We compare the efficiency of non-compiled and compiled versions of these functions.

We observe that the gain for the compiled version is dramatic for the version with loops, but negligible for the second variant. We note also that, in the R 3.4.2 version used, it is not needed to compile explicitly the functions containing loops because it exists a JIT (just in time compilation) mechanism which ensure to our code the maximal performance.

Keywords: package compiler, cmpfun, byte code, package rbenchmark, benchmark, JIT, just in time
Tutorial: en_Tanagra_R_compiler_package.pdf
Program: compilation_r.zip
References :
Luke Tierney, "A Byte Code Compiler for R", Department of Statistics and Actuarial Science, University of Iowa, March 30, 2012.
Package 'compiler' - "Byte Code Compiler"

Regression analysis in Python

2017-10-09T21:17:00.002+02:00

Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

In this tutorial, we will try to identify the potentialities of StatsModels by conducting a case study in multiple linear regression. We will discuss about: the estimation of model parameters using the ordinary least squares method, the implementation of some statistical tests, the checking of the model assumptions by analyzing the residuals, the detection of outliers and influential points, the analysis of multicollinearity, the calculation of the prediction interval for a new instance.

Keywords: regression, statsmodels, pandas, matplotlib
Tutorial: en_Tanagra_Python_StatsModels.pdf
Dataset and program: en_python_statsmodels.zip
References:
StatsModels: Statistics in Python

Document classification in Python

2017-10-05T22:55:00.003+02:00

The aim of text categorization is to assign documents to predefined categories as accurately as possible. We are within the supervised learning framework, with a categorical target attribute, often binary. The originality lies in the nature of the input attribute, which is a textual document. It is not possible to implement predictive methods directly, it is necessary to go through a data preparation phase.

In this tutorial, we will describe a text categorization process in Python using mainly the text mining capabilities of the scikit-learn package, which will also provide data mining methods (logistics regression). We want to classify SMS as "spam" (spam, malicious) or "ham" (legitimate). We use the “SMS Spam Collection v.1” dataset.

Keywords: text mining, document categorization, corpus, bag of words, f1-score, recall, precision, dimensionality reduction, variable selection, logistic regression, scikit learn, python
Tutorial: Spam identification
Dataset: Corpus and Python program
References:
Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, "A. Contributions to the Study of SMS Spam Filtering: New Collection and Results", in Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.

SVM: Support Vector Machine in R and Python

2017-09-28T09:15:00.003+02:00

This tutorial completes the course material devoted to the Support Vector Machine approach (SVM).

It highlights two important dimensions of the method: the position of the support points and the definition of the decision boundaries in the representation space when we construct a linear separator; the difficulty to determine the “best” values of the parameters for a given problem.

We will use R (“e1071” package) and Python (“scikit-learn” package).

Keywords: svm, package e1071, logiciel R, logiciel Python, package scikit-learn, sklearn
Tutorial: SVM - Support Vector Machine
Dataset and programs: svm_r_python.zip
References:
Tanagra Tutorial, "Support Vector Machine", May 2017.
Tanagra Tutorial, "Implementing SVM on large dataset", July 2009.

Association rule learning with ARS

2017-09-11T18:05:00.003+02:00

SIPINA is known for its decision tree induction algorithms. In fact, the distribution includes two other tools that are little known to the public: REGRESS, which is specialized in multiple linear regression, we described it in one of our tutorials ; and an association rules extraction tool, called simply Association Rule Software (ARS).

In this tutorial, I describe the use of the ARS tool. Its interactivity with Excel spreadsheet is its main advantage. We launch the software from Excel using the “sipina.xla” add-in. We can easily retrieve the rules in the spreadsheet. Then, we can explore them (the mined rules) using the Excel data handling capabilities. The ability to filter and sort rules according to different criteria is a great help in detecting interesting rules. This is a very important aspect because the profusion of rules can quickly confuse the data miner.

Keywords: ARS, association rule software, excel spreadsheet, filtering and sorting rules, interestingness measures
Components: ASSOCIATION RULE SOFTWARE
Tutorial: en_Tanagra_Association_Sipina.pdf
Dataset: market_basket.zip
References:
Tanagra Tutorial, "Association rule learning (slides)", August 2014.

Linear classifiers

2017-08-25T09:23:00.002+02:00

In this tutorial, we study the behavior of 5 linear classifiers on artificial data. Linear models are often the baseline approaches in supervised learning. Indeed, based on a simple linear combination of predictive variables, they have the advantage of simplicity: the reading of the influence of each descriptor is relatively easy (signs and values of the coefficients); learning techniques are often (not always) fast, even on very large databases. We are interested in: (1) the naive bayes classifier; (2) the linear discriminant analysis; (3) the logistic regression; (4) the perceptron (single-layer perceptron); (5) the support vector machine (linear SVM).

The experiment was conducted under R. The source code accompanies this document. My idea, besides the theme of the linear classifiers that concerns us, is also to describe the different stages of the elaboration of an experiment for the comparison of learning techniques. In addition, we show also the results provided by the linear approaches implemented in various tools such as Tanagra, Knime, Orange, Weka and RapidMiner.

Keywords: linear classifier, naive bayes, linear discriminant analysis, logistic regression, perceptron, neural network, linear svm, support vector machine, decision tree, rpart, random forest, k-nn, nearest neighbors, e1071 package, nnet package, rf package, class package
Components : NAIVE BAYES CONTINUOUS, LINEAR DISCRIMINANT ANALYSIS, BINARY LOGISTIC REGRESSION, MULTILAYER PERCEPTRON, SVM
Tutorial: en_Tanagra_Linear_Classifier.pdf
Programs and dataset: linear_classifier.zip
References:
Wikipedia, "Linear Classifier".

Discriminant analysis and linear regression

2017-08-18T13:46:00.003+02:00

Linear discriminant analysis and linear regression are both supervised learning techniques. But, the first one is related to classification problems i.e. the target attribute is categorical; the second one is used for regression problems i.e. the target attribute is continuous (numeric).

However, there are strong connections between these approaches when we deal with a binary target attribute. From a practical example, we describe the connections between the two approaches in this case. We detail the formulas for obtaining the coefficients of discriminant analysis from those of linear regression.

We perform the calculations under Tanagra and R.

Keywords: linear discriminant analysis, predictive discriminant analysis, multiple linear regression, wilks' lambda, mahalanobis distance, score function, linear classifier, sas, proc discrim, proc stepdisc
Components: LINEAR DISCRIMINANT ANALYSIS, MULTIPLE LINEAR REGRESSION
Tutorial: en_Tanagra_LDA_and_Regression.pdf
Programs and dataset: lda_regression.zip
References:
C.J. Huberty, S. Olejnik, « Applied MANOVA and Discriminant Analysis »,Wiley, 2006.
R. Tomassone, M. Danzart, J.J. Daudin, J.P. Masson, « Discrimination et Classement », Masson, 1988.

Gradient boosting with R and Python

2017-08-11T22:20:00.003+02:00

This tutorial follows the course material devoted to the “Gradient Boosting” to which we are referring constantly in this document. It also comes in addition to the supports and tutorials for Bagging, Random Forest and Boosting approaches (see References).

The thread will be basic: after importing the data which are split into two data files (learning and testing) in advance, we build predictive models and evaluate them. The test error rate criterion is used to compare performance of various classifiers.

The question of parameters, particularly sensitive in the context of the gradient boosting, is studied. Indeed, there are many parameters, and their influence on the behavior of the classifier is considerable. Unfortunately, if we guess about the paths to explore to improve the quality of the models (more or less regularization), accurately identifying the parameters to modify and set the right values are difficult, especially because they (the various parameters) can interact with each other. Here, more than for other machine learning methods, the trial and error strategy takes a lot of importance.

We use R and Python with their appropriate packages.

Keywords: gradient boosting, R software, decision tree, adabag package, rpart, xgboost, gbm, mboost, Python, scikit-learn package, gridsearchcv, boosting, random forest
Tutorial: Gradient boosting
Programs and datasets: gradient_boosting.zip
References:
Tanagra tutorial, "Gradient boosting - Slides", June 2016.
Tanagra tutorial, "Bagging, Random Forest, Boosting - Slides", December 2015.
Tanagra tutorial, "Random Forest and Boosting with R and Python", December 2015.

Statistical analysis with Gnumeric

2017-08-04T07:30:00.001+02:00

The spreadsheet is a valuable tool for data scientist. This is what the annual KDnuggets polls reveal during these last years where Excel spreadsheet is always well placed. In France, this popularity is largely confirmed by its almost systematic presence in job postings related to the data processing (statistics, data mining, data science, big data/data analytics, etc.). Excel is specifically referred, but this success must be viewed as an acknowledgment of the skills and capabilities of the spreadsheet tools.

This tutorial is devoted to the Gnumeric Spreadsheet 1.12.12. It has interesting features: Setup and installation programs are small because it is not part of an office suite; It is fast and lightweight; It is dedicated to numerical computation and natively incorporates a "statistics" menu with the common statistical procedures (parametric tests, non-parametric tests, regression, principal component analysis, etc.); and, it seems more accurate than some popular spreadsheets programs. These last two points have caught my attention and have convinced me to study it in more detail. In the following, we make a quick overview of Gnumeric's statistical procedures. If it is possible, we compare the results with those of Tanagra 1.4.50.

Keywords: gnumeric, spreadsheet, descriptive statistics, principal component analysis, pca, multiple linear regression, wilcoxon signed rank test, welch test unequal variance, mann and whitney, analysis of variance, anova
Tanagra components: MORE UNIVARIATE CONT STAT, PRINCIPAL COMPONENT ANALYSIS, MULTIPLE LINEAR REGRESSION, WILCOXON SIGNED RANKS TEST, T-TEST UNEQUAL VARIANCE, MANN-WHITNEY COMPARISON, ONE-WAY ANOVA
Tutorial: en_Tanagra_Gnumeric.pdf
Dataset : credit_approval.zip
References :
Gnumeric, "The Gnumeric Manual, version 1.12".

Failure resolved

2017-08-02T07:36:00.001+02:00

Hi,

It seems that the failure has been resolved since yesterday "August 1st, 2017".

Again, sorry for the inconvenience. I hope that the continuity of service will be ensured throughout the summer.

Kind regards,

Ricco (August 2nd, 2017).

File server outage

2017-07-27T23:50:00.003+02:00

Since a few days (since the 07/24/2017 approximately), the server of the Eric laboratory that hosts the Tanagra project files (software, books, course materials, tutorials...) is idle. After a power outage, there is nobody to restart the server during the summer period. And the server is located in a room in which I do not have access.

So we wait. And it will take a little time, the summer break lasts a month, our University (and Lab) is officially reopened on August 21st! I am sorry for users that work from the documents that I put online. This difficulty is totally beyond my control and I cannot do anything about it.

Some internet users are reported to me the problem. I take the initiative to inform you. As soon as the situation is back in order, I will let you know.

Kind regards,

Ricco (July 27th, 2017).

Interpreting cluster analysis results

2017-07-22T18:28:00.003+02:00

Interpretation of the clustering structure and the clusters is an essential step in unsupervised learning. Identifying the characteristics that underlie differentiation between groups allows to ensuring their credibility.

In this course material, we explore the univariate and multivariate techniques. The first ones have the merit of the ease of calculation and reading, but do not take into account the joint effect of the variables. The seconds are a priori more efficient, but require additional expertise to fully understand the results.

Keywords: cluster analysis, clustering, unsupervised learning, percentage of variance explained, V-Test, test value, distance between centroids, correlation ratio
Slides: Characterizing the clusters
References:
Tanagra Tutorial, "Understanding the 'test value' criterion", May 2009.
Tanagra Tutorial, "Hierarchical agglomerative clustering", June 2017.
Tanagra Tutorial, "K-Means clustering", June 2017.

Kohonen map with R

2017-07-14T09:59:00.003+02:00

This tutorial complements the course material concerning the Kohonen map or Self-organizing map (June 2017). In a first time, we try to highlight two important aspects of the approach: its ability to summarize the available information in a two-dimensional space; Its combination with a cluster analysis method for associating the topological representation (and the reading that one can do) to the interpretation of the groups obtained from the clustering algorithm. We use the R software and the “Kohonen” package (Wehrens et Buydens, 2007). In a second time, we carry out a comparative study of the quality of the partitioning with the one obtained with the K-means algorithm. We use an external evaluation i.e. we compare the clustering results with pre-established classes. This procedure is often used in research to evaluate the performance of clustering methods. It takes on its meaning when it is applied to artificial data where the true class membership is known. We use the K-Means and Kohonen-Som components of Tanagra.

This tutorial is based on the Shane Lynn's article on the R-bloggers website (Lynn, 2014). I completed it by introducing the intermediate calculations to better understand the meaning of the charts, and by conducting the comparative study.

Keywords: som, self organizing map, kohonen network, data visualization, dimensionality reduction, cluster analysis, clustering, hierarchical agglomerative clustering, hac, two-step clustering, R software, kohonen package, k-means, external evaluation, heatmaps
Components: KOHONEN-SOM
Tutorial: Kohonen map with R
Program and dataset: waveform - som
References:
Tanagra tutorial, "Self-organizing map (slides)", June 2017.
Tanagra Tutorial, "Self-organizing map (with Tanagra)", July 2009.

Cluster analysis with Python - HAC and K-Means

2017-07-08T20:17:00.002+02:00

Keywords: python, scipy, scikit-learn, cluster analysis, clustering, hac, hierarchical agglomerative clustering, , k-means, principal component analysis, PCA
Turorial: hac and k-means with Python
Dataset and cource code: hac_kmeans_with_python.zip
References :
Marie Chavent, Teaching Page, University of Bordeaux.

Tanagra Tutorials, "Cluster analysis with R - HAC and K-Means", July 2017.

Cluster analysis with R - HAC and K-Means

2017-07-06T17:16:00.001+02:00

This tutorial describes a cluster analysis process. We deal with a set of cheeses (29 instances) characterized by their nutritional properties (9 variables). The aim is to determine groups of homogeneous cheeses in view of their properties.

We inspect and test two approaches using two procedures of the R software: the Hierarchical Agglomerative Clustering algorithm (hclust) ; and the K-Means algorithm (kmeans).

The data file "fromage.txt" comes from the teaching page of Marie Chavent from the University of Bordeaux. The excellent course materials and corrected exercises (commented R code) available on its website will complete this tutorial, which is intended firstly as a simple guide for the introduction of the R software in the context of the cluster analysis.

Keywords: R software, cluster analysis, clustering, hac, hierarchical agglomerative clustering, , k-means, fpc package, principal component analysis, PCA
Components: hclust, kmeans, kmeansruns
Turorial: hac and k-means with R
Dataset and cource code: hac_kmeans_with_r.zip
References :
Marie Chavent, Teaching Page, University of Bordeaux.

k-medoids clustering (slides)

2017-07-03T22:35:00.003+02:00

K-medoids is a partitioning-based clustering algorithm. It is related to the k-means but, instead of using the centroid as reference data point for the cluster, we use the medoid which is the individual nearest to all the other points within its cluster. One of the main consequence of this approach is that the resulting partition is less sensible to outliers.

This course material describes the algorithm. Then, we focus on the silhouette tool which can be used to determine the right number of clusters, a recurring open problem in cluster analysis.

Keywords: cluster analysis, clustering, unsupervised learning, paritionning method, relocation approach, medoid, PAM, partitioning aroung medoids, CLARA, clustering large applications, silhouette, silhouette plot
Slides: Cluster analysis - k-medoids algorithm
References:
Wikipedia, "k-medoids".

k-means clustering (slides)

2017-06-20T19:25:00.003+02:00

K-Means clustering is a popular cluster analysis method. It is simple and its implementation does not require to keep in memory all the dataset, thus making it possible to process very large databases.

This course material describes the algorithm. We focus on the different extensions such as the processing of qualitative or mixed variables, fuzzy c-means, and clustering of variables (clustering around latent variables). We note that the k-means method is relatively adaptable and can be applied to a wide range of problems.

Keywords: cluster analysis, clustering, unsupervised learning, partition method, relocation
Slides: K-Means clustering
References :
Wikipedia, "k-means clustering".
Wikipedia, "Fuzzy clustering".

Self-Organizing Map (slides)

2017-06-13T16:03:00.003+02:00

A self-organizing map (SOM) or Kohonen network or Kohonen map is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, which preserves the topological properties of the input space (Wikipedia).

SOM is useful for the dimensionality reduction, data visualization and cluster analysis. In this course material, we outline the mechanisms underlying the approach. We focus on its practical aspects (e.g. various visualization possibilities, prediction on a new instance, extension of SOM to the clustering task,…).

Illustrative examples in R (kohonen package) and Tanagra are briefly presented.

Keywords: som, self organizing map, kohonen network, data visualization, dimensionality reduction, cluster analysis, clustering, hierarchical agglomerative clustering, hac, two-step clustering, R software, kohonen package
Components: KOHONEN-SOM
Slides: Kohonen SOM
References:
Wikipedia, "Self-organizing map".

Hierarchical agglomerative clustering (slides)

2017-06-10T19:12:00.005+02:00

In data mining, cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters) (Wikipedia).

In this course material, we focus on the hierarchical agglomerative clustering (HAC). Beginning from the individuals which initially represents groups, the algorithms merge the groups in a bottom-up fashion until only the instances are gathered in only one group. The process is materialized by a dendrogram which allows to evaluate the nature of the solution and helps to determine the appropriate number of clusters.

Examples of analysis under R, Python and Tanagra are described.

Keywords: hac, cluster analysis, clustering, unsupervised learning, tandem analysis, two-step clustering, R software, hclust, python, scipy package
Components: HAC, K-MEANS
Slides: cah.pdf
References:
Wikipedia, "Cluster analysis".
Wikipedia, "Hierarchical clustering".

Support vector machine (slides)

2017-05-20T08:44:00.001+02:00

In machine learning, support vector machines (SVM) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis (Wikipedia).

These slides show the background of the approach in the classification context. We address the binary classification problem, the soft-margin principle, the construction of the nonlinear classifiers by means of the kernel functions, the feature selection process, the multiclass SVM.

The presentation is complemented by the implementation of the approach under the open source software Python (Scikit-Learn), R (e1071) and Tanagra (SVM and C-SVC).

Keywords: svm, e1071 package, R software, Python, scikit-learn package, sklearn
Components: SVM, C-SVC
Slides: Support Vector Machine (SVM)
Dataset: svm exemples.xlsx
References:
Abe S., "Support Vector Machines for Pattern Classification", Springer, 2010.