Tanagra - Data Mining and Data Science Tutorials: August 2017

Friday, August 25, 2017

Linear classifiers

In this tutorial, we study the behavior of 5 linear classifiers on artificial data. Linear models are often the baseline approaches in supervised learning. Indeed, based on a simple linear combination of predictive variables, they have the advantage of simplicity: the reading of the influence of each descriptor is relatively easy (signs and values of the coefficients); learning techniques are often (not always) fast, even on very large databases. We are interested in: (1) the naive bayes classifier; (2) the linear discriminant analysis; (3) the logistic regression; (4) the perceptron (single-layer perceptron); (5) the support vector machine (linear SVM).

The experiment was conducted under R. The source code accompanies this document. My idea, besides the theme of the linear classifiers that concerns us, is also to describe the different stages of the elaboration of an experiment for the comparison of learning techniques. In addition, we show also the results provided by the linear approaches implemented in various tools such as Tanagra, Knime, Orange, Weka and RapidMiner.

Keywords: linear classifier, naive bayes, linear discriminant analysis, logistic regression, perceptron, neural network, linear svm, support vector machine, decision tree, rpart, random forest, k-nn, nearest neighbors, e1071 package, nnet package, rf package, class package
Components : NAIVE BAYES CONTINUOUS, LINEAR DISCRIMINANT ANALYSIS, BINARY LOGISTIC REGRESSION, MULTILAYER PERCEPTRON, SVM
Tutorial: en_Tanagra_Linear_Classifier.pdf
Programs and dataset: linear_classifier.zip
References:
Wikipedia, "Linear Classifier".

Friday, August 18, 2017

Discriminant analysis and linear regression

Linear discriminant analysis and linear regression are both supervised learning techniques. But, the first one is related to classification problems i.e. the target attribute is categorical; the second one is used for regression problems i.e. the target attribute is continuous (numeric).

However, there are strong connections between these approaches when we deal with a binary target attribute. From a practical example, we describe the connections between the two approaches in this case. We detail the formulas for obtaining the coefficients of discriminant analysis from those of linear regression.

We perform the calculations under Tanagra and R.

Keywords: linear discriminant analysis, predictive discriminant analysis, multiple linear regression, wilks' lambda, mahalanobis distance, score function, linear classifier, sas, proc discrim, proc stepdisc
Components: LINEAR DISCRIMINANT ANALYSIS, MULTIPLE LINEAR REGRESSION
Tutorial: en_Tanagra_LDA_and_Regression.pdf
Programs and dataset: lda_regression.zip
References:
C.J. Huberty, S. Olejnik, « Applied MANOVA and Discriminant Analysis »,Wiley, 2006.
R. Tomassone, M. Danzart, J.J. Daudin, J.P. Masson, « Discrimination et Classement », Masson, 1988.

Friday, August 11, 2017

Gradient boosting with R and Python

This tutorial follows the course material devoted to the “Gradient Boosting” to which we are referring constantly in this document. It also comes in addition to the supports and tutorials for Bagging, Random Forest and Boosting approaches (see References).

The thread will be basic: after importing the data which are split into two data files (learning and testing) in advance, we build predictive models and evaluate them. The test error rate criterion is used to compare performance of various classifiers.

The question of parameters, particularly sensitive in the context of the gradient boosting, is studied. Indeed, there are many parameters, and their influence on the behavior of the classifier is considerable. Unfortunately, if we guess about the paths to explore to improve the quality of the models (more or less regularization), accurately identifying the parameters to modify and set the right values are difficult, especially because they (the various parameters) can interact with each other. Here, more than for other machine learning methods, the trial and error strategy takes a lot of importance.

We use R and Python with their appropriate packages.

Keywords: gradient boosting, R software, decision tree, adabag package, rpart, xgboost, gbm, mboost, Python, scikit-learn package, gridsearchcv, boosting, random forest
Tutorial: Gradient boosting
Programs and datasets: gradient_boosting.zip
References:
Tanagra tutorial, "Gradient boosting - Slides", June 2016.
Tanagra tutorial, "Bagging, Random Forest, Boosting - Slides", December 2015.
Tanagra tutorial, "Random Forest and Boosting with R and Python", December 2015.

Friday, August 4, 2017

Statistical analysis with Gnumeric

The spreadsheet is a valuable tool for data scientist. This is what the annual KDnuggets polls reveal during these last years where Excel spreadsheet is always well placed. In France, this popularity is largely confirmed by its almost systematic presence in job postings related to the data processing (statistics, data mining, data science, big data/data analytics, etc.). Excel is specifically referred, but this success must be viewed as an acknowledgment of the skills and capabilities of the spreadsheet tools.

This tutorial is devoted to the Gnumeric Spreadsheet 1.12.12. It has interesting features: Setup and installation programs are small because it is not part of an office suite; It is fast and lightweight; It is dedicated to numerical computation and natively incorporates a "statistics" menu with the common statistical procedures (parametric tests, non-parametric tests, regression, principal component analysis, etc.); and, it seems more accurate than some popular spreadsheets programs. These last two points have caught my attention and have convinced me to study it in more detail. In the following, we make a quick overview of Gnumeric's statistical procedures. If it is possible, we compare the results with those of Tanagra 1.4.50.

Keywords: gnumeric, spreadsheet, descriptive statistics, principal component analysis, pca, multiple linear regression, wilcoxon signed rank test, welch test unequal variance, mann and whitney, analysis of variance, anova
Tanagra components: MORE UNIVARIATE CONT STAT, PRINCIPAL COMPONENT ANALYSIS, MULTIPLE LINEAR REGRESSION, WILCOXON SIGNED RANKS TEST, T-TEST UNEQUAL VARIANCE, MANN-WHITNEY COMPARISON, ONE-WAY ANOVA
Tutorial: en_Tanagra_Gnumeric.pdf
Dataset : credit_approval.zip
References :
Gnumeric, "The Gnumeric Manual, version 1.12".

Wednesday, August 2, 2017

Failure resolved

Hi,

It seems that the failure has been resolved since yesterday "August 1st, 2017".

Again, sorry for the inconvenience. I hope that the continuity of service will be ensured throughout the summer.

Kind regards,

Ricco (August 2nd, 2017).