Tanagra - Data Mining and Data Science Tutorials: March 2016

Thursday, March 31, 2016

Dummy coding for categorical predictor variables

In this tutorial, we show how to perform a dummy coding for categorical predictor variables in the context of the logistic regression learning process.

In fact, this is an old tutorial that I was written a long time ago (2007), but it is not referenced in this blog (which was created in 2008). I found it in my archives because I plan to write soon a tutorial about the strategies for the selection of categorical variables in logistic regression. I was wondering if I had already written something that may be linked to this subject (the treatment of the categorical predictors in logistic regression) in the past. Obviously, I would have to check most often my archives.

We use Tanagra 1.4.50 in this tutorial.

Keywords: logistic regression, dummy coding, categorical predictor variables
Components: SAMPLING, O_1_BINARIZE, BINARY LOGISTIC REGRESSION, TEST
Tutorial: Dummy coding - Logistic Regression
Dataset: heart-c.xlsx
References:
Wikipedia, "Logistic Regression"

Sunday, March 13, 2016

Cost-Sensitive Learning (slides)

This course material presents approaches for the consideration of misclassification costs in supervised learning. The baseline method is the one for which we do not take into account the costs.

Two issues are studied : the metric used for the evaluation of the classifier when a misclassification cost matrix is provided i.e. the expected cost of misclassification (ECM); some approaches which enable to guide the machine learning algorithm towards the minimization of the ECM.

Keywords: cost matrix, misclassification, expected cost of misclassification, bagging, metacost, multicost
Slides: Cost Sensitive Learning
References:
Tanagra Tutorial, "Cost-senstive learning - Comparison of tools", March 2009.
Tanagra Tutorial, "Cost-sensitive decision tree", November 2008.

Thursday, March 3, 2016

Hyper-threading and solid-state drive

After more than 6 years of good and faithful service, I decided to change my computer. It must be said that the former (Intel Core 2 Quad Q9400 2.66 Ghz - 4 cores - running Windows 7 - 64 bit) began to make disturbing sounds. I am obliged to put music to cover the rumbling of the beast and be able to work quietly.

The choice of the new computer was another matter. I spent the age of the race to the power which is necessarily fruitless anyway, given the rapid evolution of PCs. Nevertheless, I was sensitive to two aspects that I could not evaluate previously: The hyper-threading technology is effective in programming multithreaded algorithms of data mining? The use of temporary files to relieve the memory occupation takes advantage of SSD disk technology?

The new PC runs under Windows 8.1 (I wrote the French version of this tutorial one year ago). The processor is a Core I7 4770S (3.1 Ghz). It has 4 physical cores but 8 logical cores with the hyper-threading technology. The system disk is a SSD. These characteristics allows evaluate to their influences on (1) the implementation of multithreaded version of the linear discriminant analysis described in a previous paper (“Load balanced multithreading for LDA”, September 2013), where the number of threads used can be specified by the user; (2) the use of temporary files for the induction of decision trees algorithm, which enables us to handle very large dataset (“Dealing with very large dataset in Sipina”, January 2010; up to 9,634,198 instances and 41 variables).

In this tutorial, we reproduce the two studies using the SIPINA software. Our goal is to evaluate the behavior of these solutions (multi-threaded implementation, copy of data into temporary files to alleviate the memory occupation) on our new machine which, due to its characteristics, should expressly take advantage of them.

Keywords: hyper-threading, ssd disk, solid-state drive, multithread, multithreading, very large dataset, core i7, sipina, decision tree, linear discriminant analysis, lda
Tutorial: en_Tanagra_Hyperthreading.pdf
References:
Tanagra Tutorial, "Load balanced multithreading for LDA", September 2013.
Tanagra Tutorial, "Dealing with very large dataset in Sipina", January 2010.
Tanagra Tutorial, "Multithreading for decision tree induction", November 2010.