Tanagra - Data Mining and Data Science Tutorials: January 2014

Wednesday, January 15, 2014

Scilab and R - Performance comparison

In a previous tutorial, we studied the Scilab tool in a data mining context. We noted that Scilab is well suited to data mining and that it is a credible alternative to R. However, we also observed that its toolboxes for statistical processing and data mining are far less numerous than those of R. In this second tutorial, we evaluate the behavior of Scilab when dealing with a dataset of 500,000 instances and 22 attributes, and we compare its performance with that of R. Two criteria are used: memory occupation, as measured in the Windows Task Manager, and execution time at each step of the process.

It is not possible to obtain an exhaustive point of view. To delimit the scope of our study, we specified a standard supervised learning scenario: loading a data file, building the predictive model with linear discriminant analysis, then calculating the confusion matrix and the resubstitution error rate. Of course, this study is incomplete. Still, it seems that Scilab is less efficient than R in the data management step, while it is quite efficient in the modeling step. This last assessment depends on the toolbox used.
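To make the measured quantities concrete, here is a minimal sketch in Python (not Scilab or R, and not the code used in the tutorial). It times one step of the scenario and computes the confusion matrix and the resubstitution error rate, i.e. the error measured on the same data used to build the model. The helper names `timed`, `confusion_matrix` and `error_rate`, as well as the label values, are hypothetical and chosen only for illustration.

```python
import time
from collections import Counter


def timed(func, *args):
    """Run func(*args); return its result and the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) pair of class labels."""
    return Counter(zip(actual, predicted))


def error_rate(matrix):
    """Share of instances that fall off the diagonal of the confusion matrix."""
    total = sum(matrix.values())
    wrong = sum(n for (a, p), n in matrix.items() if a != p)
    return wrong / total


# Hypothetical labels: 'actual' is the class column of the learning sample,
# 'predicted' is what a model would return for the same instances.
actual = ["A", "A", "B", "B", "C", "C"]
predicted = ["A", "B", "B", "B", "C", "A"]

matrix, seconds = timed(confusion_matrix, actual, predicted)
rate = error_rate(matrix)

print(dict(matrix))
print(f"resubstitution error rate: {rate:.3f} (step took {seconds:.6f} s)")
```

In a real comparison, each step (loading, modeling, prediction) would be wrapped in the same way, so that the times of the two tools can be set side by side step by step.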

Keywords: scilab, toolbox, nan, linear discriminant analysis, R software, sipina, tanagra
Tutorial: en_Tanagra_Scilab_R_Comparison.pdf
Dataset: waveform_scilab_r.zip
References:
Scilab - https://www.scilab.org/en
Michaël Baudin, "Introduction to Scilab (in French)", Developpez.com.

Tuesday, January 7, 2014

Data Mining with Scilab

I have known the name "Scilab" for a long time (http://www.scilab.org/en). For me, it was a tool for numerical analysis, and it did not seem interesting in the context of statistical data processing and data mining. Recently, a mathematician colleague spoke to me about this tool. He was surprised by the low visibility of Scilab within the data mining community, given that it offers functionalities quite similar to those of the R software. I confess that I had not considered Scilab from this perspective. I decided to study it by setting a basic goal: is it possible to carry out a predictive analysis process simply with Scilab? Namely: loading a data file (learning sample), building a predictive model, obtaining a description of its characteristics, loading a test sample, applying the model to this second set of data, building the confusion matrix and calculating the test error rate.

We will see in this tutorial that the whole task was completed successfully and easily. Scilab is perfectly equipped for statistical processing. But two small drawbacks appear when getting started with Scilab: the library of statistical functions exists, but it is not as comprehensive as that of R; and its documentation is not very extensive at this time. Nevertheless, I am very satisfied with this first experience. I discovered an excellent free tool, flexible and efficient, very easy to pick up, which turns out to be a credible alternative to R in the field of data mining.
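The process described above (learning sample, model, test sample, confusion matrix, test error rate) can be sketched as follows. This is an illustration in plain Python, not the Scilab code of the tutorial. To keep it self-contained, a simple nearest-centroid classifier stands in for the methods mentioned in the keywords (linear discriminant analysis, libsvm); the data values and the function names `fit_centroids` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict


def fit_centroids(rows, labels):
    """Learn one mean vector (centroid) per class from the learning sample."""
    sums = {}
    counts = defaultdict(int)
    for row, label in zip(rows, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        for j, value in enumerate(row):
            sums[label][j] += value
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, row):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def sqdist(center):
        return sum((x - m) ** 2 for x, m in zip(row, center))
    return min(centroids, key=lambda label: sqdist(centroids[label]))


# Hypothetical learning sample: two descriptors, two classes.
train_x = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (3.8, 4.0)]
train_y = ["A", "A", "B", "B"]

# Hypothetical test sample, kept separate from the learning sample.
test_x = [(0.9, 1.1), (4.1, 3.9), (2.0, 2.0), (3.0, 3.5)]
test_y = ["A", "B", "B", "B"]

centroids = fit_centroids(train_x, train_y)             # build the model
test_pred = [predict(centroids, row) for row in test_x]  # apply it to the test set

confusion = Counter(zip(test_y, test_pred))              # (actual, predicted) counts
test_error = sum(a != p for a, p in zip(test_y, test_pred)) / len(test_y)

print(centroids)
print(dict(confusion))
print(f"test error rate: {test_error:.2f}")
```

The point of the separate test sample is that the error rate is measured on instances the model has never seen, which gives a more honest estimate than the resubstitution error.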

Keywords: scilab, toolbox, nan, libsvm, linear discriminant analysis, R software, predictive analytics
Tutorial: en_Tanagra_Scilab_Data_Mining.pdf
Dataset: data_mining_scilab.zip
References:
Scilab - https://www.scilab.org/fr
ATOMS: Homepage - http://atoms.scilab.org/

Thursday, January 2, 2014

Tanagra, tenth anniversary

First of all, let me offer you my best wishes for happiness, health and success in 2014.

For Tanagra, 2014 is of particular importance. Ten years ago, almost to the day, the first version of the software was put online. Originally designed as a tool for students and researchers in the data mining domain, the project has changed somewhat in nature in recent years. Today, Tanagra is an academic project that provides a point of access to statistical and data mining techniques. It is aimed at students, but also at researchers from other fields (psychology, sociology, archeology, etc.). I hope it makes the implementation of these techniques on real case studies clearer and more attractive.

This shift was accompanied by a refocusing of my activity. The Tanagra software is still evolving (we are at version 1.4.50): new methods are added and existing components are regularly improved. At the same time, I put the emphasis on documentation in the form of books, training materials and tutorials. The underlying idea is very simple: understanding the ins and outs of the methods is the best way to learn how to use the software that offers them.

Over the past 5 years (2009/01/01 to 2013/12/31), my site received 677 visits per day. The 10 countries from which visits come most often are: France, Morocco, Algeria, Tunisia, United States, India, Canada, Belgium, United Kingdom and Brazil. The page of materials for my data mining courses is the most visited (http://eric.univ-lyon2.fr/~ricco/cours/supports_data_mining.html; 99 visits per day, 6 minutes 35 seconds average time spent on the page). I also note with great satisfaction that the English pages are, overall, visited as much as the French ones. I think that the effort of writing documentation in English is bearing fruit.

I hope that this work will remain useful for a long time, and that 2014 will bring exchanges that are as rewarding as ever for everybody.

Ricco.