Tanagra - Data Mining and Data Science Tutorials: January 2018

Wednesday, January 3, 2018

Tanagra website statistics for 2017

The year 2017 ends, 2018 begins. I wish you all a very happy year 2018.

A short report on the website statistics for 2017. All the sites (Tanagra, course materials, e-books, tutorials) have been visited 222,293 times this year, i.e. 609 visits per day.

Since February 1, 2008, the date on which I installed the Google Analytics counter, there have been 2,33,371 visits (644 daily visits).

Who are you? The majority of visits come from France and the Maghreb. Then comes a large share from other French-speaking countries, notably because some pages are exclusively in French. Among non-francophone countries, we mainly observe the United States, India, the UK, Germany, ...

39 new course materials and tutorials were posted online this year: 18 in French, 21 in English.

The pages containing course materials about Data Science and Programming (R and Python) are the most popular ones. This is not really surprising.

Happy New Year 2018 to all.

Ricco.
Slideshow: Website statistics for 2017

Tuesday, January 2, 2018

Sparse data file format

The datasets processed by machine learning algorithms keep growing in size, especially when we deal with unstructured data. Data preparation (e.g. a bag-of-words representation in text mining) produces large data tables where the number of columns (descriptors) is often higher than the number of rows (observations), and where the table contains many zero values. In this context, storing all these zeros in the data file is wasteful. A lossless compression strategy is needed, and it must remain simple enough that the file is still readable with a text editor.

In this tutorial, we describe the sparse data file format handled by Tanagra (since version 1.4.4). It is based on the file format used by well-known machine learning libraries (svmlight, libsvm, libcvm). We show its use in a text categorization process applied to the Reuters database, which is well known in data mining. We will see that this kind of sparse format dramatically reduces the data file size.
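
To make the idea concrete, here is a minimal Python sketch of the encoding. It assumes the usual svmlight/libsvm convention, where each line holds a label followed by "index:value" pairs for the non-zero values only, with indices starting at 1. The exact variant read by Tanagra is the one described in the tutorial and may differ in its details. The function names and the toy data are hypothetical, chosen for illustration only.

```python
def dense_to_sparse_line(label, values):
    """Encode one observation as 'label index:value ...', skipping zeros.

    Assumption: 1-based column indices, as in svmlight/libsvm files.
    """
    pairs = [f"{i}:{v:g}" for i, v in enumerate(values, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


def sparse_line_to_dense(line, n_features):
    """Decode one sparse line back into (label, dense list of floats)."""
    label, *pairs = line.split()
    values = [0.0] * n_features
    for pair in pairs:
        index, value = pair.split(":")
        values[int(index) - 1] = float(value)  # back to 0-based position
    return label, values


# Toy bag-of-words counts: 3 documents, 8 terms (made-up data).
rows = [
    ("+1", [0, 2, 0, 0, 0, 1, 0, 0]),
    ("-1", [0, 0, 0, 3, 0, 0, 0, 0]),
    ("+1", [1, 0, 0, 0, 0, 0, 0, 4]),
]

# Sparse encoding: only the non-zero cells are written.
sparse_lines = [dense_to_sparse_line(label, values) for label, values in rows]

# Dense encoding of the same rows, for a size comparison.
dense_lines = [" ".join([label] + [str(v) for v in values]) for label, values in rows]

for line in sparse_lines:
    print(line)

print("dense characters: ", sum(len(line) for line in dense_lines))
print("sparse characters:", sum(len(line) for line in sparse_lines))
```

No information is lost: decoding a sparse line with the known number of columns gives back the original row. The saving grows with the proportion of zeros, which is why the format pays off on bag-of-words tables.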

Keywords: sparse dataset, dense dataset, attribute-value table, support vector machine, svm, libsvm, c-svc, logistic regression, tr-irls, scoring, roc curve, auc, area under curve
Components: VIEW DATASET, CONT TO DISC, UNIVARIATE DISCRETE STAT, SELECT FIRST EXAMPLES, C-SVC, SCORING, ROC CURVE
Tutorial: en_Tanagra_Sparse_File_Format.pdf
Dataset: reuters.data.zip
References:
T. Joachims, "SVMlight: Support Vector Machine".
UCI Repository, "Reuters-21578 Text Categorization Collection".