Tuesday, January 2, 2018

Sparse data file format

The datasets processed by machine learning algorithms keep growing in size, especially when we need to process unstructured data. Data preparation (e.g. a bag-of-words representation in text mining) creates large data tables where the number of columns (descriptors) is often higher than the number of rows (observations), and where many of the values are zero. In this context, storing all these zeros in the data file is wasteful. A lossless compression strategy is needed, and it must remain simple so that the file stays readable in a text editor.

In this tutorial, we describe the sparse data file format handled by Tanagra (from version 1.4.4 onward). It is based on the file format used by well-known machine learning libraries (svmlight, libsvm, libcvm). We show its use in a text categorization process applied to the Reuters database, which is well known in data mining. We will see that this kind of sparse format dramatically reduces the data file size.
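
To make the idea concrete, here is a minimal Python sketch of the encoding, assuming the usual svmlight/libsvm convention: one observation per line, the class label first, then `index:value` pairs for the non-zero columns only, with column indices starting at 1. The function names and the toy rows are hypothetical, and the exact variant read by Tanagra is described in the tutorial PDF, not here.

```python
def dense_to_sparse_line(label, values):
    """Encode one dense row as 'label idx:val idx:val ...' (1-based indices)."""
    # Keep only the non-zero cells; zeros are implied by their absence.
    pairs = [f"{i}:{v:g}" for i, v in enumerate(values, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


def sparse_line_to_dense(line, n_columns):
    """Decode a sparse line back into (label, dense row)."""
    label, *pairs = line.split()
    values = [0.0] * n_columns  # every column not listed is zero
    for pair in pairs:
        index, value = pair.split(":")
        values[int(index) - 1] = float(value)
    return label, values


# Toy bag-of-words rows (hypothetical data): label, then term counts.
rows = [
    ("+1", [0, 0, 2, 0, 0, 0, 1, 0]),
    ("-1", [0, 3, 0, 0, 0, 0, 0, 0]),
]
sparse_lines = [dense_to_sparse_line(label, values) for label, values in rows]
for line in sparse_lines:
    print(line)
```

Because only the non-zero cells are written, the line length depends on the number of terms present in a document rather than on the vocabulary size, and the file remains readable in a text editor. Decoding with `sparse_line_to_dense` recovers the original row exactly, so no information is lost.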

Keywords: sparse dataset, dense dataset, attribute-value table, support vector machine, svm, libsvm, c-svc, logistic regression, tr-irls, scoring, roc curve, auc, area under curve
Components: VIEW DATASET, CONT TO DISC, UNIVARIATE DISCRETE STAT, SELECT FIRST EXAMPLES, C-SVC, SCORING, ROC CURVE
Tutorial: en_Tanagra_Sparse_File_Format.pdf
Dataset: reuters.data.zip
References:
T. Joachims, "SVMlight: Support Vector Machine".
UCI Repository, "Reuters-21578 Text Categorization Collection".