Tuesday, November 3, 2009

Sipina - Supported file format

The data access is the first step of the data mining process. It is a crucial step. It is one of the main criteria used when we want to assess the quality of a tool. If we do not able to load a dataset, we cannot perform any kind of analysis. The software is not useable. If the data access is not easy and requires complicated operations, we will devote less time to the other steps of the data exploration.

The first goal of this tutorial is to describe the various file formats that are supported in Sipina. Some of the solutions are more deeply described in other tutorials elsewhere; we indicate the appropriate reference in these cases. The second goal is to describe the behavior of these formats when we handle a large dataset with 4,817,099 instances and 42 variables.

Last, we learn a decision tree on this dataset in order to evaluate the behavior of Sipina when we process a large data file.

Keywords: file format, data file importation, decision tree, large dataset, csv, arff, fdm, fdz, zdm
Tutorial: en_Sipina_File_Format.pdf
Dataset: weather.txt and kdd-cup-discretized-descriptors.txt.zip