Saturday, October 31, 2009

Importing Weka file (.arff) into Sipina

WEKA is a very popular Data Mining tool. It supplies a very large of machine learning methods. WEKA can handle various files. But it has a native format (.ARFF) which is a text file with additional specifications.

The text file format is very simple and very easy to manipulate. But, on the other hand, the processing of this kind of file is often slow, slower than binary file format. When we deal with a moderate size file, the text file is enough efficient. The differences between the time processing are not discernible.

In this tutorial, we show how to import the ARFF file format into Sipina. We subdivide the dataset into train and test samples. Then we learn and we assess a decision tree.

Keywords: decision tree, c4.5, file format, data file importation, weka, arff
Tutorial: en_sipina_weka_file_format.pdf
Dataset: ionosphere.arff
References:
M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutmann, I. Witten, "The Weka Data Mining Software: An Update", SIGKDD Explorations, Vol. 11, Issue 1, 2009.

Wednesday, October 28, 2009

Local sampling for decision tree learning

During the decision tree learning process, the algorithm detects the better variable according to a goodness of fit measure when it tries to split a node. The calculation can take a long time, particularly when it deals with a continuous descriptors for which it must detect the optimal cut point.

For all the decision tree algorithms, Sipina can use a local sampling option when it searches the best splitting attribute on a node. The idea is the following: on a node, it draws a random sample of size n, and then all the computations are made on this sample. Of course, if n is lower than the number of the existing examples on the node, Sipina uses all the available examples. It occurs when we have a very large tree with a high number of nodes.

We have described this approach in a paper (Chauchat and Rakotomalala, IFCS-2000) . We describe in this tutorial how to implement it with Sipina. We note in this tutorial that using a sample on each node enables to reduce dramatically the execution time without loss of accuracy.

We use a version of the WAVEFORM dataset with 21 continuous descriptors and 2,000,000 instances. We obtain the tree in 3 seconds on our computer.

Keywords : decision tree, sampling, large dataset
Components : SAMPLING, ID3, TEST
Tutorial : en_Sipina_Sampling.pdf
Dataset : wave2M.zip
Références :
J.H. Chauchat, R. Rakotomalala, « A new sampling strategy for building decision trees from large databases », Proc. of IFCS-2000, pp. 199-204, 2000.

Saturday, October 3, 2009

Tanagra - Version 1.4.33

Several logistic regression diagnostics and evaluation tools were implemented, one of them (reliability diagram) can be applied to any supervised method

1.The estimated covariance matrix
2. Hosmer - Lemeshow Test
3. Reliability diagram (says also calibration plot)
4. Analysis of residuals, outilers and influentials points (pearson residuals, deviance residuals, dfichisq, difdev, levier, Cook's distance, dfbeta, dfbetas)

A tutorial describing the utilization of these tools will be available soon.