Dealing with missing values is a difficult problem. Programming is not the issue: we simply mark a missing value with a specific code. The difficulty lies in how missing values are treated before or during data analysis.
Various techniques are available for handling missing values in SIPINA. In this tutorial, we show how to implement them and what their consequences are for decision tree learning (C4.5 algorithm; Quinlan, 1993).
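As a rough illustration of two of the techniques named in the keywords (listwise deletion and data imputation), here is a minimal Python sketch. It is not SIPINA's implementation: the use of None as the missing-value code, the function names and the toy rows are all assumptions made for the example.

```python
from statistics import mean

MISSING = None  # assumed code for a missing value (hypothetical choice)

def listwise_deletion(rows):
    """Keep only the rows in which no value is missing."""
    return [row for row in rows if all(v is not MISSING for v in row.values())]

def mean_imputation(rows, column):
    """Replace the missing values of a numeric column by the mean of its observed values."""
    observed = [row[column] for row in rows if row[column] is not MISSING]
    fill = mean(observed)
    return [
        dict(row, **{column: fill if row[column] is MISSING else row[column]})
        for row in rows
    ]

# Toy data invented for the example (not the tutorial's dataset).
rows = [
    {"age": 40, "weight": 70.0},
    {"age": None, "weight": 80.0},
    {"age": 60, "weight": None},
    {"age": 50, "weight": 90.0},
]

complete = listwise_deletion(rows)      # drops every row with a missing value
imputed = mean_imputation(rows, "age")  # keeps all rows, fills "age" only
```

Listwise deletion shrinks the learning sample, while imputation keeps every row but injects estimated values; the tree is then learned on different data in each case.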
Keywords: missing value, missing data, listwise deletion, casewise deletion, data imputation, C4.5, decision tree
Tutorial: en_Sipina_Missing_Data.pdf
Dataset: ronflement_missing_data.zip
References:
P.D. Allison, « Missing Data », in Quantitative Applications in the Social Sciences Series n°136, Sage University Paper, 2002.
J. Bernier, D. Haziza, K. Nobrega, P. Whitridge, « Handling Missing Data – Case Study », Statistical Society of Canada.
D. Garson, "Data Imputation for Missing Values"
Monday, November 9, 2009
Wednesday, November 4, 2009
Model deployment with Sipina
Model deployment is the last step of the Data Mining process. In its simplest form, for a supervised learning task, it consists of applying a predictive model to unlabeled cases.
Applying the model to unseen cases is a very useful functionality, but it is even more useful if we can also report the model's accuracy. A misclassification can have dramatic consequences, so we must measure the risk we take when we make decisions from a predictive model. An indication of the performance of a classifier is important when we decide whether or not to deploy it.
In this tutorial, we show how to apply a classifier to an unlabeled sample with Sipina. We also show how to estimate the generalization error rate using a resampling scheme such as the bootstrap.
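The resampling idea can be sketched in Python as follows. This is a simple out-of-bag bootstrap, chosen here for illustration; the scheme used by Sipina may differ (for instance a .632 variant). The toy learner, the function names and the simulated data are assumptions.

```python
import random

def fit_stump(sample):
    """Toy learner (assumption): threshold halfway between the two class means."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:
        constant = 0 if xs0 else 1  # only one class drawn: constant prediction
        return lambda x: constant
    m0, m1 = sum(xs0) / len(xs0), sum(xs1) / len(xs1)
    cut = (m0 + m1) / 2
    upper = 1 if m1 > m0 else 0  # class predicted above the threshold
    return lambda x: upper if x > cut else 1 - upper

def bootstrap_error(data, fit, n_rep=200, seed=0):
    """Average error on the cases left out of each bootstrap sample."""
    rng = random.Random(seed)
    n = len(data)
    errors = []
    for _ in range(n_rep):
        idx = [rng.randrange(n) for _ in range(n)]  # draw n cases with replacement
        chosen = set(idx)
        out_of_bag = [data[i] for i in range(n) if i not in chosen]
        if not out_of_bag:
            continue
        model = fit([data[i] for i in idx])
        wrong = sum(1 for x, y in out_of_bag if model(x) != y)
        errors.append(wrong / len(out_of_bag))
    return sum(errors) / len(errors)

# Simulated labeled sample: two overlapping Gaussian classes.
rng = random.Random(1)
data = [(rng.gauss(0, 1), 0) for _ in range(100)] + [(rng.gauss(2, 1), 1) for _ in range(100)]
error_rate = bootstrap_error(data, fit_stump)
```

The model that is deployed is still the one learned on the whole labeled sample; the bootstrap only supplies an estimate of how often it is likely to be wrong on unseen cases.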
Keywords: model deployment, unseen cases, unlabeled instances, decision tree, sipina, linear discriminant analysis
Tutorial: en_sipina_deployment.pdf
Dataset: wine_deployment.xls
References:
Tanagra Tutorials, "Applying a classifier on a new dataset (Deployment)"
Monday, November 2, 2009
Sipina - Supported file format
Data access is the first step of the data mining process. It is a crucial step, and one of the main criteria used to assess the quality of a tool. If we are not able to load a dataset, we cannot perform any kind of analysis and the software is not usable. If data access is not easy and requires complicated operations, we have less time to devote to the other steps of the data exploration.
The first goal of this tutorial is to describe the various file formats that are supported in Sipina. Some of the solutions are described in more detail in other tutorials; we indicate the appropriate reference in these cases. The second goal is to describe the behavior of these formats when we handle a large dataset with 4,817,099 instances and 42 variables.
Last, we learn a decision tree on this dataset in order to evaluate the behavior of Sipina when it processes a large data file.
Keywords: file format, data file importation, decision tree, large dataset, csv, arff, fdm, fdz, zdm
Tutorial: en_Sipina_File_Format.pdf
Dataset: weather.txt and kdd-cup-discretized-descriptors.txt.zip
Friday, October 30, 2009
Importing Weka file (.arff) into Sipina
WEKA is a very popular Data Mining tool. It supplies a very large number of machine learning methods. WEKA can handle various file formats, but it also has a native format (.ARFF), which is a text file with additional specifications.
The text file format is very simple and very easy to manipulate. On the other hand, processing this kind of file is often slower than processing a binary file format. When we deal with a moderate-size file, the text file is efficient enough: the differences in processing time are not discernible.
In this tutorial, we show how to import an ARFF file into Sipina. We subdivide the dataset into train and test samples. Then we learn and assess a decision tree.
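To make the "text file with additional specifications" concrete, here is a minimal Python sketch that reads a small subset of the ARFF format (the "%" comment lines, the "@attribute" declarations and dense comma-separated rows after "@data", with no quoted or sparse values) and then splits the rows into train and test samples. It is not Sipina's importer; the function names and the toy file content are assumptions.

```python
import random

def read_arff(lines):
    """Parse a simplified ARFF text: attribute names and dense data rows."""
    names, rows, in_data = [], [], False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@attribute"):
            names.append(line.split()[1])     # second token is the attribute name
        elif lower.startswith("@data"):
            in_data = True                    # everything after this is data
    return names, rows

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Toy content invented for the example (not the ionosphere data).
sample = """% toy file
@relation toy
@attribute a01 numeric
@attribute a02 numeric
@attribute class {g,b}
@data
1,0.5,g
0,-0.2,b
1,0.9,g
1,0.1,b
""".splitlines()

names, rows = read_arff(sample)
train, test = train_test_split(rows, train_fraction=0.5)
```

The header lines are what distinguish ARFF from a plain delimited text file: they carry the attribute names and types that a plain text file would leave the tool to guess.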
Keywords: decision tree, c4.5, file format, data file importation, weka, arff
Tutorial: en_sipina_weka_file_format.pdf
Dataset: ionosphere.arff
References:
M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutmann, I. Witten, "The Weka Data Mining Software: An Update", SIGKDD Explorations, Vol. 11, Issue 1, 2009.
Wednesday, October 28, 2009
Local sampling for decision tree learning
During the decision tree learning process, when the algorithm tries to split a node, it selects the best variable according to a goodness-of-fit measure. The calculation can take a long time, particularly with continuous descriptors, for which it must also detect the optimal cut point.
For all the decision tree algorithms, Sipina can use a local sampling option when it searches for the best splitting attribute on a node. The idea is the following: on a node, it draws a random sample of size n, and then all the computations are made on this sample. Of course, if the node contains fewer than n examples, Sipina uses all the available examples. This occurs when we have a very large tree with a high number of nodes.
We have described this approach in a paper (Chauchat and Rakotomalala, IFCS-2000). In this tutorial, we describe how to implement it with Sipina, and we observe that using a sample on each node dramatically reduces the execution time without loss of accuracy.
We use a version of the WAVEFORM dataset with 21 continuous descriptors and 2,000,000 instances. We obtain the tree in 3 seconds on our computer.
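The local sampling idea described above can be sketched in Python for a single continuous descriptor. This is only an illustration under stated assumptions: the split criterion is the entropy gain, the cut-point search is deliberately naive, and the function names, sample size and simulated node are invented for the example; it is not Sipina's code.

```python
import math
import random

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_cut(examples):
    """Return (gain, threshold) of the best binary cut; examples are (x, label) pairs."""
    examples = sorted(examples)
    labels = [y for _, y in examples]
    base, n = entropy(labels), len(examples)
    best = (0.0, None)
    for i in range(1, n):
        if examples[i - 1][0] == examples[i][0]:
            continue  # no cut between two identical values
        gain = (base
                - (i / n) * entropy(labels[:i])
                - ((n - i) / n) * entropy(labels[i:]))
        if gain > best[0]:
            best = (gain, (examples[i - 1][0] + examples[i][0]) / 2)
    return best

def split_on_node(examples, n_sample, rng):
    """Search the cut point on a random sample of the node when the node is large."""
    if len(examples) > n_sample:
        examples = rng.sample(examples, n_sample)
    return best_cut(examples)  # small node: all available examples are used

rng = random.Random(0)
# Simulated node with 20,000 examples from two classes.
node = ([(rng.gauss(0, 1), 0) for _ in range(10000)]
        + [(rng.gauss(3, 1), 1) for _ in range(10000)])
gain, cut = split_on_node(node, 500, rng)
```

The cost of the search on a node now depends on the sample size rather than on the number of examples reaching the node, which is where the time saving comes from.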
Keywords: decision tree, sampling, large dataset
Components: SAMPLING, ID3, TEST
Tutorial: en_Sipina_Sampling.pdf
Dataset: wave2M.zip
References:
J.H. Chauchat, R. Rakotomalala, « A new sampling strategy for building decision trees from large databases », Proc. of IFCS-2000, pp. 199-204, 2000.
Friday, October 2, 2009
Tanagra - Version 1.4.33
Several logistic regression diagnostics and evaluation tools have been implemented. One of them (the reliability diagram) can be applied to any supervised method.
1. The estimated covariance matrix
2. Hosmer-Lemeshow test
3. Reliability diagram (also called calibration plot)
4. Analysis of residuals, outliers and influential points (Pearson residuals, deviance residuals, difchisq, difdev, leverage, Cook's distance, dfbeta, dfbetas)
A tutorial describing the utilization of these tools will be available soon.
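As an illustration of one of the tools listed above, the numbers behind a reliability diagram can be computed as in the following Python sketch. It assumes equal-width bins on the predicted probability; the function name, the number of bins and the toy scores are assumptions, and Tanagra's own component may proceed differently.

```python
def reliability_table(probs, labels, n_bins=5):
    """Per non-empty bin: (mean predicted probability, observed positive rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[k].append((p, y))
    table = []
    for content in bins:
        if content:
            mean_p = sum(p for p, _ in content) / len(content)
            rate = sum(y for _, y in content) / len(content)
            table.append((mean_p, rate, len(content)))
    return table

# Toy scores and 0/1 labels invented for the example.
probs = [0.1, 0.1, 0.3, 0.3, 0.9, 0.9]
labels = [0, 0, 0, 1, 1, 1]
table = reliability_table(probs, labels)
```

Plotting the observed rate against the mean predicted probability gives the diagram; a well-calibrated classifier yields points close to the diagonal. Since only scores and labels are needed, the same computation applies to any supervised method that outputs class probabilities.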
Monday, September 28, 2009
Using batch mode for Tanagra
For large simulations, it is more convenient to use the BATCH mode capabilities of Tanagra than to open an interactive session. This is the case, for instance, when we compare the performance of various algorithms on the same dataset, when we try to find the best parameters for a learning method automatically, or when we repeat the same treatment on different datasets. In these contexts, it is more useful to save the diagrams in text mode (.TDM file format), which is easier to handle outside TANAGRA, with a text editor for instance.
In this tutorial, we compare the performance of the naive Bayes classifier with and without a feature selection process. We know that the naive Bayes classifier is highly sensitive to irrelevant features. The goal of this tutorial is to evaluate the efficiency of the FCBF feature selection method in this context.
Keywords: batch mode, supervised learning, naive bayes, feature selection, experiments
Components: NAIVE BAYES, FCBF, CROSS VALIDATION
Tutorial: english_dr_utiliser_tanagra_en_mode_batch.pdf
Dataset: tanagra_batch_execution.zip