Tanagra - Data Mining and Data Science Tutorials: October 2008

Friday, October 31, 2008

Performance evaluation using a predefined test set

TANAGRA, ORANGE and WEKA: Comparison of learning algorithms using a predefined learning and test set.

Very often, we use accuracy to compare the performance of learning algorithms, and we select the most accurate method. For the comparison to be rigorous, every algorithm must be trained on the same learning set and evaluated on the same test set.

In this tutorial, we show how to implement this process in three data mining tools: TANAGRA, ORANGE and WEKA. We compare the performance of an SVM (linear kernel), a logistic regression and a decision tree.
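The protocol can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two learners (a majority-class rule and a 1-nearest-neighbour rule) are toy stand-ins for the SVM, logistic regression and decision tree of the tutorial, and the data and all function names are invented for the example. The point is that each learner is fitted on the same learning set and scored on the same test set.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of test examples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_majority(X, y):
    """Toy learner (assumption): always predict the most frequent training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda rows: [label for _ in rows]

def fit_1nn(X, y):
    """Toy learner (assumption): label of the closest training example."""
    def predict(rows):
        out = []
        for r in rows:
            dists = [sum((a - b) ** 2 for a, b in zip(r, x)) for x in X]
            out.append(y[dists.index(min(dists))])
        return out
    return predict

def compare(learners, X_train, y_train, X_test, y_test):
    """Fit every learner on the same learning set, score it on the same test set."""
    results = {}
    for name, fit in learners.items():
        predict = fit(X_train, y_train)
        results[name] = accuracy(y_test, predict(X_test))
    return results

# Invented data: one numeric descriptor, two classes.
X_train = [[1.0], [2.0], [3.0], [8.0], [9.0]]
y_train = ["neg", "neg", "neg", "pos", "pos"]
X_test = [[1.5], [2.5], [7.5], [8.5]]
y_test = ["neg", "neg", "pos", "pos"]

results = compare({"majority": fit_majority, "1-nn": fit_1nn},
                  X_train, y_train, X_test, y_test)
best = max(results, key=results.get)
print(results, best)
```

Because the split is fixed before any learner is run, a difference in accuracy reflects the methods and not a lucky partition.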

Keywords: supervised learning, decision tree, svm, logistic regression, classifier assessment, train and test set, orange, weka
Components: Select examples, Supervised learning, Binary logistic regression, C-RT, C-SVC, Test
Tutorial: en_Tanagra_TOW_Predefined_Test_Set.pdf
Dataset: breast_tow.zip

Learning classification tree - Software comparison

Learning a decision tree with TANAGRA, ORANGE and WEKA. Estimation of the error rate using cross-validation.

When we build a decision tree from a dataset, we must follow these steps (not necessarily in this order):
• Import the dataset into the software;
• Select the class attribute (TARGET) and the descriptors (INPUT);
• Choose the induction algorithm; depending on the implementation, we can obtain slightly different results;
• Run the learning process and view the decision tree;
• Use cross-validation in order to obtain an honest error rate estimate.

In this tutorial, we show how to perform these operations using various free software.
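The cross-validation step can be sketched as follows. This is a minimal illustration, not the procedure used inside TANAGRA, ORANGE or WEKA: the fold assignment, the function names and the majority-class learner standing in for the decision tree are all assumptions made for the example.

```python
import random
from collections import Counter

def kfold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_majority(X, y):
    """Toy learner (assumption): always predict the most frequent training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda rows: [label for _ in rows]

def cross_val_error(fit, X, y, k=5, seed=0):
    """Each example is predicted once, by a model that never saw it during training."""
    errors = 0
    for held_out in kfold_indices(len(X), k, seed):
        held = set(held_out)
        X_tr = [x for i, x in enumerate(X) if i not in held]
        y_tr = [t for i, t in enumerate(y) if i not in held]
        predict = fit(X_tr, y_tr)
        preds = predict([X[i] for i in held_out])
        errors += sum(p != y[i] for p, i in zip(preds, held_out))
    return errors / len(X)

# Invented data: 10 examples, 7 of class "a" and 3 of class "b".
X = [[float(i)] for i in range(10)]
y = ["a"] * 7 + ["b"] * 3

folds = kfold_indices(len(X), 5)
error_rate = cross_val_error(fit_majority, X, y, k=5)
print(error_rate)
```

The estimate is called honest because no example contributes to the error count of a model that was trained on it.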

Keywords: supervised learning, decision tree, classification tree, classifier assessment, performance evaluation, resampling method, cross-validation, orange, weka
Components: Supervised learning, C-RT, Cross validation
Tutorial: en_Tanagra_TOW_Decision_Tree.pdf
Dataset: heart.txt
References:
R. Rakotomalala, " Arbres de décision ", Revue Modulad, 33, 163-187, 2005 (tutoriel_arbre_revue_modulad_33.pdf) (fr)
Wikipedia, "Decision tree learning"

Computing ROC Curve

TANAGRA, ORANGE and WEKA are free data mining tools. Each represents the sequence of processing steps as a stream diagram or a knowledge flow. There are some small differences between them. Nevertheless, we show that, in spite of these differences, they often handle the same problems and present results in a very similar way. In this tutorial, we build a ROC curve from a logistic regression.

Whatever software we use, commercial tools included, we have to go through the following steps to build a ROC curve.
• Import the dataset into the software;
• Compute descriptive statistics;
• Select target and input attributes;
• Select the “positive” value of the target attribute;
• Split the dataset into a learning set (e.g. 70%) and a test set (30%);
• Choose the learning algorithm. Be careful: implementations can differ between tools and give slightly different results;
• Build the prediction model on the learning set and visualize the results;
• Build the ROC curve on the test set.

Depending on the software, the order of the steps can differ, but it is clear that we must, explicitly or not, go through all of them.
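The last step, building the ROC curve on the test set, can be sketched as follows. We assume that a model has already been fitted on the learning set and has assigned a score to each test example (higher meaning more likely “positive”); the labels, scores and function names below are invented for the illustration and are not taken from any of the three tools.

```python
def roc_points(y_true, scores, positive):
    """Return the (false positive rate, true positive rate) points of the ROC curve."""
    n_pos = sum(1 for t in y_true if t == positive)
    n_neg = len(y_true) - n_pos
    # Rank the test examples from the highest score to the lowest.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with tied scores are processed together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == positive:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve, by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Invented test-set labels and scores.
y_test = ["pos", "pos", "neg", "pos", "neg", "neg"]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_test, scores, positive="pos")
area = auc(points)
print(points, area)
```

The curve always starts at (0, 0) and ends at (1, 1); the choice of the “positive” value determines which class counts as a true positive.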

Keywords: supervised learning, roc curve, classifier assessment, orange, weka, training set, learning set, test set
Components: MORE UNIVARIATE CONT STAT, SAMPLING, SUPERVISED LEARNING, LOG-REG TRIRLS, SCORING, ROC CURVE
Tutorial: en_Tanagra_Orange_Weka_Roc_curve.pdf
Dataset: ds1_10.zip
References:
R. Rakotomalala – « Courbe ROC » (fr)
T. Fawcett – « ROC Graphs: Notes and Practical Considerations for Researchers »

Handling a weka file format (*.arff)

WEKA is a popular free data mining package. It includes a large number of methods, mainly organized around supervised and unsupervised learning.

WEKA has its own file format (*.arff), a text format with a dedicated header that documents the variables. Importing an ARFF file is easy, since we already know how to handle a text file.

In this tutorial, we show how to import an ARFF file into TANAGRA. When there are missing values, very simple substitution strategies are used: the mean replaces missing values of continuous variables, and a new category is added for discrete variables.
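The two substitution strategies can be sketched as follows. ARFF files mark a missing value with "?"; everything else here is an assumption made for the illustration, in particular the function name and the label "missing" given to the new category (the label TANAGRA actually uses is not stated in this post).

```python
def impute_column(values, kind, missing="?", new_label="missing"):
    """Replace missing cells in one column.

    kind == "continuous": use the mean of the known values.
    otherwise (discrete): map missing cells to an extra category.
    """
    if kind == "continuous":
        known = [float(v) for v in values if v != missing]
        mean = sum(known) / len(known)
        return [mean if v == missing else float(v) for v in values]
    return [new_label if v == missing else v for v in values]

# Invented columns, as raw strings read from a text file.
ages = impute_column(["20", "?", "40"], "continuous")
sexes = impute_column(["F", "?", "M"], "discrete")
print(ages, sexes)
```

Both strategies keep every row in the dataset, at the cost of distorting the distribution of the affected variable.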

A diagram is created automatically and processing can start as usual. If we save the diagram in the TDM format, the reference to the file is saved with it. The next time the diagram is loaded, the ARFF file is imported automatically, without any specific manipulation.

Keywords: WEKA, ARFF file format, data file importation
Components: DATASET
Tutorial: en_Tanagra_Handle_WEKA_File.pdf
Dataset: sick.arff

Import tab-separated text file (csv)

TANAGRA can handle the text file format. Columns are delimited by tabulations, and the first line contains the names of the variables. In this tutorial, we show how to: (1) prepare this type of file, from a spreadsheet for instance; (2) import the data by creating a new diagram in Tanagra.

Caution: the decimal separator of continuous variables depends on the configuration of your system. Also, Tanagra does not handle missing values.
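To make the file layout and the decimal separator issue concrete, here is a small Python sketch that reads such a file outside of Tanagra. It is an illustration only: the function name, the `decimal` parameter and the sample rows are invented, and it says nothing about how Tanagra itself parses the file.

```python
import csv
import io

def read_tab_separated(stream, decimal="."):
    """Read a tab-delimited text file whose first line holds the variable names.

    Cells that parse as numbers become floats; all other cells stay as text.
    `decimal` is the decimal separator used in the file ("," on some systems).
    """
    reader = csv.reader(stream, delimiter="\t")
    header = next(reader)
    rows = []
    for record in reader:
        row = {}
        for name, cell in zip(header, record):
            text = cell.replace(decimal, ".") if decimal != "." else cell
            try:
                row[name] = float(text)
            except ValueError:
                row[name] = cell
        rows.append(row)
    return header, rows

# Invented sample written with a decimal comma.
sample = "outlook\ttemperature\tplay\nsunny\t29,5\tno\nrainy\t18,0\tyes\n"
header, rows = read_tab_separated(io.StringIO(sample), decimal=",")
print(header, rows)
```

Reading the same sample with the default `decimal="."` would leave `temperature` as text, which is the kind of silent type change the caution above warns about.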

Text files are preferable when processing big datasets, i.e. on the order of several hundred thousand rows. For moderate-sized files (several tens of thousands of individuals), it is better to use the Excel format.

Keywords: data file importation, text file format
Components: Dataset
Tutorial: enImportDataset.pdf
Dataset: weather.xls

OOo Calc file handling using an add-in

Just as it is possible to transfer data directly from EXCEL to TANAGRA using an add-in (see Excel Add-in), we have developed an add-in for the Open Office Calc spreadsheet.

The procedure is the same. When TANAGRA is installed, the add-in is automatically copied to the disk. We then have to plug it into Open Office Calc by following a specific procedure. The add-in has been available since TANAGRA version 1.4.12. It was developed and tested with Open Office version 2.1.0 (French).

This tutorial describes how to install the add-in in Open Office Calc. Two documents are available: (1) a classical PDF that describes the steps with screenshots; (2) an animated tutorial (Adobe Flash format, in English, but the menus are all in French).

Keywords: data file importation, open office calc spreadsheet, add-in
Components: View Dataset
Tutorial: en_Tanagra_OOoCalc_Addon.pdf
Animated tutorial: from_OOoCalc_To_Tanagra.htm
Dataset: breast.ods

Excel file format - Direct importation

TANAGRA can import an XLS file natively, i.e. without the Excel spreadsheet software being installed on your computer. XLS versions 1997 to 2003 are supported (compatibility with Excel 2007 remains to be confirmed).

There are some minor restrictions on the workbook configuration: (1) the data must be located in the first worksheet of the workbook; (2) there must be no empty columns to the left of the data and no blank rows above it; in other words, the data must start at the first row and first column; (3) TANAGRA considers that the first row contains the names of the variables.

In this tutorial, we show how to create a new diagram by directly importing an Excel file.

Keywords: data file importation, excel spreadsheet
Components: Group characterization
Tutorial: en_Tanagra_Handle_Spreadsheet_File.pdf
Dataset: adult_dataset.zip

Excel file handling using an add-in

Starting with Tanagra version 1.4.11, a new EXCEL add-in is available. It makes it possible to define a data mining analysis directly from an EXCEL spreadsheet without closing the EXCEL session.

The main asset of this functionality is that we can perform all the data preparation (data transformation, feature construction, etc.) and basic descriptive statistics (mean, standard deviation, pivot table, etc.) in the spreadsheet. We then call TANAGRA, from EXCEL, only for advanced machine learning techniques.

In this tutorial, we show how to install and use this EXCEL add-in.

Keywords: data file importation, excel format, excel add-in
Components: Univariate Discrete stat
Tutorial: en_Tanagra_Excel_AddIn.pdf
Dataset: add_in_dataset.zip