Tanagra - Data Mining and Data Science Tutorials

Sunday, July 17, 2011

Tanagra add-on for OpenOffice Calc 3.3

Tanagra add-on for OpenOffice 3.3 and LibreOffice 3.4.

The connection with spreadsheet applications is certainly a factor of success for Tanagra. It is easy to manipulate a dataset into OpenOffice Calc (up to version 3.2) and send it to Tanagra using the TanagraLibrary.zip extension for further analysis .

Recently, users have reported to me that the mechanism did not work with recent versions of OpenOffice (version 3.3) and LibreOffice (version 3.4). I realized that, rather than a correction, it was more appropriate to elaborate a new module which meets the standard for managing extensions of these tools. The new library "TanagraModule.oxt" is now incorporated into the distribution.

This tutorial describes how to install and to use this add-on under OpenOffice Calc 3.0. The adaptation to LibreOffice 3.4 is very easy.

Keywords : data importation, spreadsheet application, openoffice, libreoffice, add-in, add-on, excel
Component : View Dataset
Tutorial : en_Tanagra_Addon_OpenOffice_LibreOffice.pdf
Dataset : breast.ods
Références :
Tutoriel Tanagra, "OOo Calc file handling using an add-in"
Tutoriel Tanagra, "Launching Tanagra from OOo Calc under Linux"

Tuesday, July 5, 2011

Tanagra - Version 1.4.40

Few improvements for this new version.

A new addon for the connection between Tanagra and the recent version of OpenOffice Calc spreadsheet has been created. The old one did not work for recent versions - OpenOffice 3.3 and LibreOffice 3.4. During the installation process, another library was added ("TanagraModule.oxt") to not interfere with the old, still functional for previous versions of Open Office (3.2 and earlier). A tutorial describing its installation and its utilization will be put online soon. I take this opportunity to highlight again how a privileged connection between a spreadsheet and a specialized tool for Data Mining is convenient. The annual poll organized by the kdnuggets.com website shows the interest of this connection (2011, 2010, 2009,...). We note that there is a similar addon for the R software (R4Calc). This change was suggested by Jérémy Roos (OpenOffice) and Franck Thomas (LibreOffice).

The non-standardized ACP is now available. It is possible to implement unchecking the option of standardization of the data in the Principal Component Analysis component. Change suggested by Elvire Antanjan.

Simultaneous regression was introduced. It is very similar to the method programmed into LazStats, which is unfortunately more accessible freely now. The approach is described in a free booklet online "Practice of linear regression analysis" (in French) (section 3.6).

The color codes according to the p-value have been introduced for the Linear Correlation component. Change suggested by Samuel KL.

Once again, thank you very much to all those who help me to improve this work by their comments or suggestions.

Donwload page : setup

Thursday, May 26, 2011

Tanagra - Version 1.4.39

Some minor corrections for the Tanagra 1.4.39 version.

For the PCA (principal component analysis) component, when we ask all the factors, none are generated. Reported by Jérémy Roos.

In the previous 1.4.38 version, the results of Multinomial Logistic Regression are not consistent with the tutorial on the website. The calculations are wrong. Reported by Nicole Jurado.

It is now possible to obtain the scores from the PLS-DA component (Partial Least Squares Regression - Discriminant Analysis). Reported by Carlos Serrano.

All these bugs are corrected in the 1.4.39 version. Once again, thank you very much to all those who help me to improve this work by their comments or suggestions.

Donwload page : setup

Tuesday, April 5, 2011

Mining Association Rule from Transactions File

Association rule learning is a popular method for discovering interesting relations between variables in large databases. It was often used in market basket analysis domain. But in fact, it can be implemented in various areas where we want to discover the associations between variables. The association is described by a "IF THEN" rule. The IF part is called "antecedent" of the rule; the THEN part correspond to the "consequent" e.g. IF onions AND potatoes THAN burger (http://en.wikipedia.org/wiki/Association_rule_learning) i.e. if a customer buys onions and potatoes then he buys also burger.

It is possible to find co-occurrences in the standard attribute - value tables that are handled with the most of the data mining tools. In this context, the rows correspond to the baskets (transactions); the columns correspond to the list of all possible products (items); at the intersection of the row and the column, we have an indicator (true/false or 1/0) which indicates if the item belongs to the transaction. But this kind of representation is too naive. A few products are incorporated in each basket. Each row of the table contains a few 1 and many 0. The size of the data file is unnecessarily excessive. Therefore, another data representation, says "transactions file", is often used to minimize the data file size. In this tutorial, we treat a special case of the transactions file. The principle is based on the enumeration of the items included in each transaction. But in our case, we have only two values for each row of the data file: the transaction identifier, and the item identifier. Thus, each transaction can be listed on several rows of the data file.

This data representation is quite natural considering the problem we want to treat. It also has the advantage of being more compact since only the items really present in each transaction are enumerated. However, it appears that many tools do not know manage directly this kind of data representation. We observe curiously a distinction between professional tools and the academic ones. The first ones can handle directly this kind of data file without special data preparation. This is the case of SPAD 7.3 and SAS Enterprise Miner 4.3 that we study in this tutorial. On the other hand, the academic tools need a data transformation, prior the importation of the dataset. We use a small program written in VBA (Visual Basic for Applications) under Excel to prepare the dataset. Thereafter, we perform the analysis with Tanagra 1.4.37 and Knime 2.2.2 (Note: a reader told me that we can transform the dataset with Knime without the utilization of external program. This is true. I will describe this approach in a separate section at the end of this tutorial).

Attention, we must respect the original specifications i.e. focus only on rules indicating the simultaneous presence of items in transactions. We must not, consecutively to a bad "presence - absence" coding scheme, to generate rules outlining the simultaneous absence of some items. This may be interesting in some cases may be, but this is not the purpose of our analysis.

Keywords: association rules, a priori algorithm
Components: A priori
Tutorial: en_Tanagra_Assoc_Rule_Transactions.pdf
Dataset: assoc_rule_transactions.zip
References:
Tanagra Tutorials, "Association rule learning from transaction dataset"
P.N. Tan, M. Steinbach, V. Kumar, « Introduction to Data Mining », Addison Wesley, 2006 ; chapitre 6, « Association analysis : Basic Concepts and Algorithms ».
Wikipedia - "Association rule learning"

Sunday, February 20, 2011

Multiple Regression - Reading the results

The aim of the multiple regression is to predict the values of a continuous dependent variable Y from a set of continuous or binary independent variables (X1,..., Xp).

In this tutorial, we want to model the relationship between the cars consumption and their weight, engine-size and horsepower. We describe the outputs of Tanagra by associating them with the used formulas. We highlight the importance of the unscaled covariance matrix of the estimated coefficients [(X'X)-1] (Tanagra 1.4.38 and later). It is used for the subsequent analysis: individual significance of coefficients, simultaneous significance of several coefficients, testing linear combinations of coefficients, computation of the standard error for the prediction interval. These analyses are performed into the Excel spreadsheet.

Thereafter, we perform the same analyses with the R software. We identify the objects provided by the lm(.) procedure that we can use in the same context.

Keywords: linear regression, multiple regression, R software, lm, summary.lm, testing significance, prediction interval
Components: MULTIPLE LINEAR REGRESSION
Tutorial: en_Tanagra_Multiple_Regression_Results.pdf
Dataset: cars_consumption.zip
References :
D. Garson, "Multiple Regression"

Friday, February 4, 2011

Tanagra - Version 1.4.38

Some minor corrections for the Tanagra 1.4.38 version.

The color codes for the normality tests have been harmonized (Normality Test). In some configurations, the colors associated with p-values were not consistent, it could misleading the users. This problem has been reported by Lawrence M. Garmendia.

Following indications from Mr. Oanh Chau, I realized that the standardization of variables to the HAC (hierarchical agglomerative clustering) was based on the sample standard deviation. This is not an error in itself. But the sum of index of level into the dendrogram does not consistent with the TSS (total sum of squares). This is unwelcome. The difference is especially noticeable on small dataset, it disappears when the dataset size increases. The correction has been introduced. Now the BSS ratio is equal to 1 when we have the trivial partition i.e. one individual per group.

Multiple linear regression (MULTIPLE LINEAR REGRESSION) displays the matrix (X'X) ^ (-1). It allows to deduce the variance covariance matrix of coefficients (by multiplying the matrix by the estimated variance of the error). It can be also used in the generalized tests for the model coefficients.

Last, the outputs of the descriptive discriminant analysis (CANONICAL DISCRIMINANT ANALYSIS) were improved. The group centroids (Group centroids) on the factorial axes are directly provided.

Thank you very much to all those who help me to improve this work by their comments or suggestions.

Download page: setup

Tuesday, January 4, 2011

Tanagra website statistics for 2010

The year 2010 ends, 2011 begins. I wish you all a very happy year 2011.

A small statistical report on the website statistics for the past year. All sites (Tanagra, course materials, e-books, tutorials) has been visited 241,765 times this year, 662 visits per day. For comparison, we had 520 daily visits in 2009 and 349 in 2008.

Who are you? The majority of visits come from France and Maghreb (62%). Then there are a large part of French speaking countries. In terms of non-francophone countries, we observe mainly the United States, India, UK, Germany, Brazil,...

Which pages are visited? The pages that are most successful are those that relate to documentation about the Data Mining: course materials, tutorials, links to other documents available on line, etc.. This is hardly surprising. I take more time myself to write booklets and tutorials, to study the behavior of different software, of which Tanagra.

Happy New Year 2011 to all.

Ricco.
Slideshow: Website statistics for 2010

Thursday, December 9, 2010

Creating reports with Tanagra

The ability to create automatically reports from the results of an analysis is a valuable functionality for Data Mining. But this is rather an asset to the professional tools. The programming of this kind of functionality is not really promoted in the academic domain. I do not think that I can publish a paper in a journal where I describe the ability of Tanagra to create attractive reports. This is the reason for which the output of the academic tools, such as R or Weka, is mainly in a formatted text shape.

Tanagra, which is an academic tool, provides also text outputs. The programming remains simple if we see at a glance the source code. But, in order to make the presentation more attractive, it uses the HTML to format the results. I take advantage of this special feature to generate reports without making a particular programming effort. Tanagra is one of the few academic tools to be able to produce reports that can easily be displayed in office automation software. For instances, the tables can be copied into Excel spreadsheets for further calculations. More generally, the results can be viewed in a browser, regardless of data mining software.

These are the reporting features of Tanagra that we present in this tutorial.

Keywords: reporting, decision tree, c4.5, logistic regression, binary coding, roc curve, learning sample, test sample, forward, feature selection
Components: GROUP CHARACTERIZATION, SAMPLING, C4.5, TEST, O_1_BINARIZE, FORWARD-LOGIT, BINARY LOGISTIC REGRESSION, SCORING, ROC CURVE
Tutorial: en_Tanagra_Reporting.pdf
Dataset: heart disease

Wednesday, November 24, 2010

Multithreading for decision tree induction

Nowadays, much of modern personal computers (PC) have multicore processors. The computer operates as if it had multiple processors. Software and data mining algorithms must be modified in order to benefit of this new feature.

Currently, few free tools exploit this opportunity because it is impossible to define a generic approach that would be valid regardless of the learning method used. We must modify each existing learning algorithm. For a given technique, decomposing an algorithm into elementary tasks that can execute in parallel is a research field in itself. In a second step, we must adopt a programming technology which is easy to implement.

In this tutorial, I propose a technology based on threads for the induction of decision trees. It is well suited in our context for various reasons. (1) It is easy to program with the modern programming languages. (2) Threads can share information; they can also modify common objects. Efficient synchronization tools enable to avoid data corruption. (3) We can launch multiple threads on a mono-core and mono-processor system. It is not really advantageous, but at least the system does not crash. (4) On a multiprocessor or multi-core system, the threads will actually run at the same time, with each processor or core running a particular thread. But, because of the necessity of synchronization between threads, the computation time is not divided by the number of cores in this case.

First, we briefly present the modification of the decision tree learning algorithm in order to benefit of the multithreading technology. Then, we show how to implement the approach with SIPINA (version 3.5 and later). We show also that the multithreaded decision tree learners are available in various tools such as Knime 2.2.2 or RapidMiner 5.0.011. Last, we study the behavior of the multithreaded algorithms according to the dataset characteristics.

Keywords: multithreading, thread, threads, decision tree, chaid, sipina 3.5, knime 2.2.2, rapidminer 5.0.011
Tutorial: en_sipina_multithreading.pdf
Dataset: covtype.arff.zip
References :
Wikipedia, "Decision tree learning"
Wikipedia, "Thread (Computer science)"
Aldinucci, Ruggieri, Torquati, " Porting Decision Tree Algorithms to Multicore using FastFlow ", Pkdd-2010.

Thursday, November 11, 2010

Naive bayes classifier for continuous predictors

The naive bayes classifier is a very popular approach even if it is (apparently) based on an unrealistic assumption: the distributions of the predictors are mutually independent conditionally to the values of the target attribute. The main reason of this popularity is that the method proved to be as accurate as the other well-known approaches such as linear discriminant analysis or logistic regression on the majority of the real dataset.

But an obstacle to the utilization of the naive bayes classifier remains when we deal with a real problem. It seems that we cannot provide an explicit model for its deployment. The proposed representation by the PMML standard for instance is particularly unattractive. The interpretation of the model, especially the detection of the influence of each descriptor on the prediction of the classes is impossible.

This assertion is not entirely true. We have showed in a previous tutorial that we can extract an explicit model from the naive bayes classifier in the case of discrete predictors (see references). We obtain a linear combination of the binarized predictors. In this document, we show that the same mechanism can be implemented for the continuous descriptors. We use the standard Gaussian assumption for the conditional distribution of the descriptors. According to the heteroscedastic assumption or the homoscedastic assumption, we can provide a quadratic model or a linear model. This last one is especially interesting because we obtain a model that we can directly compare to the other linear classifiers (the sign and the values of the coefficients of the linear combination).

This tutorial is organized as follows. In the next section, we describe the approach. In the section 3, we show how to implement the method with Tanagra 1.4.37 (and later). We compare the results to those of the other linear methods. In the section 4, we compare the results provided by various data mining tools. We note that none of them proposes an explicit model that could be easy to deploy. They give only the estimated parameters of the conditional Gaussian distribution (mean and standard deviation). Last, in the section 5, we show the interest of the naive bayes classifier over the other linear methods when we handle a large dataset (the "mutant" dataset - 16,592 instances and 5,408 predictors). The computation time and the memory occupancy are clearly advantageous.

Keywords: naive bayes classifier, rapidminer 5.0.10, weka 3.7.2, knime 2.2.2, R software, package e1071, linear discriminant analysis, pls discriminant analysis, linear svm, logistic regression
Components : NAIVE BAYES CONTINUOUS, BINARY LOGISTIC REGRESSION, SVM, C-PLS, LINEAR DISCRIMINANT ANALYSIS
Tutorial: en_Tanagra_Naive_Bayes_Continuous_Predictors.pdf
Dataset: breast ; low birth weight
References :
Wikipedia, "Naive bayes classifier"
Tanagra, "Naive bayes classifier for discrete predictors"

Tuesday, October 19, 2010

Tanagra - Version 1.4.37

Naive Bayes Continuous is a supervised learning component. It implements the naive bayes principle for continuous predictors (gaussian assumption, heteroscedasticity or homoscedasticity). The main originality is that it provides an explicit model corresponding to a linear combination of predictors and, eventually, their square.

Enhancement of the reporting module.

Thursday, October 14, 2010

Filter methods for feature selection

The nature of the predictors' selection process has changed considerably. Previously, works in machine learning concentrated on the research of the best subset of features for a learning classifier, in the context where the number of candidate features was rather reduced and the computing time was not a major constraint. Today, it is common to deal with datasets comprising thousands of descriptors. Consequently, the problem of feature selection always consists in finding the most relevant subset of predictors but by introducing a new strong constraint: the computing time must remain reasonable.

In this tutorial, we are interested in correlation based filter approaches for discrete predictors. The goal is to highlight the most relevant subset of predictors which are highly correlated with the target attribute and, in the same time, which are weakly correlated between them i.e. which are not redundant. To evaluate the behavior of the various methods, we use an artificial dataset where we add irrelevant and redundant candidate variables. Then, we perform a feature selection based on the approaches analyzed. We compare the generalization error rate of the naive bayes classifier learned from the various subsets of selected variables. We lead the experimentation with Tanagra in a first time. Then, in a second time, we show how to perform the same analysis with other tools (Weka 3.6.0, Orange 2.0b, RapidMiner 4.6.0, R 2.9.2 - package FSelector).

Keywords: filter, feature selection, correlation based measure, discrete predictors, naive bayes classifier, bootstrap
Components: FEATURE RANKING, CFS FILTERING, MIFS FILTERING, FCBF FILTERING, MODTREE FILTERING, NAIVE BAYES, BOOTSTRAP
Tutorial: en_Tanagra_Filter_Method_Discrete_Predictors.pdf
Dataset: vote_filter_approach.zip
References:
Tanagra, "Feature Selection"

Monday, August 30, 2010

Connecting Sipina and Excel using OLE

The connection between a data mining tool and Excel (and more generally spreadsheet) is a very important issue. We had addressed many times this topic in our tutorials. With hindsight, I think the solution based on add-ins for Excel is the best one, both for SIPINA and for TANAGRA. It is simple, reliable and highly efficient. It does not require developing specific versions. The connection with Excel is a simple additional functionality of the standard distribution.

Prior to reaching this solution, we had explored different trails. In this tutorial, we present the XL-SIPINA software based on Microsoft's OLE technology. At the opposite of the add-in solution, this version of SIPINA chooses to embed Excel into the Data Mining tool. The system works rather well. Nevertheless, it has finally been dropped for two reasons: (1) we were forced to compile special versions that work only if Excel is installed on the user's machine; (2) the transferring time between Excel and Sipina using OLE is prohibitive when the database size grows.

Thus, XL-SIPINA is essentially an attempt short-lived. There is always a bit of nostalgia when I am back on solutions I have explored, and I have finally abandoned. Can be also I have not completely explored this solution.

Last, the application was initially developed for Office 97. I note that it still up to date today, it works fine with Office 2010.

Keywords: excel, tableur, sipina, xls, xlsx, xl-sipina, decision tree induction
Download XL-SIPINA: XL-SIPINA
Tutorial: en_xls_sipina.pdf
Dataset: autos

Friday, August 27, 2010

Sipina add-in for Excel

The data importation is a bottleneck for Data Mining Tools. The majority of users are working with a spreadsheet tool such as Excel, mainly in the coupling with specialized software for data mining (see KDnuggets polls). Therefore, a recurring issue for users is "how to send my data from Excel to SIPINA?"

It is possible to import different types of formats into SIPINA. About Excel workbooks, one particular device has been implemented.

An add-in is automatically copied to the computer during the installation process. It must be integrated into Excel. The add-in incorporates a new menu into Excel. After selecting the data range, the user only has to activate it, this leads to the following: (1) SIPINA starts automatically, (2) the data are transferred via the clipboard and (3) SIPINA considers the first row of the range of cells corresponds to the names of variables, (4) columns with numerical values of the variables are quantitative (5) columns with alphanumeric values are categorical variables.

Unlike the other tutorials, the sequence of manipulations is described in a video. The description is right only for the versions up to Excel 2003. Another tutorial about the using of the add-in under Office 2007 and Office 2010 is described below.

Keywords: excel file format, add-in, decision tree
Installing the add-in : sipina_xla_installation.htm
Using the add-in: sipina_xla_processing.htm

Tanagra add-in for Office 2007 and Office 2010

The "tanagra.xla" add-in for Excel contributes to the wide diffusion of Tanagra. The principle is simple. It is to embed a Tanagra menu in Excel. Thus the user can run statistical calculations without having to leave the spreadsheet. It seems simplistic. But this feature facilitates immensely the work of data miner. Indeed, the spreadsheet is one of the most used tools for preparing dataset (see KDNuggets Polls: Tools / Languages for Data Cleaning - 2008). By embedding the data mining tool in the spreadsheet environment, it avoids to the practitioner the tedious and repetitive manipulations: importing the dataset, exporting the dataset, checking the compatibilities between data file formats, etc.

The installation and the use of the "tanagra.xla" add-in under the previous versions of Office are described elsewhere (Office 1997 to Office 2003). This description is obsolete for the latest version of Office because the organization of the menus is modified for these versions i.e. Office 2007 and Office 2010. And yet, the add-in is still operational. In this tutorial, we show how to install and to use the Tanagra add-in under Office 2007 and 2010.

This transition to recent versions of Excel is absolutely not without consequences. Indeed, compared to the previous Excel versions, Excel 2007 (and 2010) and can handle more important rows and columns. We can process a dataset up to 1,048,575 observations (the first line corresponds to the variable names) and 16,384 variables. In this tutorial, we will treat a database with 100,000 observations and 22 variables (wave100k.xlsx). This is a version of the famous waveform database. Note that this file, because of the number of rows, cannot be manipulated by earlier versions of Excel.

The process described in this document is also valid for the SIPINA add-in (sipina.xla).

Keywords: data importation, excel, add-in
Components: VIEW DATASET
Tutorial: en_Tanagra_Add_In_Excel_2007_2010.pdf
Dataset: wave100k.xlsx
References:
Tanagra, "Tanagra and Sipina add-ins for Excel 2016", June 2016.
Tanagra, "Excel file handling using an add-in".
Tanagra, "OOo Calc file handling using an add-in".
Tanagra, "Launching Tanagra from OOo Calc under Linux".
Tanagra, "Sipina add-in for Excel"