Tanagra - Data Mining and Data Science Tutorials: July 2012

Friday, July 6, 2012

CVM and BVM from the LIBCVM toolkit

Support vector machine (SVM) algorithms are well known in supervised learning. They are especially appropriate when we handle a dataset with a large number “p” of descriptors, but they are much less efficient when the number of instances “n” is very high. Indeed, a naive implementation has complexity O(n^3) in computation time and O(n^2) in memory. Consequently, instead of the optimal solution, learning algorithms often settle for near-optimal solutions that can be obtained in a tractable computation time.
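To give an order of magnitude for these complexities, here is a minimal Python sketch. It assumes that the O(n^2) storage corresponds to a dense n x n kernel matrix of 8-byte floating-point values; the function names and the sample sizes are purely illustrative and do not come from LIBSVM or LIBCVM.

```python
def kernel_matrix_gib(n_instances, bytes_per_value=8):
    """Memory (in GiB) of a dense n x n matrix, assuming 8-byte values."""
    return n_instances ** 2 * bytes_per_value / 2 ** 30


def naive_time_ratio(n_before, n_after):
    """Factor by which an O(n^3) computation grows between two sample sizes."""
    return (n_after / n_before) ** 3


if __name__ == "__main__":
    # Hypothetical sample sizes, chosen only for illustration.
    for n in (1_000, 10_000, 100_000):
        print(f"n = {n:>7}: {kernel_matrix_gib(n):10.3f} GiB")
    print("Doubling n multiplies an O(n^3) time by", naive_time_ratio(1, 2))
```

Under these assumptions, multiplying the number of instances by 10 multiplies the memory by 100 and the naive computation time by 1000, which is why large "n" is the difficult case.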

I recently discovered the CVM (Core Vector Machine) and BVM (Ball Vector Machine) approaches. The authors' idea is really clever: since only approximately optimal solutions can be obtained in practice, their approaches solve an equivalent problem that is easier to handle, the minimum enclosing ball problem from computational geometry, in order to detect the support vectors. The resulting classifier performs as well as those obtained by the other SVM learning algorithms, but with an enhanced ability to process datasets with a large number of instances.
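To illustrate the minimum enclosing ball problem itself, here is a minimal Python sketch of a simple iterative approximation scheme (the Badoiu-Clarkson update) in ordinary Euclidean space. This is not the LIBCVM implementation: CVM and BVM work in a kernel-induced feature space and add their own machinery. The function name and the tolerance parameter are assumptions made for this example.

```python
import math


def approx_min_enclosing_ball(points, eps=0.05):
    """Return (center, radius) of a ball enclosing all the points.

    Simple Badoiu-Clarkson scheme: start from one point, then repeatedly
    move the center a shrinking step toward the current farthest point.
    After about 1/eps^2 steps the radius is within a factor (1 + eps)
    of the optimal one.
    """
    center = list(points[0])
    n_steps = math.ceil(1.0 / eps ** 2)
    for k in range(1, n_steps + 1):
        # The farthest point is the one that most violates the current ball.
        farthest = max(points, key=lambda p: math.dist(p, center))
        step = 1.0 / (k + 1)
        center = [c + step * (f - c) for c, f in zip(center, farthest)]
    # The final radius is chosen so that every point is enclosed.
    radius = max(math.dist(p, center) for p in points)
    return center, radius


if __name__ == "__main__":
    square = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
    center, radius = approx_min_enclosing_ball(square)
    print("center:", center, "radius:", radius)
```

Note that each step only looks at one "farthest" point. The few points that end up driving the center play a role comparable to the core set from which the support vectors are detected.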

I found the papers really interesting. They are all the more interesting because everything needed to reproduce the experiments is provided: the program and the datasets. All the results shown in the papers can therefore be verified. This contrasts with the too numerous papers in which authors flaunt tremendous results that nobody can ever reproduce.

The CVM and BVM methods are implemented in the LIBCVM library. It is an extension of LIBSVM (version 2.85), which is already included in Tanagra. Since the source code of LIBCVM is available, I compiled it as a DLL (dynamic-link library) and included it in Tanagra 1.4.44 as well.

In this tutorial, we describe the behavior of the CVM and BVM supervised learning methods on the "Web" dataset available on the authors' website. We compare their results and computation times with those of the C-SVC algorithm from the LIBSVM library.

Keywords: support vector machine, svm, libcvm, cvm, bvm, libsvm, c-svc
Components: SELECT FIRST EXAMPLES, CVM, BVM, C-SVC
Tutorial: en_Tanagra_LIBCVM_library.pdf
Dataset: w8a.txt.zip
References:
I.W. Tsang, A. Kocsor, J.T. Kwok, "LIBCVM Toolkit", Version 2.2 (beta).
C.C. Chang, C.J. Lin, "LIBSVM -- A Library for Support Vector Machines".

Wednesday, July 4, 2012

Revolution R Community 5.0

The R software is a fascinating project. It has become a reference tool for data mining. Thanks to the R package system, its features can be extended almost without limit. Almost all existing statistical and data mining techniques are available in R.

Yet while there are many packages, very few projects aim to improve the R core itself. The source code is freely available, so in theory anyone can modify part or even all of the software. Revolution Analytics offers an improved version of R, Revolution R Enterprise. According to their website, it dramatically speeds up some calculations, it can handle very large databases, and it provides a visual development environment with a debugger. Unfortunately, this is a commercial tool, so I could not check these features. Fortunately, a community version is available, and I downloaded it to study its behavior.

Revolution R Community is a slightly improved version of base R. The enhancements essentially concern computing performance: it incorporates the Intel Math Kernel Library, which is especially efficient for matrix calculations, and in some circumstances it can also take advantage of multi-core processors. Performance benchmarks are available on the vendor's website. The results are impressive, but we note that they are based on artificially generated datasets.

In this tutorial, we extend the benchmark to other data mining methods. We analyze the behavior of Revolution R Community 5.0 (64-bit version) in various contexts: binary logistic regression (glm); linear discriminant analysis (lda from the MASS package); decision tree induction (rpart from the rpart package); and principal component analysis computed in two different ways, the first from the eigenvalues and eigenvectors of the correlation matrix (princomp), the second from a singular value decomposition of the data matrix (prcomp).
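The two routes to principal component analysis mentioned above can be sketched in Python with NumPy. This is only an analogue of the two principles, not a reimplementation of R's princomp and prcomp (whose defaults for scaling and divisors differ); the function names and the random data are assumptions made for this example.

```python
import numpy as np


def pca_variances_eigen(X):
    """Component variances from the eigenvalues of the correlation matrix."""
    R = np.corrcoef(X, rowvar=False)
    eigenvalues = np.linalg.eigh(R)[0]  # eigh returns ascending order
    return eigenvalues[::-1]


def pca_variances_svd(X):
    """Component variances from the SVD of the standardized data matrix."""
    n = X.shape[0]
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    singular_values = np.linalg.svd(Z, compute_uv=False)  # descending order
    return singular_values ** 2 / (n - 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))  # artificial data, for illustration only
    print(pca_variances_eigen(X))
    print(pca_variances_svd(X))
```

Both functions return the same variances up to rounding errors, because the correlation matrix equals Z'Z / (n - 1) when Z is the standardized data matrix. The computational cost differs, though: the first route forms a p x p matrix before decomposing it, while the second decomposes the n x p data matrix directly, which is why the two can behave differently in a benchmark.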

Keywords: R software, revolution analytics, revolution r community, logistic regression, glm, linear discriminant analysis, lda, principal component analysis, pca, princomp, prcomp, matrix calculations, eigenvalues, eigenvectors, singular value decomposition, svd, decision tree, cart, rpart
Tutorial: en_Tanagra_Revolution_R_Community.pdf
Dataset: revolution_r_community.zip
References:
Revolution Analytics, "Revolution R Community".

Monday, July 2, 2012

Introduction to SAS proc logistic

In my courses at the university, I use only free data mining tools (R, Tanagra, Sipina, Knime, Orange, etc.) and spreadsheet applications (free or not). Sometimes my students ask me whether commercial tools (e.g. SAS, which is very popular in France) behave differently, in terms of how they are used or how their results are read. I tell them that some of these commercial tools are available on the computers of our department, and that they can learn to use them by starting from the tutorials available on the Web.

Unfortunately, such tutorials about logistic regression are scarce, especially in French. We need a didactic document with clear screenshots that shows how to: (1) import a data file into a SAS library; (2) define an analysis with the appropriate settings; (3) read and understand the results.

In this tutorial, we describe the use of SAS PROC LOGISTIC (SAS 9.3). We measure its speed on a moderately sized dataset. We compare the results with those of Tanagra 1.4.43.

Keywords: sas, proc logistic, binary logistic regression
Components: BINARY LOGISTIC REGRESSION
Tutorial: en_Tanagra_SAS_Proc_Logistic.pdf
Dataset: wave_proc_logistic.zip
References:
SAS - "The LOGISTIC Procedure"
Tanagra - "Logistic regression - Software comparison"
Tanagra - "Logistic regression on large dataset"