Monday, April 26, 2010

Linear discriminant analysis on PCA factors

In this tutorial, we show that, in certain circumstances, it is more convenient to use the factors computed by a principal component analysis on the original attributes as input features for the linear discriminant analysis algorithm.

The new representation space preserves the proximities between the examples. The new features, known as "factors" or "latent variables", are linear combinations of the original descriptors and have several advantageous properties: (a) their interpretation often makes it possible to detect patterns in the initial space; (b) a small number of factors is enough to restore most of the information contained in the data, and we can moreover remove noise from the dataset by keeping only the most relevant factors (a kind of regularization, obtained by smoothing the information provided by the dataset); (c) the new features form an orthogonal basis, so learning algorithms such as linear discriminant analysis behave better.
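
To fix ideas, here is a minimal sketch of this PCA-then-LDA pipeline in Python with scikit-learn; the file name, the "class" column and the number of retained factors are illustrative assumptions, not taken from the tutorial.

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# hypothetical flat-file export of the waveform dataset
df = pd.read_csv("waveform.csv")
X, y = df.drop(columns="class"), df["class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

# the factors are computed without using the class labels;
# only the leading factors are passed to the linear discriminant analysis
model = make_pipeline(StandardScaler(), PCA(n_components=10), LinearDiscriminantAnalysis())
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))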

This approach is related to reduced-rank linear discriminant analysis. But, unlike the latter, the class information is not needed when computing the principal components. The computation can also be very fast on very high-dimensional datasets when an appropriate algorithm (such as NIPALS) is used. On the other hand, standard reduced-rank LDA seems to perform better in terms of classification accuracy.
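
About NIPALS: it extracts the factors one at a time, alternating score and loading updates and deflating the data matrix, which avoids building and diagonalizing the full covariance matrix. The sketch below is a generic illustration of this scheme (my own code, not the implementation used by any particular tool); the data matrix is assumed to be centered.

import numpy as np

def nipals_pca(X, n_components, tol=1e-8, max_iter=500):
    # X is assumed to be centered; returns the scores T and the loadings P
    X = X.astype(float).copy()
    n, p = X.shape
    T = np.zeros((n, n_components))
    P = np.zeros((p, n_components))
    for k in range(n_components):
        t = X[:, 0].copy()                      # initial score vector
        for _ in range(max_iter):
            w = X.T @ t / (t @ t)               # loading estimate
            w /= np.linalg.norm(w)              # normalize the loading
            t_new = X @ w                       # updated score vector
            if np.linalg.norm(t_new - t) < tol:
                t = t_new
                break
            t = t_new
        T[:, k], P[:, k] = t, w
        X = X - np.outer(t, w)                  # deflation: remove the extracted factor
    return T, P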

Keywords: linear discriminant analysis, principal component analysis, reduced-rank linear discriminant analysis
Components: Supervised Learning, Linear discriminant analysis, Principal Component Analysis, Scatterplot, Train-test
Tutorial: en_dr_utiliser_axes_factoriels_descripteurs.pdf
Dataset: dr_waveform.bdm
References:
Wikipedia, "Linear discriminant analysis".

Thursday, April 22, 2010

Induction of fuzzy rules using Knime

This tutorial is the continuation of the one devoted to the induction of decision rules (Supervised rule induction - Software comparison). I did not include Knime in that comparison because it implements a method which differs from the other tools: Knime computes fuzzy rules and requires the target variable to be continuous. That seems rather mysterious in the supervised learning context, where the class attribute is usually discrete. I thought it more appropriate to detail the implementation of the method in a tutorial exclusively devoted to the Knime rule learner (version 2.1.1).

In particular, it is important to explain the reason for the data preparation and how to read the results. As a point of reference, we compare the results with those provided by the rule induction tool of Tanagra.
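
To give an idea of the kind of preparation involved when a rule learner expects a numeric target, the sketch below (my own illustration in Python with pandas, not Knime's code) recodes a discrete class as a numeric column; the separator and the column name "type" are assumptions.

import pandas as pd

# hypothetical recoding of the discrete class as numbers
iris = pd.read_csv("iris2D.txt", sep="\t")
codes = {label: float(i) for i, label in enumerate(sorted(iris["type"].unique()))}
iris["type_num"] = iris["type"].map(codes)      # continuous-looking target
iris.drop(columns="type").to_csv("iris2D_prepared.txt", sep="\t", index=False)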

Scientific papers about the method are available online (see the references below).

Keywords: induction of rules, supervised learning, fuzzy rules
Components: SAMPLING, RULE INDUCTION, TEST
Tutorial: en_Tanagra_Induction_Regles_Floues_Knime.pdf
Dataset: iris2D.txt
References:
M.R. Berthold, "Mixed fuzzy rule formation", International Journal of Approximate Reasoning, 32, pp. 67-84, 2003.
T.R. Gabriel, M.R. Berthold, "Influence of fuzzy norms and other heuristics on mixed fuzzy rule formation", International Journal of Approximate Reasoning, 35, pp. 195-202, 2004.

Friday, April 16, 2010

"Wrapper" for feature selection (continuation)

This tutorial is the continuation of the preceding one about wrapper feature selection in the supervised learning context (http://data-mining-tutorials.blogspot.com/2010/03/wrapper-for-feature-selection.html). There, we analyzed the behavior of Sipina and described the source code of the wrapper process (forward search) under R (http://www.r-project.org/). Now, we show how to use the same principle under Knime 2.1.1, Weka 3.6.0 and RapidMiner 4.6.

The approach is as follows: (1) we use the training set to select the most relevant variables for classification; (2) we learn the model on the selected descriptors only; (3) we assess the performance on a test set containing all the descriptors.

This third point is very important. We cannot know in advance which variables will finally be selected, so we should not have to prepare the test file by hand, keeping only those retained by the wrapper procedure. This is essential for automating the process: otherwise, each change of settings in the wrapper procedure leading to another subset of descriptors would require us to edit the test file manually, which is very tedious.
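
To make the three steps explicit, here is a sketch of the same protocol in Python with scikit-learn; the file names are hypothetical, the descriptors are assumed to be already recoded as numbers, and this is not the internal procedure of any of the tools above, only the principle written out.

import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

train = pd.read_csv("mushroom_train.csv")   # hypothetical train/test files
test = pd.read_csv("mushroom_test.csv")     # the test file keeps ALL the descriptors
X_train, y_train = train.drop(columns="class"), train["class"]
X_test, y_test = test.drop(columns="class"), test["class"]

# (1) forward wrapper search driven by cross-validated accuracy on the training set
selected, remaining, best_score = [], list(X_train.columns), 0.0
improved = True
while improved and remaining:
    improved = False
    for var in list(remaining):
        score = cross_val_score(GaussianNB(), X_train[selected + [var]], y_train, cv=5).mean()
        if score > best_score:
            best_score, best_var, improved = score, var, True
    if improved:
        selected.append(best_var)
        remaining.remove(best_var)

# (2) learn the final model on the selected descriptors only
model = GaussianNB().fit(X_train[selected], y_train)

# (3) evaluate on the test set, which still contains every descriptor;
#     the selected columns are simply picked out at prediction time
print(selected, model.score(X_test[selected], y_test))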

In light of this specification, it turned out that only Knime was able to implement the complete process. With the other tools, it is possible to select the relevant variables on the training file, but I could not (or did not know how to) apply the model to a test file containing all the original variables.

The naive bayes classifier is the learning method used in this tutorial.

Keywords: feature selection, supervised learning, naive bayes classifier, wrapper, knime, weka, rapidminer
Tutorial: en_Tanagra_Wrapper_Continued.pdf
Dataset: mushroom_wrapper.zip
References:
JMLR Special Issue on Variable and Feature Selection - 2003.
R. Kohavi, G. John, "The wrapper approach", 1997.
Wikipedia, "Naive bayes classifier".