Monday, June 29, 2009

Univariate outlier detection methods

The detection and the treatment of outliers (individuals with unusual values) is an important task of data preparation. Unusual values can mislead results of subsequent data analysis.

Outliers can be detected on one variable (a man with 158 years old) or on a combination of variables (a boy with 12 years old crosses the 100 yards in 10 seconds). In this tutorial, we show how to use the UNIVARIATE OUTLIER DETECTION component. It is intended to univariate detection of outliers i.e. taking into account individually the variables.

The approaches implemented in the component come from the NIST website (see reference). We use also an additional rule based on the x-sigma deviation from the mean of the variable.

Keywords: outlier, influential point
Components: MORE UNIVARIATE CONT STAT, SCATTERPLOT WITH LABEL, UNIVARIATE OUTLIER DETECTION, UNIVARIATE CONT STAT
Tutorial: en_Tanagra_Outliers_Detection.pdf
Dataset: body_mass_index.xls
References:
NIST/SEMATECH, « e-Handbook of Statistical Methods », Section 7.1.6, « What are outliers in the data ? »
R. High, "Dealing with 'Outliers': How to Maintain Your Data's Integrity"

Copy paste feature into the diagram

When we define the data analysis process into Tanagra, it is possible to copy components (or entire branches of components) towards another location into the diagram. This feature is very helpful when we have to repeat sequences of treatments in different parts of the diagram. The settings are also duplicated.

In this tutorial, we show how to copy a component or a branch. We will see that this feature is helpful when, for instance, we deal with the performance comparisons of supervised learning algorithms on the same dataset. In this context, the processing sequence is always the same, only the method that we want to evaluate is different.

We work on the same project here. We cannot copy paste components between two opened projects. But, in another tutorial, we show how to save a part of the diagram in an external file. Thus, the same processing sequence can be applied on multiple datasets.

Keywords: copy paste, diagram management, comparison of classifiers, supervised learning, cross validation, dimensionality reduction
Components: Supervised learning, Binary logistic regression, C-PLS, C-SVC, Linear discriminant analysis, K-NN, Principal Component Analysis
Tutorial: en_Tanagra_Diagram_New_Features.pdf
Dataset: sonar.xls

Saturday, June 27, 2009

The A PRIORI MR component

Association rule learning is a popular method for discovering interesting relations between variables in large databases. It was often used in market basket analysis domain e.g. if a customer buys onions and potatoes then he buys also beef. But, in fact, it can be implemented in various application areas where we want to discover the association between variables.

We were already described the association rule mining tools of Tanagra in several tutorials. The A PRIORI approach is certainly the most popular. But, despite its good properties, this method has a drawback: the number of obtained rules can be very high. The ability to underline the most interesting rules, those which are relevant, becomes a major challenge.

In this tutorial, we show to implement the A PRIORI MR component. It differentiates oneself from other by offering additional tools for exploring and assessing the mined rules: original measures based on the “test value” principle allow to evaluate differently the rules; the ability to copy the results into a spreadsheet allows a more detailed exploration of the rule base; by subdividing the dataset into train and test sets, we obtain a more reliable values of the interestingness measures of rules.

Keywords: association rule, a priori algorithm, interestingness measure, test value principle
Components: A PRIORI MR
Tutorial: en_Tanagra_APrioriMR_Component.pdf
Dataset: credit_assoc.xls
Reference:
Wikipedia, "Association rule learning"

Sunday, June 14, 2009

Two-step clustering for handling large databases

The aim of the clustering is to identify homogenous subgroups of instance in a population. In this tutorial, we implement a two-step clustering algorithm which is well-suited when we deal with a large dataset. It combines the ability of the K-Means clustering to handle a very large dataset, and the ability of the Hierarchical clustering (HCA – Hierarchical Cluster Analysis) to give a visual presentation of the results called “dendrogram”. This one describes the clustering process, starting from unrefined clusters, until the whole dataset belongs to one cluster. It is especially helpful when we want to detect the appropriate number of clusters.

The implementation of the two-step clustering (called also “Hybrid Clustering”) under Tanagra is already described elsewhere. According to the Lebart and al. (2000) recommendation , we perform the clustering algorithm on the latent variables supplied by a PCA (Principal Component Analysis) computed from the original variables. This pre-treatment cleans the dataset by removing the irrelevant information such as noise, etc. In this tutorial, we show the efficiency of the approach on a large dataset with 500,000 observations and 68 variables. We use Tanagra 1.4.27 and R 2.7.2 which are the only tools which allow to implement easily the whole process.

Keywords: clustering, hierarchical cluster analysis, HCA, k-means, principal component analysis, PCA
Components: PRINCIPAL COMPONENT ANALYSIS, K-MEANS, HAC, GROUP CHARACTERIZATION, EXPORT DATASET
Tutorial: en_Tanagra_CAH_Mixte_Gros_Volumes.pdf
Dataset: sample-census.zip
References:
L. Lebart, A. Morineau, M. Piron, « Statistique Exploratoire Multidimensionnelle », Dunod, 2000 ; chapter 2, sections 2.3 et 2.4.
D. Garson, "Cluster Analysis" from North Carolina State University.

Thursday, June 11, 2009

K-Means - Comparison of free tools

K-means is a clustering (unsupervised learning) algorithm. The aim is to create homogeneous subgroups of examples. The individuals in the same subgroup are similar; the individuals in different subgroups are as different as possible.

The K-Means approach is already described in several tutorials (http://data-mining-tutorials.blogspot.com/search?q=k-means). The goal here is to compare its implementation with various free tools. We study the following tools: Tanagra 1.4.28; R 2.7.2 without additional package; Knime 1.3.5; Orange 1.0b2 and RapidMiner Community Edition.

Keywords: clustering, k-means, PCA, principal component analysis, MDS,multidimensional scaling
Components: PRINCIPAL COMPONENT ANALYSIS, K-MEANS, GROUP CHARACTERIZATION, EXPORT DATASET
Tutorial: en_Tanagra_et_les_autres_KMeans.pdf
Dataset: cars_dataset.zip
Reference:
D. Garson, "Cluster Analysis"