Monday, June 29, 2009
Outliers can be detected on a single variable (a man who is 158 years old) or on a combination of variables (a 12-year-old boy who runs 100 yards in 10 seconds). In this tutorial, we show how to use the UNIVARIATE OUTLIER DETECTION component. It is intended for univariate outlier detection, i.e. it examines each variable individually.
The approaches implemented in the component come from the NIST website (see reference). We also use an additional rule based on the x-sigma deviation, which flags values lying more than x standard deviations from the mean of the variable.
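To make the two kinds of rule concrete, here is a minimal Python sketch; it is not the code of the component. It assumes that the x-sigma rule uses the sample standard deviation, and it takes the box-plot fences (Q1 - 1.5 x IQR and Q3 + 1.5 x IQR) described in the NIST handbook as one example of a NIST-style rule. The function names and the age values are invented for illustration.

```python
import statistics


def sigma_outliers(values, x=3.0):
    """Flag values lying more than x sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > x * sd]


def fence_outliers(values, k=1.5):
    """Flag values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical ages; 158 is the suspicious value from the example above.
ages = [23, 31, 35, 38, 41, 44, 47, 52, 58, 158]

print(sigma_outliers(ages, x=3.0))
print(sigma_outliers(ages, x=2.0))
print(fence_outliers(ages))
```

On this small sample the 3-sigma rule flags nothing, because the extreme value itself inflates the standard deviation; with x = 2, or with the quartile-based fences, 158 is isolated. This is one reason why it is useful to have several rules side by side.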
Keywords: outlier, influential point
Components: MORE UNIVARIATE CONT STAT, SCATTERPLOT WITH LABEL, UNIVARIATE OUTLIER DETECTION, UNIVARIATE CONT STAT
NIST/SEMATECH, "e-Handbook of Statistical Methods", Section 7.1.6, "What are outliers in the data?"
R. High, "Dealing with 'Outliers': How to Maintain Your Data's Integrity"
When we define the data analysis process in Tanagra, it is possible to copy components (or entire branches of components) to another location in the diagram. This feature is very helpful when we have to repeat processing sequences in different parts of the diagram. The settings are also duplicated.
In this tutorial, we show how to copy a component or a branch. We will see that this feature is helpful when, for instance, we compare the performance of supervised learning algorithms on the same dataset. In this context, the processing sequence is always the same; only the method that we want to evaluate changes.
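Outside of the Tanagra diagram, the same idea (one fixed evaluation sequence, with only the learning method swapped) can be sketched in a few lines of Python. This is an illustration only: the two toy methods, the function names and the data are hypothetical, and the fold assignment is a simple interleaving chosen to keep the example deterministic.

```python
from collections import Counter


def majority_class(train_x, train_y):
    """Baseline: always predict the most frequent training label."""
    label = Counter(train_y).most_common(1)[0][0]
    return lambda x: label


def one_nearest_neighbour(train_x, train_y):
    """1-NN on a single numeric feature."""
    def predict(x):
        nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[nearest]
    return predict


def cross_val_accuracy(fit, xs, ys, k=2):
    """Same sequence for every method: k interleaved folds, pooled accuracy."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        correct += sum(predict(xs[i]) == ys[i] for i in test)
    return correct / len(xs)


# Made-up one-dimensional dataset with two classes.
xs = [0, 1, 2, 3, 4, 5, 10, 11, 12, 13]
ys = ["a"] * 6 + ["b"] * 4

# Only the method changes; the evaluation sequence is reused as-is.
methods = {"majority class": majority_class, "1-NN": one_nearest_neighbour}
results = {name: cross_val_accuracy(fit, xs, ys) for name, fit in methods.items()}
print(results)
```

Adding another method to the comparison only requires a new entry in `methods`, which plays the same role as pasting the copied branch and replacing the learning component.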
Here we work within a single project: components cannot be copied and pasted between two open projects. In another tutorial, however, we show how to save part of the diagram to an external file, so that the same processing sequence can be applied to multiple datasets.
Keywords: copy paste, diagram management, comparison of classifiers, supervised learning, cross validation, dimensionality reduction
Components: Supervised learning, Binary logistic regression, C-PLS, C-SVC, Linear discriminant analysis, K-NN, Principal Component Analysis
Saturday, June 27, 2009
We have already described the association rule mining tools of Tanagra in several tutorials. The A PRIORI approach is certainly the most popular. But, despite its good properties, this method has a drawback: the number of rules obtained can be very high. Highlighting the most interesting rules, those which are relevant, becomes a major challenge.
In this tutorial, we show how to use the A PRIORI MR component. It differs from the others by offering additional tools for exploring and assessing the mined rules: original measures based on the "test value" principle offer a different way to evaluate the rules; the ability to copy the results into a spreadsheet allows a more detailed exploration of the rule base; and by subdividing the dataset into train and test sets, we obtain more reliable values of the interestingness measures of the rules.
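The benefit of the train and test subdivision can be illustrated with a small Python sketch. It computes three standard interestingness measures (support, confidence, lift) for one rule on each subset; it does not implement the "test value" measures of the component. The function name, the rule and the transactions are made up for the example.

```python
def rule_measures(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)         # contain the antecedent
    n_c = sum(1 for t in transactions if c <= t)         # contain the consequent
    n_ac = sum(1 for t in transactions if (a | c) <= t)  # contain both
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return {"support": support, "confidence": confidence, "lift": lift}


# Hypothetical transactions, already split into a train and a test set.
train = [{"bread", "butter"}, {"bread", "butter", "milk"}, {"bread"}, {"milk"}]
test = [{"bread", "butter"}, {"bread", "milk"}, {"butter"}, {"milk"}]

train_m = rule_measures(train, ["bread"], ["butter"])
test_m = rule_measures(test, ["bread"], ["butter"])
print(train_m)
print(test_m)
```

In this toy case the rule looks interesting on the train set (lift above 1) but its lift falls to 1 on the test set, i.e. no association at all. Measures computed on data that were not used to mine the rules give a less optimistic picture.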
Keywords: association rule, a priori algorithm, interestingness measure, test value principle
Components: A PRIORI MR
Wikipedia, "Association rule learning"
Sunday, June 14, 2009
The implementation of two-step clustering (also called "Hybrid Clustering") under Tanagra is already described elsewhere. Following the recommendation of Lebart et al. (2000), we run the clustering algorithm on the latent variables supplied by a PCA (Principal Component Analysis) computed from the original variables. This preprocessing cleans the dataset by removing irrelevant information such as noise. In this tutorial, we show the efficiency of the approach on a large dataset with 500,000 observations and 68 variables. We use Tanagra 1.4.27 and R 2.7.2, which are the only tools that make it easy to implement the whole process.
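The two-step idea can be sketched in plain Python under a few assumptions: the points are taken to be the PCA scores already (the PCA step is omitted), the first step is a k-means that produces more groups than needed, and the second step merges these groups with Ward's criterion. This is how hybrid clustering is commonly described, not a transcription of the Tanagra or R code; the deterministic k-means start and the tiny dataset are choices made only to keep the example reproducible.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Centroid of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iters=20):
    """Lloyd iterations from evenly spaced starting points; returns non-empty groups."""
    step = len(points) // k
    centers = [points[i * step] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # An empty group keeps its previous center.
        centers = [mean_point(g) if g else centers[j] for j, g in enumerate(groups)]
    return [g for g in groups if g]


def ward_merge(groups, n_final):
    """Merge the pair with the smallest Ward cost until n_final groups remain."""
    groups = [list(g) for g in groups]
    while len(groups) > n_final:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                ni, nj = len(groups[i]), len(groups[j])
                gap = sq_dist(mean_point(groups[i]), mean_point(groups[j]))
                cost = ni * nj / (ni + nj) * gap  # increase in within-group inertia
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        groups[i].extend(groups[j])
        del groups[j]
    return groups


# Toy "PCA scores": three well-separated 3 x 3 grids of points (hypothetical data).
corners = [(0, 0), (10, 0), (0, 10)]
points = [(ox + dx, oy + dy) for ox, oy in corners for dx in range(3) for dy in range(3)]

pre_groups = kmeans(points, k=6)              # step 1: many small groups
final = ward_merge(pre_groups, n_final=3)     # step 2: HAC on the pre-groups
print(len(pre_groups), [len(g) for g in final])
```

The point of the first step is that the quadratic cost of the hierarchical step is paid on a handful of pre-groups rather than on every observation, which is what makes the approach practical on a dataset of 500,000 rows.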
Keywords: clustering, hierarchical cluster analysis, HCA, k-means, principal component analysis, PCA
Components: PRINCIPAL COMPONENT ANALYSIS, K-MEANS, HAC, GROUP CHARACTERIZATION, EXPORT DATASET
L. Lebart, A. Morineau, M. Piron, "Statistique Exploratoire Multidimensionnelle", Dunod, 2000; chapter 2, sections 2.3 and 2.4.
D. Garson, "Cluster Analysis" from North Carolina State University.
Thursday, June 11, 2009
The K-Means approach is already described in several tutorials (http://data-mining-tutorials.blogspot.com/search?q=k-means). The goal here is to compare its implementation across various free tools. We study the following tools: Tanagra 1.4.28; R 2.7.2 without additional packages; Knime 1.3.5; Orange 1.0b2; and RapidMiner Community Edition.
Keywords: clustering, k-means, PCA, principal component analysis, MDS, multidimensional scaling
Components: PRINCIPAL COMPONENT ANALYSIS, K-MEANS, GROUP CHARACTERIZATION, EXPORT DATASET
D. Garson, "Cluster Analysis"