Saturday, May 30, 2009

Understanding the "test value" criterion

The test value (VT) is a criterion used in several components of TANAGRA. It is mainly used to characterize a group of observations according to a continuous or categorical variable. The groups may be defined by the categories of a discrete variable; they can also be produced by a machine learning algorithm (e.g. a cluster from a clustering process, a node of a decision tree, etc.).

The principle is elementary: we compare the value of a descriptive statistic computed on the whole sample with the value computed on the subsample corresponding to the group. For a continuous variable, we compare the means; for a discrete one, we compare the proportions.

Despite, or because of, its simplicity, the VT is very useful. The formulation that we present in this tutorial is taken from the book by Lebart et al. (2000). The VT is used intensively in some commercial software such as SPAD (http://eng.spad.eu/). It can be used to characterize groups, but also to strengthen the interpretation of the factors extracted by a factorial analysis.

In this tutorial, we emphasize the formulas used for both categorical and continuous variables. We relate them to the results provided by the GROUP CHARACTERIZATION component of TANAGRA.

Keywords: test value, group characterization, clustering, factorial analysis
Components: Group characterization
Tutorial: en_Tanagra_Comprendre_La_Valeur_Test.pdf
Dataset: heart_disease_male.xls
Reference:
L. Lebart, A. Morineau, M. Piron, « Statistique exploratoire multidimensionnelle », Dunod, 2000 ; pages 181 to 184.

Friday, May 29, 2009

Descriptive statistics (continued)

The aim of descriptive statistics is to describe the main features of a collection of data in quantitative terms. Inspecting the whole data table is seldom useful. It is preferable to summarize the characteristics of the data with a few selected numerical indicators.

In this tutorial, we distinguish two kinds of descriptive approaches: the univariate tools, which summarize the characteristics of each variable individually, and the bivariate tools, which characterize the association between two variables. Depending on the type of the variables (categorical or continuous), we use different indicators.
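
To make the distinction concrete, here is a minimal Python sketch of one indicator of each kind, using only the standard library. The function names are hypothetical and the choice of indicators (frequencies, mean and standard deviation, Pearson correlation, chi-square statistic) is an illustrative assumption; it does not reproduce the output of the TANAGRA components.

```python
import math
import statistics
from collections import Counter

def univariate_discrete(values):
    """Univariate, categorical: relative frequency of each category."""
    n = len(values)
    return {category: count / n for category, count in Counter(values).items()}

def univariate_continuous(values):
    """Univariate, continuous: mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)

def pearson_r(x, y):
    """Bivariate, continuous/continuous: linear correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def chi_square(a, b):
    """Bivariate, categorical/categorical: chi-square statistic of the contingency table."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    observed = Counter(zip(a, b))      # missing cells count as 0
    stat = 0.0
    for value_a, n_a in count_a.items():
        for value_b, n_b in count_b.items():
            expected = n_a * n_b / n   # count expected under independence
            stat += (observed[(value_a, value_b)] - expected) ** 2 / expected
    return stat
```

The chi-square function returns only the statistic; turning it into a p-value requires the chi-square distribution with (rows - 1) x (columns - 1) degrees of freedom, which is left out of this sketch.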

Keywords: descriptive statistics
Components: UNIVARIATE DISCRETE STAT, CONTINGENCY CHI-SQUARE, UNIVARIATE CONTINUOUS STAT, SCATTERPLOT, LINEAR CORRELATION, GROUP CHARACTERIZATION
Tutorial: en_Tanagra_Descriptive_Statistics.pdf
Dataset: enquete_satisfaction_femmes_1953.xls
References:
Tanagra Tutorials, "Descriptive statistics"

Friday, May 1, 2009

ID3 on a large dataset

In data mining, the increasing size of datasets has been one of the major challenges of recent years. The ability to handle large datasets is an important criterion for distinguishing research software from commercial software.

Commercial tools often have very efficient data management systems, which limit the amount of data loaded into memory at each step of the processing. Research tools, by contrast, keep all the data in memory, so the limit is the memory capacity of the machine. This is certainly a drawback when processing large files. We note, however, that powerful computers are now available at low cost, so this limit keeps being pushed back. With an appropriate encoding strategy, the whole dataset can fit in memory, even for a large data file.
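
The effect of the encoding on memory occupation can be estimated with simple arithmetic. The sketch below is an illustration only: the function names are hypothetical, and the two encodings compared (8 bytes per value, as for a double-precision number, versus 1 byte per value, as for a category code below 256) are assumptions, not a description of TANAGRA's internal storage. The usage lines take the dimensions of the file handled in this tutorial, 581,012 observations and 55 variables.

```python
def footprint_bytes(n_rows, n_cols, bytes_per_value):
    """Memory needed to hold a dense table, ignoring any container overhead."""
    return n_rows * n_cols * bytes_per_value

def to_mib(n_bytes):
    """Convert a number of bytes to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 ** 2)

# Dimensions of the file used in this tutorial.
n_rows, n_cols = 581_012, 55

as_doubles = footprint_bytes(n_rows, n_cols, 8)   # every value stored on 8 bytes
as_codes = footprint_bytes(n_rows, n_cols, 1)     # every value stored on 1 byte

print(f"8 bytes per value: {to_mib(as_doubles):.1f} MiB")
print(f"1 byte per value : {to_mib(as_codes):.1f} MiB")
```

Under these assumptions the compact encoding divides the footprint by eight; the real saving depends on how many variables can actually be stored as small codes.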

In this tutorial, we show how to import a file with 581,012 observations and 55 variables, and then how to build a decision tree with the ID3 method. Unlike other decision tree algorithms such as C4.5 or CART, ID3 determines the right size of the tree with a pre-pruning rule. We will see that the computation is fast because of this characteristic.
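
The idea of pre-pruning can be sketched as follows: at each node, the best split is selected with the information gain, and the node is left as a leaf when a stopping rule applies, so no tree is grown and then cut back afterwards. This is a minimal sketch under stated assumptions: the function names, the two stopping rules (minimum node size and minimum gain) and their default thresholds are hypothetical and are not the settings of the ID3 component.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on the categorical attribute attr."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

def choose_split(rows, labels, attrs, min_gain=0.03, min_size=10):
    """Return the attribute to split on, or None when pre-pruning stops the growth.

    min_gain and min_size are illustrative thresholds, not TANAGRA defaults.
    """
    if len(labels) < min_size or not attrs:
        return None                    # node too small, or nothing left to test
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    if information_gain(rows, labels, best) < min_gain:
        return None                    # best split not informative enough
    return best
```

Because the decision to stop is taken while the tree is being grown, each node is evaluated once and no post-pruning pass over a fully grown tree is needed, which is consistent with the fast computation mentioned above.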

Keywords: large dataset, decision tree algorithm, ID3
Components: ID3, SPV LEARNING
Tutorial: en_Tanagra_Big_Dataset.pdf
Dataset: covtype.zip
References:
Tanagra Tutorials, "Performance comparison under Linux"
Tanagra Tutorials, "Decision tree and large dataset"