Friday, November 7, 2008

HAC and Hybrid Clustering

HAC (Hierarchical Agglomerative Clustering) is a clustering method that produces “natural “ groups of examples characterized by attributes. A tree, called dendrogram, where successive agglomerations are showed, starting from one example per cluster, until the whole dataset belong to one cluster, describes the clustering process.

The main advantage of HAC is the user can guess the right partitioning by visualizing the tree, he usually prune the tree between nodes presenting an important variation. The main disadvantage is that requires the computation of distances between each example, which is very time consuming when the dataset size increases.

TANAGRA implements the standard HAC, but it implements also a variation of HAC called HYBRID CLUSTERING. Knowing that we need often a very few number of clusters, the construction of the low part of the tree is reserved for a fast method.

There are two steps in the new algorithm:
• First, a low-level clusters are built from fast clustering method such as K-MEANS, SOM;
• HAC starts form these clusters and builds the dendogram.

Note that any clustering algorithm can provide the low level clusters, users can also specify them. Last, rather than the tree itself, it is the gap between the nodes which is important, these values are provided in a table.

Keywords: clustering, unsupervised learning, HAC, K-Means
Components: HAC , K-Means, Group characterization
Tutorial: enHAC_IRIS.pdf
Dataset: iris_hac.bdm
L. Lebart, A. Morineau, M. Piron, " Statistique exploratoire multidimensionnelle ", Dunod, 2000 ; pp. 177 - 184.
Matteo Matteucci, "A tutorial on Clustering Algorithms - Hierarchical Clustering Algorithms".