Saturday, November 8, 2008

K-Means algorithm on discrete attributes

In this tutorial, we show how to perform K-Means clustering. We validate the results by comparing the resulting clusters with a predefined classification.
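Comparing clusters with a predefined classification usually comes down to a cross tabulation of the two label vectors. The following is a minimal sketch in plain Python, not the tool's own implementation. The labels are made up for illustration (they are not taken from dr_vote.bdm), and the helper names `cross_tabulate` and `purity` are hypothetical.

```python
from collections import Counter

def cross_tabulate(clusters, classes):
    """Count how many observations fall in each (cluster, class) pair."""
    counts = Counter(zip(clusters, classes))
    rows = sorted(set(clusters))
    cols = sorted(set(classes))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

def purity(table):
    """Share of observations that belong to the majority class of their cluster."""
    total = sum(sum(row) for row in table)
    return sum(max(row) for row in table) / total

# Hypothetical labels, for illustration only.
clusters = [1, 1, 1, 2, 2, 2, 2, 1]
classes = ["A", "A", "A", "B", "B", "B", "A", "B"]

rows, cols, table = cross_tabulate(clusters, classes)
print("cluster", cols)
for r, row in zip(rows, table):
    print(r, row)
print("purity:", purity(table))
```

A table with most of its counts concentrated in one cell per row indicates that the clusters largely recover the predefined classes.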

We also address an additional problem in this tutorial: the descriptors are categorical, so we cannot run K-Means directly with the usual Euclidean distance. We propose a two-step approach: (1) transform the original dataset using a correspondence analysis; (2) run K-Means on the first X latent variables. Because these latent variables are continuous, the standard K-Means algorithm based on the Euclidean distance can be used in the second step.
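The two steps can be sketched with NumPy alone. This is an illustrative sketch under stated assumptions, not the implementation used by the software. The multiple correspondence analysis is computed as a correspondence analysis of the indicator (one-hot) matrix via an SVD. The K-Means is a plain Lloyd iteration, and it uses a deterministic farthest-point initialization (an assumption made here for reproducibility, where random starts are more usual). The toy data and the function names `indicator_matrix`, `mca_row_coordinates`, and `kmeans` are hypothetical.

```python
import numpy as np

def indicator_matrix(data):
    """One-hot encode a 2-D array of categorical values."""
    blocks = []
    for j in range(data.shape[1]):
        levels, codes = np.unique(data[:, j], return_inverse=True)
        blocks.append(np.eye(len(levels))[codes])
    return np.hstack(blocks)

def mca_row_coordinates(Z, n_factors):
    """Row principal coordinates from a correspondence analysis of Z."""
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sing, _ = np.linalg.svd(S, full_matrices=False)
    F = (U * sing) / np.sqrt(r)[:, None]     # principal coordinates
    return F[:, :n_factors]

def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point start."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Squared Euclidean distance of every point to every center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == g].mean(axis=0) if np.any(labels == g) else centers[g]
            for g in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Hypothetical categorical data: 6 observations, 4 yes/no descriptors.
data = np.array([
    ["y", "y", "y", "y"],
    ["y", "y", "y", "n"],
    ["y", "y", "n", "y"],
    ["n", "n", "n", "n"],
    ["n", "n", "n", "y"],
    ["n", "n", "y", "n"],
])

Z = indicator_matrix(data)                      # step 0: indicator coding
coords = mca_row_coordinates(Z, n_factors=2)    # step 1: latent variables
labels = kmeans(coords, k=2)                    # step 2: Euclidean K-Means
print(labels)
```

Keeping only the first factors discards part of the information in the original descriptors, so the number of factors retained is a choice to make when applying this approach.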

Keywords: clustering, k-means, correspondence analysis, cluster description
Components: Multiple Correspondence Analysis, K-Means, Group characterization, Cross Tabulation
Tutorial: en_dr_clustering_validation_externe.pdf
Dataset: dr_vote.bdm
References:
Wikipedia, "K-means algorithm".
Statsoft Inc., "Correspondence Analysis".