Tuesday, June 20, 2017

k-means clustering (slides)

K-Means clustering is a popular cluster analysis method. It is simple and its implementation does not require to keep in memory all the dataset, thus making it possible to process very large databases.

This course material describes the algorithm. We focus on the different extensions such as the processing of qualitative or mixed variables, fuzzy c-means, and clustering of variables (clustering around latent variables). We note that the k-means method is relatively adaptable and can be applied to a wide range of problems.

Keywords: cluster analysis, clustering, unsupervised learning, partition method, relocation
Slides: K-Means clustering
References :
Wikipedia, "k-means clustering".
Wikipedia, "Fuzzy clustering".

Tuesday, June 13, 2017

Self-Organizing Map (slides)

A self-organizing map (SOM) or Kohonen network or Kohonen map is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, which preserves the topological properties of the input space (Wikipedia).

SOM is useful for the dimensionality reduction, data visualization and cluster analysis. In this course material, we outline the mechanisms underlying the approach. We focus on its practical aspects (e.g. various visualization possibilities, prediction on a new instance, extension of SOM to the clustering task,…).

Illustrative examples in R (kohonen package) and Tanagra are briefly presented.

Keywords: som, self organizing map, kohonen network, data visualization, dimensionality reduction, cluster analysis, clustering, hierarchical agglomerative clustering, hac, two-step clustering, R software, kohonen package
Components: KOHONEN-SOM
Slides: Kohonen SOM
Wikipedia, "Self-organizing map".

Saturday, June 10, 2017

Hierarchical agglomerative clustering (slides)

In data mining, cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters) (Wikipedia).

In this course material, we focus on the hierarchical agglomerative clustering (HAC). Beginning from the individuals which initially represents groups, the algorithms merge the groups in a bottom-up fashion until only the instances are gathered in only one group. The process is materialized by a dendrogram which allows to evaluate the nature of the solution and helps to determine the appropriate number of clusters.

Examples of analysis under R, Python and Tanagra are described.

Keywords: hac, cluster analysis, clustering, unsupervised learning, tandem analysis, two-step clustering, R software, hclust, python, scipy package
Components: HAC, K-MEANS
Slides: cah.pdf
Wikipedia, "Cluster analysis".
Wikipedia, "Hierarchical clustering".

Saturday, May 20, 2017

Support vector machine (slides)

In machine learning, support vector machines (SVM) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis (Wikipedia).

These slides show the background of the approach in the classification context. We address the binary classification problem, the soft-margin principle, the construction of the nonlinear classifiers by means of the kernel functions, the feature selection process, the multiclass SVM.

The presentation is complemented by the implementation of the approach under the open source software Python (Scikit-Learn), R (e1071) and Tanagra (SVM and C-SVC).

Keywords: svm, e1071 package, R software, Python, scikit-learn package, sklearn
Components: SVM, C-SVC
Slides: Support Vector Machine (SVM)
Dataset: svm exemples.xlsx
Abe S., "Support Vector Machines for Pattern Classification", Springer, 2010.

Thursday, January 5, 2017

Tanagra website statistics for 2016

The year 2016 ends, 2017 begins. I wish you all a very happy year 2017.

A small statistical report on the website statistics for the 2016. All sites (Tanagra, course materials, e-books, tutorials) has been visited 264,045 times this year, 721 visits per day.

Since February, the 1st, 2008, the date from which I installed the Google Analytics counter, there are 2,111,078 visits (649 daily visits).

Who are you? The majority of visits come from France and Maghreb. Then there are a large part of French speaking countries, notably because some pages are exclusively in French. In terms of non-francophone countries, we observe mainly the United States, India, UK, Brazil, Germany, ...

The pages containing course materials about Data Mining and R Programming are the most popular ones. This is not really surprising.

Happy New Year 2017 to all.

Slideshow: Website statistics for 2016