tag:blogger.com,1999:blog-54968157558613707992017-09-11T18:05:21.334+02:00Tanagra - Data Mining and Data Science Tutorials<br /><br />This web log maintains an alternative layout of the tutorials about Tanagra. Each entry briefly describes the subject, followed by links to the tutorial (pdf) and the dataset. The technical references (books, papers, websites, ...) are also provided. In some tutorials, we compare the results of Tanagra with those of other free software such as Knime, Orange, R, Python, Sipina or Weka.<br /><br />Tanagranoreply@blogger.comBlogger253125tag:blogger.com,1999:blog-5496815755861370799.post-87008239040902748212017-09-11T18:05:00.003+02:002017-09-11T18:05:21.343+02:00Association rule learning with ARS<br /><br />SIPINA is known for its decision tree induction algorithms. In fact, the distribution includes two other tools that are little known to the public: REGRESS, which specializes in multiple linear regression and which we described in one of our tutorials; and an association rule extraction tool, simply called Association Rule Software (ARS).<br /><br />In this tutorial, I describe the use of the ARS tool. Its main advantage is its tight integration with the Excel spreadsheet. We launch the software from Excel using the “sipina.xla” add-in, and we can easily retrieve the mined rules in the spreadsheet. We can then explore them using Excel's data handling capabilities. The ability to filter and sort rules according to different criteria is a great help in detecting interesting rules. 
This is a very important aspect because the profusion of rules can quickly overwhelm the data miner.<br /><br /><b>Keywords</b>: ARS, association rule software, excel spreadsheet, filtering and sorting rules, interestingness measures<br /><b>Components</b>: ASSOCIATION RULE SOFTWARE<br /><b>Tutorial</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Association_Sipina.pdf" target="_blank">en_Tanagra_Association_Sipina.pdf</a><br /><b>Dataset</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/market_basket.zip" target="_blank">market_basket.zip</a><br /><b>References</b>:<br />Tanagra Tutorial, "<a href="http://data-mining-tutorials.blogspot.fr/2014/08/association-rule-learning-slides.html" target="_blank">Association rule learning (slides)</a>", August 2014.<br /><br />Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-69414803694186422022017-08-25T09:23:00.002+02:002017-08-25T09:25:23.470+02:00Linear classifiers<br /><br />In this tutorial, we study the behavior of 5 linear classifiers on artificial data. Linear models are often the baseline approaches in supervised learning. Because they rely on a simple linear combination of the predictive variables, they have the advantage of simplicity: the influence of each descriptor is relatively easy to read (signs and values of the coefficients), and the learning techniques are often (though not always) fast, even on very large databases. We are interested in: (1) the naive Bayes classifier; (2) linear discriminant analysis; (3) logistic regression; (4) the perceptron (single-layer perceptron); (5) the support vector machine (linear SVM).<br /><br />The experiment was conducted under R. The source code accompanies this document. Beyond the topic of linear classifiers itself, my aim is also to describe the different stages in designing an experiment for the comparison of learning techniques. 
In addition, we show the results provided by the linear approaches implemented in various tools such as <span style="color: #38761d;">Tanagra</span>, <span style="color: #38761d;">Knime</span>, <span style="color: #38761d;">Orange</span>, <span style="color: #38761d;">Weka</span> and <span style="color: #38761d;">RapidMiner</span>.<br /><br /><b>Keywords</b>: linear classifier, naive bayes, linear discriminant analysis, logistic regression, perceptron, neural network, linear svm, support vector machine, decision tree, rpart, random forest, k-nn, nearest neighbors, e1071 package, nnet package, rf package, class package<br /><b>Components</b>: NAIVE BAYES CONTINUOUS, LINEAR DISCRIMINANT ANALYSIS, BINARY LOGISTIC REGRESSION, MULTILAYER PERCEPTRON, SVM<br /><b>Tutorial</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Linear_Classifier.pdf" target="_blank">en_Tanagra_Linear_Classifier.pdf</a><br /><b>Programs and dataset</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/linear_classifier.zip" target="_blank">linear_classifier.zip</a><br /><b>References</b>:<br />Wikipedia, "<a href="http://en.wikipedia.org/wiki/Linear_classifier" target="_blank">Linear Classifier</a>".<br /><br />Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-75621898200022760642017-08-18T13:46:00.003+02:002017-08-18T13:46:45.484+02:00Discriminant analysis and linear regression<br /><br />Linear discriminant analysis and linear regression are both supervised learning techniques. However, the first addresses classification problems, i.e. the target attribute is categorical; the second is used for regression problems, i.e. the target attribute is continuous (numeric).<br /><br />Nevertheless, there are strong connections between these approaches when we deal with a binary target attribute. Using a practical example, we describe the connections between the two approaches in this case. 
We detail the formulas for obtaining the coefficients of discriminant analysis from those of linear regression.<br /><br />We perform the calculations under Tanagra and R.<br /><br /><b>Keywords</b>: linear discriminant analysis, predictive discriminant analysis, multiple linear regression, wilks' lambda, mahalanobis distance, score function, linear classifier, sas, proc discrim, proc stepdisc<br /><b>Components</b>: LINEAR DISCRIMINANT ANALYSIS, MULTIPLE LINEAR REGRESSION<br /><b>Tutorial</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_LDA_and_Regression.pdf" target="_blank">en_Tanagra_LDA_and_Regression.pdf</a><br /><b>Programs and dataset</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/lda_regression.zip" target="_blank">lda_regression.zip</a><br /><b>References</b>:<br />C.J. Huberty, S. Olejnik, "Applied MANOVA and Discriminant Analysis", Wiley, 2006.<br />R. Tomassone, M. Danzart, J.J. Daudin, J.P. Masson, "Discrimination et Classement", Masson, 1988.<br /><br />Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-38688537210253574372017-08-11T22:20:00.003+02:002017-08-11T22:20:50.330+02:00Gradient boosting with R and Python<br /><br />This tutorial follows the course material devoted to gradient boosting, to which we refer constantly in this document. It also complements the course materials and tutorials for the Bagging, Random Forest and Boosting approaches (see References).<br /><br />The outline is simple: after importing the data, which are split in advance into two files (learning and testing), we build predictive models and evaluate them. The test error rate is used to compare the performance of the various classifiers.<br /><br />We study the question of parameters, which is particularly sensitive in the context of gradient boosting. Indeed, there are many parameters, and their influence on the behavior of the classifier is considerable. 
Unfortunately, although we can guess which directions to explore to improve the quality of the models (more or less regularization), accurately identifying the parameters to modify and setting the right values is difficult, especially because the parameters can interact with each other. Here, more than for other machine learning methods, the trial-and-error strategy plays an important role.<br /><br />We use R and Python with their appropriate packages.<br /><br /><b>Keywords</b>: gradient boosting, R software, decision tree, adabag package, rpart, xgboost, gbm, mboost, Python, scikit-learn package, gridsearchcv, boosting, random forest<br /><b>Tutorial</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Gradient_Boosting.pdf" target="_blank">Gradient boosting</a><br /><b>Programs and datasets</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/gradient_boosting.zip" target="_blank">gradient_boosting.zip</a><br /><b>References</b>:<br />Tanagra tutorial, "<a href="http://data-mining-tutorials.blogspot.fr/2016/06/gradient-boosting-slides.html" target="_blank">Gradient boosting - Slides</a>", June 2016.<br />Tanagra tutorial, "<a href="http://data-mining-tutorials.blogspot.fr/2015/12/bagging-random-forest-boosting-slides.html" target="_blank">Bagging, Random Forest, Boosting - Slides</a>", December 2015.<br />Tanagra tutorial, "<a href="http://data-mining-tutorials.blogspot.fr/2015/12/random-forest-boosting-with-r-and-python.html" target="_blank">Random Forest and Boosting with R and Python</a>", December 2015.<br /><br />Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-48906884564195295492017-08-04T07:30:00.001+02:002017-08-04T07:30:15.692+02:00Statistical analysis with Gnumeric<br /><br />The spreadsheet is a valuable tool for the data scientist. The annual KDnuggets polls of recent years confirm this: the Excel spreadsheet is always well placed. 
In France, this popularity is largely confirmed by its almost systematic presence in job postings related to data processing (statistics, data mining, data science, big data/data analytics, etc.). Excel is specifically mentioned, but this success should be viewed as an acknowledgment of the skills and capabilities of spreadsheet tools in general.<br /><br />This tutorial is devoted to the <a href="http://www.gnumeric.org/" target="_blank">Gnumeric</a> Spreadsheet 1.12.12. It has interesting features: the setup and installation programs are small because it is not part of an office suite; it is fast and lightweight; it is dedicated to numerical computation and natively incorporates a "statistics" menu with the common statistical procedures (parametric tests, non-parametric tests, regression, principal component analysis, etc.); and it seems more accurate than some popular spreadsheet programs. These last two points caught my attention and convinced me to study it in more detail. In the following, we give a quick overview of Gnumeric's statistical procedures. 
Where possible, we compare the results with those of <span style="color: #38761d;"><b>Tanagra 1.4.50</b></span>.<br /><br /><b>Keywords</b>: gnumeric, spreadsheet, descriptive statistics, principal component analysis, pca, multiple linear regression, wilcoxon signed rank test, welch test unequal variance, mann and whitney, analysis of variance, anova<br /><b>Tanagra components</b>: MORE UNIVARIATE CONT STAT, PRINCIPAL COMPONENT ANALYSIS, MULTIPLE LINEAR REGRESSION, WILCOXON SIGNED RANKS TEST, T-TEST UNEQUAL VARIANCE, MANN-WHITNEY COMPARISON, ONE-WAY ANOVA<br /><b>Tutorial</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Gnumeric.pdf" target="_blank">en_Tanagra_Gnumeric.pdf</a><br /><b>Dataset</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/credit_approval.zip" target="_blank">credit_approval.zip</a><br /><b>References</b>:<br />Gnumeric, "<a href="https://help.gnome.org/users/gnumeric/stable/gnumeric.html" target="_blank">The Gnumeric Manual</a>, version 1.12".<br /><br />Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-61469530717953079632017-08-02T07:36:00.001+02:002017-08-02T07:36:55.426+02:00Failure resolved<br /><br />Hi,<br /><br />It seems that the failure was resolved yesterday (August 1st, 2017).<br /><br />Again, sorry for the inconvenience. I hope that the continuity of service will be ensured throughout the summer.<br /><br />Kind regards,<br /><br />Ricco (August 2nd, 2017).<br /><br />Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-86998554443287739352017-07-27T23:50:00.003+02:002017-07-28T07:22:11.193+02:00File server outage<br /><br />For a few days (since approximately 07/24/2017), the server of the Eric laboratory that hosts the Tanagra project files (software, books, course materials, tutorials...) has been down. After a power outage, there is nobody to restart the server during the summer period. 
And the server is located in a room to which I do not have access.<br /><br />So we wait. And it will take some time: the summer break lasts a month, and our University (and Lab) officially reopens on August 21st! I am sorry for the users who work from the documents that I put online. This difficulty is totally beyond my control and I cannot do anything about it.<br /><br />Some internet users have reported the problem to me, so I am taking the initiative to inform you. As soon as the situation is back in order, I will let you know.<br /><br />Kind regards,<br /><br />Ricco (July 27th, 2017).<br /><br />Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-10291448189759733192017-07-22T18:28:00.003+02:002017-07-22T18:28:54.401+02:00Interpreting cluster analysis results<br /><br />Interpretation of the clustering structure and the clusters is an essential step in unsupervised learning. Identifying the characteristics that underlie the differentiation between groups helps to establish their credibility.<br /><br />In this course material, we explore univariate and multivariate techniques. The former are easy to calculate and read, but do not take into account the joint effect of the variables. 
The latter are a priori more efficient, but require additional expertise to fully understand the results.<br /><br /><b>Keywords:</b> cluster analysis, clustering, unsupervised learning, percentage of variance explained, V-Test, test value, distance between centroids, correlation ratio<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/en/classif_interpretation.pdf" target="_blank">Characterizing the clusters</a><br /><b>References</b>:<br />Tanagra Tutorial, "<a href="http://data-mining-tutorials.blogspot.fr/2009/05/understanding-test-value-criterion.html" target="_blank">Understanding the 'test value' criterion</a>", May 2009.<br />Tanagra Tutorial, "<a href="http://data-mining-tutorials.blogspot.fr/2017/06/hierarchical-agglomerative-clustering.html" target="_blank">Hierarchical agglomerative clustering</a>", June 2017.<br />Tanagra Tutorial, "<a href="http://data-mining-tutorials.blogspot.fr/2017/06/k-means-clustering-slides.html" target="_blank">K-Means clustering</a>", June 2017.<br /><br />Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-33036035237972856892017-07-14T09:59:00.003+02:002017-07-14T10:02:17.744+02:00Kohonen map with R<br /><br />This tutorial complements the course material concerning the Kohonen map or Self-organizing map (<a href="http://data-mining-tutorials.blogspot.fr/2017/06/self-organizing-map-slides.html" target="_blank">June 2017</a>). First, we try to highlight two important aspects of the approach: its ability to summarize the available information in a two-dimensional space; and its combination with a cluster analysis method, which links the topological representation (and the reading that one can make of it) to the interpretation of the groups obtained from the clustering algorithm. We use the R software and the “kohonen” package (Wehrens and Buydens, 2007). Second, we carry out a comparative study of the quality of the partitioning against the one obtained with the K-means algorithm. 
We use an external evaluation, i.e. we compare the clustering results with pre-established classes. This procedure is often used in research to evaluate the performance of clustering methods. It is meaningful when applied to artificial data where the true class membership is known. We use the K-Means and Kohonen-Som components of Tanagra.<br /><br />This tutorial is based on Shane Lynn's article on the R-bloggers website (Lynn, 2014). I extended it by introducing the intermediate calculations, to better understand the meaning of the charts, and by conducting the comparative study.<br /><br /><b>Keywords:</b> som, self organizing map, kohonen network, data visualization, dimensionality reduction, cluster analysis, clustering, hierarchical agglomerative clustering, hac, two-step clustering, R software, kohonen package, k-means, external evaluation, heatmaps<br /><b>Components</b>: KOHONEN-SOM<br /><b>Tutorial</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Kohonen_SOM_R.pdf" target="_blank">Kohonen map with R</a><br /><b>Program and dataset</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/waveform_som.zip" target="_blank">waveform - som</a><br /><b>References</b>:<br />Tanagra tutorial, "<a href="http://data-mining-tutorials.blogspot.fr/2017/06/self-organizing-map-slides.html" target="_blank">Self-organizing map (slides)</a>", June 2017.<br />Tanagra Tutorial, "<a href="http://data-mining-tutorials.blogspot.fr/2009/07/self-organizing-map-som.html" target="_blank">Self-organizing map (with Tanagra)</a>", July 2009.<br /><br />Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-42254427802299636532017-07-08T20:17:00.002+02:002017-07-08T20:18:32.193+02:00Cluster analysis with Python - HAC and K-Means<br /><br />This tutorial describes a cluster analysis process. We deal with a set of cheeses (29 instances) characterized by their nutritional properties (9 variables). 
The aim is to identify homogeneous groups of cheeses based on their properties. We inspect and test two approaches using two Python procedures: the Hierarchical Agglomerative Clustering algorithm (<span style="color: #38761d;"><b>SciPy</b></span> package); and the K-Means algorithm (<b><span style="color: #38761d;">scikit-learn</span></b> package).<br /><br />One of the contributions of this tutorial is that we previously conducted the same analysis with R, following the same steps. We can thus compare the commands used and the results provided by the available procedures. We observe that these tools behave comparably and are interchangeable in this context.<br /><div><br /></div><div><b>Keywords</b>: python, scipy, scikit-learn, cluster analysis, clustering, hac, hierarchical agglomerative clustering, k-means, principal component analysis, PCA<br /><b>Tutorial</b>: <a href="https://eric.univ-lyon2.fr/~ricco/cours/didacticiels/Python/en/cah_kmeans_avec_python.pdf" target="_blank">hac and k-means with Python</a><br /><b>Dataset and source code</b>: <a href="https://eric.univ-lyon2.fr/~ricco/cours/didacticiels/Python/en/cah_kmeans_avec_python.zip" target="_blank">hac_kmeans_with_python.zip</a><br /><b>References</b>:<br />Marie Chavent, <a href="http://www.math.u-bordeaux1.fr/~machaven/teaching/" target="_blank">Teaching</a> Page, University of Bordeaux.</div><div>Tanagra Tutorials, "<a href="http://data-mining-tutorials.blogspot.fr/2017/07/cluster-analysis-with-r-hac-and-k-means.html" target="_blank">Cluster analysis with R - HAC and K-Means</a>", July 2017.</div><br />Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-46592751084293977912017-07-06T17:16:00.001+02:002017-07-06T17:16:54.603+02:00Cluster analysis with R - HAC and K-Means<br /><br />This tutorial describes a cluster analysis process. We deal with a set of cheeses (29 instances) characterized by their nutritional properties (9 variables). 
The aim is to identify homogeneous groups of cheeses based on their properties.<br /><br />We inspect and test two approaches using two procedures of the R software: the Hierarchical Agglomerative Clustering algorithm (hclust); and the K-Means algorithm (kmeans).<br /><br />The data file "fromage.txt" comes from the teaching page of Marie Chavent from the University of Bordeaux. The excellent course materials and corrected exercises (commented R code) available on her website complement this tutorial, which is intended primarily as a simple guide to using the R software for cluster analysis.<br /><br /><b>Keywords</b>: R software, cluster analysis, clustering, hac, hierarchical agglomerative clustering, k-means, fpc package, principal component analysis, PCA<br /><b>Components</b>: hclust, kmeans, kmeansruns<br /><b>Tutorial</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/didacticiels/R/en/cah_kmeans_avec_r.pdf" target="_blank">hac and k-means with R</a><br /><b>Dataset and source code</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/didacticiels/R/en/cah_kmeans_avec_r.zip" target="_blank">hac_kmeans_with_r.zip</a><br /><b>References</b>:<br />Marie Chavent, <a href="http://www.math.u-bordeaux1.fr/~machaven/teaching/" target="_blank">Teaching</a> Page, University of Bordeaux.<br /><br />Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-56774951359059998222017-07-03T22:35:00.003+02:002017-07-03T22:35:30.296+02:00k-medoids clustering (slides)<br /><br />K-medoids is a partitioning-based clustering algorithm. It is related to k-means but, instead of using the centroid as the reference point for the cluster, we use the medoid, i.e. the individual nearest to all the other points within its cluster. One of the main consequences of this approach is that the resulting partition is less sensitive to outliers.<br /><br />This course material describes the algorithm. 
Then, we focus on the silhouette tool, which can be used to determine the right number of clusters, a recurring open problem in cluster analysis.<br /><br /><b>Keywords:</b> cluster analysis, clustering, unsupervised learning, partitioning method, relocation approach, medoid, PAM, partitioning around medoids, CLARA, clustering large applications, silhouette, silhouette plot<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/en/classif_k_medoides.pdf" target="_blank">Cluster analysis - k-medoids algorithm</a><br /><b>References</b>:<br />Wikipedia, "<a href="https://en.wikipedia.org/wiki/K-medoids" target="_blank">k-medoids</a>".<br /><br />Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-87546993107143521802017-06-20T19:25:00.003+02:002017-06-20T19:25:43.090+02:00k-means clustering (slides)<br /><br />K-Means clustering is a popular cluster analysis method. It is simple, and its implementation does not require keeping the whole dataset in memory, which makes it possible to process very large databases.<br /><br />This course material describes the algorithm. We focus on the different extensions such as the processing of qualitative or mixed variables, fuzzy c-means, and clustering of variables (clustering around latent variables). 
We note that the k-means method is relatively adaptable and can be applied to a wide range of problems.<br /><br /><b>Keywords:</b> cluster analysis, clustering, unsupervised learning, partition method, relocation<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/en/classif_centres_mobiles.pdf" target="_blank">K-Means clustering</a><br /><b>References</b>:<br />Wikipedia, "<a href="https://en.wikipedia.org/wiki/K-means_clustering" target="_blank">k-means clustering</a>".<br />Wikipedia, "<a href="https://en.wikipedia.org/wiki/Fuzzy_clustering" target="_blank">Fuzzy clustering</a>".<br /><br />Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-17664535886088080022017-06-13T16:03:00.003+02:002017-06-20T06:35:10.924+02:00Self-Organizing Map (slides)<br /><br />A self-organizing map (SOM) or Kohonen network or Kohonen map is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, which preserves the topological properties of the input space (<a href="https://en.wikipedia.org/wiki/Self-organizing_map" target="_blank">Wikipedia</a>).<br /><br />SOM is useful for dimensionality reduction, data visualization and cluster analysis. In this course material, we outline the mechanisms underlying the approach. We focus on its practical aspects (e.g. 
various visualization possibilities, prediction on a new instance, extension of SOM to the clustering task, etc.).<br /><br />Illustrative examples in <b><span style="color: #38761d;">R</span></b> (kohonen package) and <span style="color: #38761d;"><b>Tanagra</b></span> are briefly presented.<br /><br /><b>Keywords:</b> som, self organizing map, kohonen network, data visualization, dimensionality reduction, cluster analysis, clustering, hierarchical agglomerative clustering, hac, two-step clustering, R software, kohonen package<br /><b>Components</b>: KOHONEN-SOM<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/en/kohonen_som.pdf" target="_blank">Kohonen SOM</a><br /><b>References</b>:<br />Wikipedia, "<a href="https://en.wikipedia.org/wiki/Self-organizing_map" target="_blank">Self-organizing map</a>".<br /><br />Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-59681706494312581842017-06-10T19:12:00.005+02:002017-06-20T06:35:24.296+02:00Hierarchical agglomerative clustering (slides)<br /><br />In data mining, cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters) (<a href="https://en.wikipedia.org/wiki/Cluster_analysis" target="_blank">Wikipedia</a>).<br /><br />In this course material, we focus on hierarchical agglomerative clustering (HAC). Starting from the individuals, each of which initially forms its own group, the algorithm merges the groups in a bottom-up fashion until all the instances are gathered in a single group. 
The process is represented by a dendrogram, which makes it possible to evaluate the nature of the solution and helps to determine the appropriate number of clusters.<br /><br />Examples of analysis under <b><span style="color: #38761d;">R</span></b>, <span style="color: #38761d;"><b>Python</b></span> and <span style="color: #38761d;"><b>Tanagra</b></span> are described.<br /><br /><b>Keywords:</b> hac, cluster analysis, clustering, unsupervised learning, tandem analysis, two-step clustering, R software, hclust, python, scipy package<br /><b>Components:</b> HAC, K-MEANS<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/en/cah.pdf" target="_blank">cah.pdf</a><br /><b>References</b>:<br />Wikipedia, "<a href="https://en.wikipedia.org/wiki/Cluster_analysis" target="_blank">Cluster analysis</a>".<br />Wikipedia, "<a href="https://en.wikipedia.org/wiki/Hierarchical_clustering" target="_blank">Hierarchical clustering</a>".<br /><br />Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-770397190471405512017-05-20T08:44:00.001+02:002017-06-20T06:35:36.158+02:00Support vector machine (slides)<br /><br />In machine learning, support vector machines (SVM) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis (<a href="https://en.wikipedia.org/wiki/Support_vector_machine" target="_blank">Wikipedia</a>).<br /><br />These slides present the background of the approach in the classification context. 
We address the binary classification problem, the soft-margin principle, the construction of nonlinear classifiers by means of kernel functions, the feature selection process, and multiclass SVM.<br /><br />The presentation is complemented by the implementation of the approach under the open source software Python (Scikit-Learn), R (e1071) and Tanagra (SVM and C-SVC).<br /><br /><b>Keywords</b>: svm, e1071 package, R software, Python, scikit-learn package, sklearn<br /><b>Components</b>: SVM, C-SVC<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/en/svm.pdf" target="_blank">Support Vector Machine (SVM)</a><br /><b>Dataset:</b> <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/fichiers/svm%20exemples.xlsx" target="_blank">svm exemples.xlsx</a><br /><b>References</b>:<br />Abe S., "Support Vector Machines for Pattern Classification", Springer, 2010.<br /><br />Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-45080131583644140482017-01-05T18:40:00.000+01:002017-01-05T18:40:34.203+01:00Tanagra website statistics for 2016<br /><br />The year 2016 ends, 2017 begins. I wish you all a very happy year 2017.<br /><br />Here is a short report on the website statistics for 2016. All sites combined (Tanagra, course materials, e-books, tutorials) were visited 264,045 times this year, i.e. <span style="color: #6aa84f;"><span style="color: #38761d;"><b>721 visits per day</b></span></span>.<br /><br />Since February 1st, 2008, the date on which I installed the Google Analytics counter, there have been 2,111,078 visits (649 visits per day).<br /><br />Who are you? The majority of visits come from France and the Maghreb. Other French-speaking countries also account for a large share, notably because some pages are exclusively in French. Among non-francophone countries, we observe mainly the United States, India, the UK, Brazil, Germany, ...<br /><br />The pages containing course materials about Data Mining and R Programming are the most popular ones. 
This is not really surprising.<br /><br />Happy New Year 2017 to all.<br /><br />Ricco.<br /><span style="font-weight: bold;">Slideshow</span>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Frequentation_2016.pdf" target="_blank">Website statistics for 2016</a><br /><br />Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-56237610640481938882016-09-17T11:35:00.003+02:002017-01-17T10:05:26.023+01:00Text mining - Document classification<br /><br />The statistical approach to text mining consists in transforming a collection of text documents into a matrix of numeric values to which we can apply machine learning algorithms.<br /><br />The "unstructured document" designation is often used when one talks about text documents. This does not mean that such documents lack any organization (titles, chapters, paragraphs, questions and answers, etc.). It means above all that we cannot directly express the collection in the form of a data table of the kind usually handled in data mining. To obtain this kind of data representation, a preprocessing phase is needed; we then extract relevant features to define the data table. These steps can heavily influence the relevance of the results.<br /><br />In this tutorial, I take up an exercise that I do with my students in my text mining course at the University. We perform the whole analysis under R with dedicated text mining packages such as “XML” or “tm”. The goal here is to reproduce exactly the same study using other tools such as <a href="https://www.knime.org/" target="_blank">Knime</a> 2.9.1 or <a href="https://rapidminer.com/products/studio/" target="_blank">RapidMiner</a> 5.3 (<span style="color: #666666;"><i><u>Note</u>: these are the versions available when I wrote the French version of this tutorial in April 2014</i></span>). 
We will see that these tools provide specialized libraries which make it possible to perform a statistical text mining process efficiently.<br /><br /><b>Keywords</b>: text mining, document classification, text categorization, decision tree, j48, linear svm, <a href="http://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection" target="_blank">reuters</a> collection, XML format, stemming, stopwords, document-term matrix<br /><b>Tutorial</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Text_Mining.pdf" target="_blank">en_Tanagra_Text_Mining.pdf</a><br /><b>Dataset</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/text_mining_tutorial.zip" target="_blank">text_mining_tutorial.zip</a><br /><b>References</b>:<br />Wikipedia, "<a href="http://en.wikipedia.org/wiki/Text_categorization" target="_blank">Document classification</a>".<br />S. Weiss, N. Indurkhya, T. Zhang, "Fundamentals of Predictive Text Mining", Springer, 2010.<br /><br />Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-67173768668576039082016-06-25T06:25:00.002+02:002016-06-25T06:25:27.600+02:00Image classification with Knime<br /><br />The aim of image mining is to extract valuable knowledge from image data. In the context of supervised image classification, we want to automatically assign a label to images based on their visual content. The whole process is identical to the standard data mining process. We learn a classifier from a set of classified images. Then, we can apply the classifier to a new image in order to predict its class membership. The particularity is that we must extract a vector of numerical features from the image before launching the machine learning algorithm, and before applying the classifier in the deployment phase.<br /><br />We deal with an image classification task in this tutorial. The goal is to automatically detect the images which contain a car. 
The main result is that, even though I have only basic knowledge of image processing, I was able to carry out the analysis with an ease that reflects the usability of Knime in this context.<br /><br /><b>Keywords</b>: image mining, image classification, image processing, feature extraction, decision tree, random forest, knime<br /><b>Tutorial</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Image_Mining_Knime.pdf" target="_blank">en_Tanagra_Image_Mining_Knime.pdf</a><br /><b>Dataset and program (Knime archive)</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/Tuto_Image_Mining.zip" target="_blank">image mining tutorial</a><br /><b>References</b>:<br />Knime Image Processing, <a href="https://tech.knime.org/community/image-processing" target="_blank">https://tech.knime.org/community/image-processing</a><br />S. Agarwal, A. Awan, D. Roth, « UIUC Image Database for Car Detection » ; <a href="https://cogcomp.cs.illinois.edu/Data/Car/" target="_blank">https://cogcomp.cs.illinois.edu/Data/Car/</a>Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-63499156579086674062016-06-19T15:24:00.000+02:002016-06-19T15:25:32.321+02:00Gradient boosting (slides)Gradient boosting is an ensemble method that generalizes boosting by allowing the use of other loss functions ("standard" boosting implicitly uses an exponential loss function).<br /><br />These slides show the ins and outs of the method. Gradient boosting for regression is detailed first. The classification problem is presented afterwards.<br /><br />The implementations available in R and Python packages are then studied.<br /><br /><b>Keywords</b>: boosting, regression tree, package gbm, package mboost, package xgboost, R, Python, package scikit-learn, sklearn<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/en/gradient_boosting.pdf" target="_blank">Gradient Boosting</a><br /><b>References</b>:<br />R. 
Rakotomalala, "<a href="http://data-mining-tutorials.blogspot.fr/2015/12/bagging-random-forest-boosting-slides.html" target="_blank">Bagging, Random Forest, Boosting</a>", December 2015.<br />Natekin A., Knoll A., "<a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3885826/" target="_blank">Gradient boosting machines, a tutorial</a>", in <i>Frontiers in Neurorobotics</i>, December 2013. Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-28726956402920504582016-06-13T17:22:00.005+02:002016-06-13T17:22:59.083+02:00Tanagra and Sipina add-ins for Excel 2016The add-ins “tanagra.xla” and “sipina.xla” contribute greatly to the popularity of the Tanagra and Sipina software applications. They incorporate menus dedicated to data mining in Excel. They implement a simple bridge between the data in the spreadsheet and Tanagra or Sipina.<br /><br />I developed and tested the latest versions of the add-ins for Excel 2007 and 2010. I recently gained access to Excel 2016. I checked the add-ins. 
The conclusion is that the tools work without a hitch.<br /><br /><span style="font-weight: bold;">Keywords</span>: data importation, excel data file, add-in, add-on, xls, xlsx<br /><span style="font-weight: bold;">Tutorial</span>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Add_In_Excel_2016.pdf" target="_blank">en_Tanagra_Add_In_Excel_2016.pdf</a><br /><b>References</b>:<br />Tanagra, "<a href="http://data-mining-tutorials.blogspot.fr/2010/08/tanagra-add-in-for-office-2007-and.html" target="_blank">Tanagra add-in for Excel 2007 and 2010</a>", August 2010.<br />Tanagra, "<a href="http://data-mining-tutorials.blogspot.fr/2016/06/sipina-add-in-for-excel-2007-and-2010.html" target="_blank">Sipina add-in for Excel 2007 and 2010</a>", June 2016.Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-79086815304508916482016-06-12T11:36:00.002+02:002016-06-13T17:24:48.701+02:00Sipina add-in for Excel 2007 and 2010SIPINA is a data mining software package which implements various supervised learning paradigms. It is an old tool, but it is still used because it is the only free tool that provides fully functional interactive decision tree capabilities.<br /><br />This tutorial briefly describes the installation and the use of the add-in "sipina.xla" in Excel 2007. The approach carries over easily to Excel 2010. A similar document exists for Tanagra. It nevertheless seemed necessary to me to clarify the procedure, especially since several users have asked for it. Other tutorials exist for earlier versions of Excel (1997-2003) and for Calc (Libre Office and Open Office).<br /><br />A new tutorial will come soon. 
It shows that the add-in also works properly under <a href="http://data-mining-tutorials.blogspot.fr/2016/06/tanagra-and-sipina-add-ins-for-excel.html" target="_blank">Excel 2016</a>.<br /><br /><span style="font-weight: bold;">Keywords</span>: data importation, excel data file, add-in, add-on, xls, xlsx<br /><span style="font-weight: bold;">Tutorial</span>: <a href="http://eric.univ-lyon2.fr/~ricco/doc/en_sipina_excel_addin.pdf" target="_blank">en_sipina_excel_addin.pdf</a><br /><span style="font-weight: bold;">Dataset</span>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/heart.xls" target="_blank">heart.xls</a><br /><span style="font-weight: bold;">References</span>:<br />Tanagra, "<a href="http://data-mining-tutorials.blogspot.fr/2010/08/tanagra-add-in-for-office-2007-and.html" target="_blank">Tanagra add-in for Office 2007 and Office 2010</a>", August 2010.<br />Tanagra, "<a href="http://data-mining-tutorials.blogspot.fr/2016/06/tanagra-and-sipina-add-ins-for-excel.html" target="_blank">Tanagra and Sipina add-ins for Excel 2016</a>", June 2016. Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-25421025910113694012016-04-03T08:46:00.002+02:002016-04-03T08:56:24.870+02:00Categorical predictors in logistic regressionThe aim of logistic regression is to build a model for predicting a binary target attribute from a set of explanatory variables (predictors, independent variables), which may be numeric or categorical. Numeric variables are used as they are. Categorical variables must be recoded. Dummy coding is undeniably the most popular approach in this context.<br /><br /><span style="color: #d2744c;"><b>The situation becomes more complicated when we perform a feature selection</b></span>. The idea is to determine the predictors that contribute significantly to the explanation of the target attribute. There is no problem when we consider a numeric variable. 
It is either excluded from or kept in the model. But how should we proceed when we handle a categorical explanatory variable? Should we treat the dichotomous variables associated with a categorical predictor as a whole that we must exclude from or include in the model? Or should we treat each dichotomous variable independently? How should we interpret the coefficients of the selected dichotomous variables in this case?<br /><br />In this tutorial, we study the approaches proposed by various tools: <span style="color: #38761d;"><b>R 3.1.2</b></span>, <span style="color: #38761d;"><b>SAS 9.3</b></span>, <span style="color: #38761d;"><b>Tanagra 1.4.50</b></span> and <span style="color: #38761d;"><b>SPAD 8.0</b></span>. We will see that the feature selection algorithms rely on criteria that differ from one tool to another. We will also see that they use different approaches in the presence of categorical predictor variables.<br /><br /><b>Keywords</b>: logistic regression, dummy coding, categorical predictor variables, feature selection<br /><b>Components</b>: O_1_BINARIZE, BINARY LOGISTIC REGRESSION, BACKWARD-LOGIT<br /><b>Tutorial</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Categorical_Selection_Log_Reg.pdf" target="_blank">Feature selection - Categorical predictors - Logistic Regression</a><br /><b>Dataset</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/heart-c.xlsx" target="_blank">heart-c.xlsx</a><b> </b><br /><b>References</b>:<br />Wikipedia, "<a href="http://en.wikipedia.org/wiki/Logistic_regression" target="_blank">Logistic Regression</a>" Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-63873325751655162062016-03-31T19:06:00.003+02:002016-04-03T08:50:42.788+02:00Dummy coding for categorical predictor variablesIn this tutorial, we show how to perform a dummy coding for categorical predictor variables in the context of the logistic regression learning process.<br /><br />In fact, this is an 
old tutorial that I wrote a long time ago (2007), but it is not referenced in this blog (which was created in 2008). I found it in my archives because I plan to write a tutorial soon about strategies for selecting categorical variables in logistic regression. I was wondering whether I had already written something in the past that could be linked to this subject (the treatment of categorical predictors in logistic regression). Obviously, I should check my archives more often.<br /><br />We use <span style="color: #38761d;"><b>Tanagra 1.4.50</b></span> in this tutorial.<br /><br /><b>Keywords</b>: logistic regression, dummy coding, categorical predictor variables <br /><b>Components</b>: SAMPLING, O_1_BINARIZE, BINARY LOGISTIC REGRESSION, TEST<br /><b>Tutorial</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Dummy_Coding_for_Logistic_Regression.pdf" target="_blank">Dummy coding - Logistic Regression</a><br /><b>Dataset</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/heart-c.xlsx" target="_blank">heart-c.xlsx</a><b> </b><br /><b>References</b>:<br />Wikipedia, "<a href="http://en.wikipedia.org/wiki/Logistic_regression" target="_blank">Logistic Regression</a>" Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-71807692234652249632016-03-13T17:22:00.001+01:002016-03-31T19:08:12.691+02:00Cost-Sensitive Learning (slides)This course material presents approaches for taking misclassification costs into account in supervised learning. The baseline method is the one that ignores the costs.<br /><br />Two issues are studied: the metric used to evaluate the classifier when a misclassification cost matrix is provided, i.e. 
the expected cost of misclassification (ECM); and some approaches that guide the machine learning algorithm towards the minimization of the ECM.<br /><br /><b>Keywords</b>: cost matrix, misclassification, expected cost of misclassification, bagging, metacost, multicost<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/en/couts_en_apprentissage_supervise.pdf" target="_blank">Cost Sensitive Learning</a><br /><b>References</b>:<br />Tanagra Tutorial, "<a href="http://data-mining-tutorials.blogspot.fr/2009/03/cost-sensitive-learning-comparison-of.html" target="_blank">Cost-sensitive learning - Comparison of tools</a>", March 2009.<br />Tanagra Tutorial, "<a href="http://data-mining-tutorials.blogspot.fr/2008/11/cost-sensitive-decision-trees.html" target="_blank">Cost-sensitive decision tree</a>", November 2008.Tanagranoreply@blogger.com
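As a rough illustration of the expected cost of misclassification (ECM) mentioned in the cost-sensitive learning entry, here is a minimal Python sketch. It assumes the usual definition, the average over all instances of the cost attached to each (true class, predicted class) pair, and it assumes rows index the true class and columns the predicted class; the slides may use a different orientation. The function name expected_cost and the numbers are hypothetical, chosen only for the example.

```python
def expected_cost(confusion, costs):
    """Average misclassification cost per instance.

    confusion[i][j]: number of instances of true class i predicted as class j
    costs[i][j]:     cost of predicting class j when the true class is i
    (row = true class, column = predicted class is an assumption of this sketch)
    """
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    # Weight each cell count by its cost, then average over all instances.
    weighted = sum(
        n * c
        for conf_row, cost_row in zip(confusion, costs)
        for n, c in zip(conf_row, cost_row)
    )
    return weighted / total


# Hypothetical two-class example: missing a positive (row 1, column 0)
# is assumed to cost five times more than a false alarm (row 0, column 1).
confusion = [[40, 10],
             [5, 45]]
costs = [[0, 1],
         [5, 0]]
ecm = expected_cost(confusion, costs)
print(ecm)
```

With a zero diagonal and unit off-diagonal costs, this quantity reduces to the ordinary error rate, which corresponds to the baseline situation where costs are ignored.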