tag:blogger.com,1999:blog-54968157558613707992017-02-20T20:20:29.838+01:00Tanagra - Data Mining and Data Science TutorialsThis Web log provides an alternative layout of the tutorials about Tanagra. Each entry briefly describes the subject and is followed by the link to the tutorial (PDF) and the dataset. The technical references (books, papers, websites, ...) are also provided. In some tutorials, we compare the results of Tanagra with other free software such as Knime, Orange, R software, Python, Sipina or Weka.Tanagranoreply@blogger.comBlogger237125tag:blogger.com,1999:blog-5496815755861370799.post-45080131583644140482017-01-05T18:40:00.000+01:002017-01-05T18:40:34.203+01:00Tanagra website statistics for 2016The year 2016 ends, 2017 begins. I wish you all a very happy year 2017.<br /><br />A small statistical report on the website statistics for 2016. All sites (Tanagra, course materials, e-books, tutorials) have been visited 264,045 times this year, <span style="color: #6aa84f;"><span style="color: #38761d;"><b>721 visits per day</b></span></span>.<br /><br />Since February 1st, 2008, when I installed the Google Analytics counter, there have been 2,111,078 visits (649 daily visits).<br /><br />Who are you? The majority of visits come from France and the Maghreb. They are followed by a large proportion of French-speaking countries, notably because some pages are exclusively in French. Among the non-francophone countries, we observe mainly the United States, India, UK, Brazil, Germany, ...<br /><br />The pages containing course materials about Data Mining and R Programming are the most popular ones.
This is not really surprising.<br /><br />Happy New Year 2017 to all.<br /><br />Ricco.<br /><span style="font-weight: bold;">Slideshow</span>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Frequentation_2016.pdf" target="_blank">Website statistics for 2016</a> Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-56237610640481938882016-09-17T11:35:00.003+02:002017-01-17T10:05:26.023+01:00Text mining - Document classificationThe statistical approach to text mining consists in transforming a collection of text documents into a matrix of numeric values on which we can apply machine learning algorithms.<br /><br />The "unstructured document" designation is often used when one talks about text documents. This does not mean that such a document has no organization at all (titles, chapters, paragraphs, questions and answers, etc.). It means above all that we cannot directly express the collection in the form of the data table usually handled in data mining. To obtain this kind of data representation, a preprocessing phase is needed; we then extract relevant features to define the data table. These steps can heavily influence the relevance of the results.<br /><br />In this tutorial, I take an exercise that I run with my students for my text mining course at the University. We perform the whole analysis under R with dedicated packages for text mining such as “XML” or “tm”. The aim here is to reproduce exactly the same study using other tools such as <a href="https://www.knime.org/" target="_blank">Knime</a> 2.9.1 or <a href="https://rapidminer.com/products/studio/" target="_blank">RapidMiner</a> 5.3 (<span style="color: #666666;"><i><u>Note</u>: these are the versions available when I wrote the French version of this tutorial in April 2014</i></span>).
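As a rough, standard-library-only sketch of the preprocessing described above (the tutorial itself relies on R's “tm” package, Knime and RapidMiner), a document-term matrix can be built as follows — the stopword list here is purely illustrative:

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "in", "and", "to", "is"}  # tiny illustrative list

def tokenize(document):
    """Lowercase, strip punctuation, keep alphabetic tokens, drop stopwords."""
    words = [w.strip(".,;:!?()\"'") for w in document.lower().split()]
    return [w for w in words if w.isalpha() and w not in STOPWORDS]

def document_term_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in doc i."""
    counts = [Counter(tokenize(d)) for d in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

docs = ["The cat sat on the mat.", "The dog and the cat played."]
vocab, dtm = document_term_matrix(docs)
```

Each row of the matrix is then an ordinary numeric observation, ready for any classifier; real pipelines add stemming and TF-IDF weighting on top of this.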
We will see that these tools provide specialized libraries which make it possible to carry out a statistical text mining process efficiently.<br /><br /><b>Keywords</b>: text mining, document classification, text categorization, decision tree, j48, linear svm, <a href="http://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection" target="_blank">reuters</a> collection, XML format, stemming, stopwords, document-term matrix<br /><b>Tutorial</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Text_Mining.pdf" target="_blank">en_Tanagra_Text_Mining.pdf</a><br /><b>Dataset</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/text_mining_tutorial.zip" target="_blank">text_mining_tutorial.zip</a><br /><b>References</b>:<br />Wikipedia, "<a href="http://en.wikipedia.org/wiki/Text_categorization" target="_blank">Document classification</a>".<br />S. Weiss, N. Indurkhya, T. Zhang, "Fundamentals of Predictive Text Mining", Springer, 2010. Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-67173768668576039082016-06-25T06:25:00.002+02:002016-06-25T06:25:27.600+02:00Image classification with KnimeThe aim of image mining is to extract valuable knowledge from image data. In the context of supervised image classification, we want to automatically assign a label to an image based on its visual content. The whole process is identical to the standard data mining process. We learn a classifier from a set of labeled images. Then, we can apply the classifier to a new image in order to predict its class membership. The particularity is that we must extract a vector of numerical features from the image before launching the machine learning algorithm, and before applying the classifier in the deployment phase.<br /><br />We deal with an image classification task in this tutorial. The goal is to automatically detect the images which contain a car.
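As a hypothetical illustration of what "extracting a vector of numerical features from the image" can mean (the Knime Image Processing extension offers much richer operators), a grayscale image can be summarized by a normalized intensity histogram:

```python
def histogram_features(image, n_bins=8):
    """Summarize a grayscale image (rows of 0-255 integer pixels) as a
    fixed-length feature vector: a normalized intensity histogram."""
    counts = [0] * n_bins
    n_pixels = 0
    for row in image:
        for pixel in row:
            # map the pixel value to one of the n_bins intensity bins
            counts[min(pixel * n_bins // 256, n_bins - 1)] += 1
            n_pixels += 1
    return [c / n_pixels for c in counts]

# A uniformly dark 4x4 image: all the mass falls into the first bin.
dark = [[10] * 4 for _ in range(4)]
features = histogram_features(dark)
```

Whatever the extractor, the resulting vectors feed a classifier (decision tree, random forest, ...) exactly like ordinary tabular data.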
The main result is that, even though I have only a basic knowledge of image processing, I was able to carry out the analysis with an ease which says much about the usability of Knime in this context.<br /><br /><b>Keywords</b>: image mining, image classification, image processing, feature extraction, decision tree, random forest, knime<br /><b>Tutorial</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Image_Mining_Knime.pdf" target="_blank">en_Tanagra_Image_Mining_Knime.pdf</a><br /><b>Dataset and program (Knime archive)</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/Tuto_Image_Mining.zip" target="_blank">image mining tutorial</a><br /><b>References</b>:<br />Knime Image Processing, <a href="https://tech.knime.org/community/image-processing" target="_blank">https://tech.knime.org/community/image-processing</a><br />S. Agarwal, A. Awan, D. Roth, "UIUC Image Database for Car Detection"; <a href="https://cogcomp.cs.illinois.edu/Data/Car/" target="_blank">https://cogcomp.cs.illinois.edu/Data/Car/</a>Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-63499156579086674062016-06-19T15:24:00.000+02:002016-06-19T15:25:32.321+02:00Gradient boosting (slides)"Gradient boosting" is an ensemble method that generalizes boosting by allowing the use of other loss functions ("standard" boosting implicitly uses an exponential loss function).<br /><br />These slides show the ins and outs of the method. Gradient boosting for regression is detailed first. The classification problem is presented thereafter.<br /><br />The solutions implemented in the packages for R and Python are then studied.<br /><br /><b>Keywords</b>: boosting, regression tree, package gbm, package mboost, package xgboost, R, Python, package scikit-learn, sklearn<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/en/gradient_boosting.pdf" target="_blank">Gradient Boosting</a><br /><b>References</b>:<br />R.
Rakotomalala, "<a href="http://data-mining-tutorials.blogspot.fr/2015/12/bagging-random-forest-boosting-slides.html" target="_blank">Bagging, Random Forest, Boosting</a>", December 2015.<br />Natekin A., Knoll A., "<a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3885826/" target="_blank">Gradient boosting machines, a tutorial</a>", in <i>Frontiers in Neurorobotics</i>, December 2013. Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-28726956402920504582016-06-13T17:22:00.005+02:002016-06-13T17:22:59.083+02:00Tanagra and Sipina add-ins for Excel 2016The add-ins “tanagra.xla” and “sipina.xla” have contributed greatly to the popularity of the Tanagra and Sipina software applications. They incorporate menus dedicated to data mining into Excel. They implement a simple bridge between the data in the spreadsheet and Tanagra or Sipina.<br /><br />I developed and tested the latest versions of the add-ins for Excel 2007 and 2010. I recently had access to Excel 2016 and checked the add-ins.
The conclusion is that the tools work without a hitch.<br /><br /><span style="font-weight: bold;">Keywords</span>: data importation, excel data file, add-in, add-on, xls, xlsx<br /><span style="font-weight: bold;">Tutorial</span>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Add_In_Excel_2016.pdf" target="_blank">en_Tanagra_Add_In_Excel_2016.pdf</a><br /><b>References</b>:<br />Tanagra, "<a href="http://data-mining-tutorials.blogspot.fr/2010/08/tanagra-add-in-for-office-2007-and.html" target="_blank">Tanagra add-in for Excel 2007 and 2010</a>", August 2010.<br />Tanagra, "<a href="http://data-mining-tutorials.blogspot.fr/2016/06/sipina-add-in-for-excel-2007-and-2010.html" target="_blank">Sipina add-in for Excel 2007 and 2010</a>", June 2016.Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-79086815304508916482016-06-12T11:36:00.002+02:002016-06-13T17:24:48.701+02:00Sipina add-in for Excel 2007 and 2010SIPINA is a data mining software package which implements various supervised learning paradigms. It is an old tool, but it is still used because it is the only free tool which provides fully functional interactive decision tree capabilities.<br /><br />This tutorial briefly describes the installation and the use of the "sipina.xla" add-in in Excel 2007. The approach is easily generalized to Excel 2010. A similar document exists for Tanagra. Nevertheless, it seemed necessary to me to clarify the procedure, especially because several users have requested it. Other tutorials exist for earlier versions of Excel (1997-2003) and for Calc (Libre Office and Open Office).<br /><br />A new tutorial will come soon.
It will show that the add-in also operates properly under <a href="http://data-mining-tutorials.blogspot.fr/2016/06/tanagra-and-sipina-add-ins-for-excel.html" target="_blank">Excel 2016</a>.<br /><br /><span style="font-weight: bold;">Keywords</span>: data importation, excel data file, add-in, add-on, xls, xlsx<br /><span style="font-weight: bold;">Tutorial</span>: <a href="http://eric.univ-lyon2.fr/~ricco/doc/en_sipina_excel_addin.pdf" target="_blank">en_sipina_excel_addin.pdf</a><br /><span style="font-weight: bold;">Dataset</span>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/heart.xls" target="_blank">heart.xls</a><br /><span style="font-weight: bold;">References</span>:<br />Tanagra, "<a href="http://data-mining-tutorials.blogspot.fr/2010/08/tanagra-add-in-for-office-2007-and.html" target="_blank">Tanagra add-in for Office 2007 and Office 2010</a>", August 2010.<br />Tanagra, "<a href="http://data-mining-tutorials.blogspot.fr/2016/06/tanagra-and-sipina-add-ins-for-excel.html" target="_blank">Tanagra and Sipina add-ins for Excel 2016</a>", June 2016. Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-25421025910113694012016-04-03T08:46:00.002+02:002016-04-03T08:56:24.870+02:00Categorical predictors in logistic regressionThe aim of logistic regression is to build a model for predicting a binary target attribute from a set of explanatory variables (predictors, independent variables), which may be numeric or categorical. Numeric predictors are treated as such. Categorical ones must be recoded. Dummy coding is undeniably the most popular approach in this context.<br /><br /><span style="color: #d2744c;"><b>The situation becomes more complicated when we perform a feature selection</b></span>. The idea is to determine the predictors that contribute significantly to the explanation of the target attribute. There is no problem when we consider a numeric variable.
It is either excluded or kept in the model. But how should we proceed when we handle a categorical explanatory variable? Should we treat the dichotomous variables associated with a categorical predictor as a whole that must be excluded from or included into the model? Or should we treat each dichotomous variable independently? In that case, how should we interpret the coefficients of the selected dichotomous variables?<br /><br />In this tutorial, we study the approaches proposed by various tools: <span style="color: #38761d;"><b>R 3.1.2</b></span>, <span style="color: #38761d;"><b>SAS 9.3</b></span>, <span style="color: #38761d;"><b>Tanagra 1.4.50</b></span> and <span style="color: #38761d;"><b>SPAD 8.0</b></span>. We will see that the feature selection algorithms rely on criteria which are specific to each software package. We will also see that they use different approaches in the presence of categorical predictor variables.<br /><br /><b>Keywords</b>: logistic regression, dummy coding, categorical predictor variables, feature selection<br /><b>Components</b>: O_1_BINARIZE, BINARY LOGISTIC REGRESSION, BACKWARD-LOGIT<br /><b>Tutorial</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Categorical_Selection_Log_Reg.pdf" target="_blank">Feature selection - Categorical predictors - Logistic Regression</a><br /><b>Dataset</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/heart-c.xlsx" target="_blank">heart-c.xlsx</a><br /><b>References</b>:<br />Wikipedia, "<a href="http://en.wikipedia.org/wiki/Logistic_regression" target="_blank">Logistic Regression</a>" Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-63873325751655162062016-03-31T19:06:00.003+02:002016-04-03T08:50:42.788+02:00Dummy coding for categorical predictor variablesIn this tutorial, we show how to perform a dummy coding of categorical predictor variables in the context of the logistic regression learning process.<br /><br />In fact, this is an
old tutorial that I wrote a long time ago (2007), but it was not referenced in this blog (which was created in 2008). I found it in my archives because I plan to write a tutorial soon about the strategies for the selection of categorical variables in logistic regression. I was wondering whether I had already written something in the past that may be related to this subject (the treatment of categorical predictors in logistic regression). Clearly, I should check my archives more often.<br /><br />We use <span style="color: #38761d;"><b>Tanagra 1.4.50</b></span> in this tutorial.<br /><br /><b>Keywords</b>: logistic regression, dummy coding, categorical predictor variables<br /><b>Components</b>: SAMPLING, O_1_BINARIZE, BINARY LOGISTIC REGRESSION, TEST<br /><b>Tutorial</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Dummy_Coding_for_Logistic_Regression.pdf" target="_blank">Dummy coding - Logistic Regression</a><br /><b>Dataset</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/heart-c.xlsx" target="_blank">heart-c.xlsx</a><br /><b>References</b>:<br />Wikipedia, "<a href="http://en.wikipedia.org/wiki/Logistic_regression" target="_blank">Logistic Regression</a>" Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-71807692234652249632016-03-13T17:22:00.001+01:002016-03-31T19:08:12.691+02:00Cost-Sensitive Learning (slides)This course material presents approaches for taking misclassification costs into account in supervised learning. The baseline method is the one which does not take the costs into account.<br /><br />Two issues are studied: the metric used for the evaluation of the classifier when a misclassification cost matrix is provided, i.e.
the expected cost of misclassification (ECM); and some approaches which guide the machine learning algorithm towards the minimization of the ECM.<br /><br /><b>Keywords</b>: cost matrix, misclassification, expected cost of misclassification, bagging, metacost, multicost<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/en/couts_en_apprentissage_supervise.pdf" target="_blank">Cost Sensitive Learning</a><br /><b>References</b>:<br />Tanagra Tutorial, "<a href="http://data-mining-tutorials.blogspot.fr/2009/03/cost-sensitive-learning-comparison-of.html" target="_blank">Cost-sensitive learning - Comparison of tools</a>", March 2009.<br />Tanagra Tutorial, "<a href="http://data-mining-tutorials.blogspot.fr/2008/11/cost-sensitive-decision-trees.html" target="_blank">Cost-sensitive decision tree</a>", November 2008.Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-4880719043459489122016-03-03T18:43:00.001+01:002016-03-31T19:07:33.857+02:00Hyper-threading and solid-state driveAfter more than 6 years of good and faithful service, I decided to change my computer. It must be said that the former one (Intel Core 2 Quad Q9400 2.66 GHz - 4 cores - running Windows 7 - 64 bit) had begun to make disturbing sounds. I had to play music to cover the rumbling of the beast and be able to work quietly.<br /><br />The choice of the new computer was another matter. I am past the age of the race for raw power, which is necessarily fruitless anyway, given the rapid evolution of PCs. Nevertheless, I was sensitive to two aspects that I could not evaluate previously: is the hyper-threading technology effective when programming multithreaded data mining algorithms? Does the use of temporary files to relieve the memory occupation take advantage of SSD technology?<br /><br />The new PC runs under Windows 8.1 (I wrote the French version of this tutorial one year ago). The processor is a Core i7 4770S (3.1 GHz).
It has 4 physical cores, but 8 logical cores thanks to the hyper-threading technology. The system disk is an SSD. These characteristics allow us to evaluate their influence on (1) the implementation of the multithreaded version of linear discriminant analysis described in a previous paper (“Load balanced multithreading for LDA”, September 2013), where the number of threads used can be specified by the user; (2) the use of temporary files for the decision tree induction algorithm, which enables us to handle very large datasets (“Dealing with very large dataset in Sipina”, January 2010; up to 9,634,198 instances and 41 variables).<br /><br />In this tutorial, we reproduce the two studies using the SIPINA software. Our goal is to evaluate the behavior of these solutions (multithreaded implementation, copy of the data into temporary files to alleviate the memory occupation) on our new machine which, given its characteristics, should take full advantage of them.<br /><br /><span style="font-weight: bold;">Keywords</span>: hyper-threading, ssd disk, solid-state drive, multithread, multithreading, very large dataset, core i7, sipina, decision tree, linear discriminant analysis, lda<br /><span style="font-weight: bold;">Tutorial</span>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Hyperthreading.pdf" target="_blank">en_Tanagra_Hyperthreading.pdf</a><br /><b>References</b>:<br />Tanagra Tutorial, "<a href="http://data-mining-tutorials.blogspot.fr/2013/09/load-balanced-multithreading-for-lda.html" target="_blank">Load balanced multithreading for LDA</a>", September 2013.<br />Tanagra Tutorial, "<a href="http://data-mining-tutorials.blogspot.fr/2010/01/dealing-with-very-large-dataset-in.html" target="_blank">Dealing with very large dataset in Sipina</a>", January 2010.
<br />Tanagra Tutorial, "<a href="http://data-mining-tutorials.blogspot.fr/2010/11/multithreading-for-decision-tree.html" target="_blank">Multithreading for decision tree induction</a>", November 2010.Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-49997547598100766322016-01-04T12:05:00.000+01:002016-01-04T12:05:33.495+01:00Tanagra website statistics for 2015The year 2015 ends, 2016 begins. I wish you all a very happy year 2016.<br /><br />A small statistical report on the website statistics for the past year. All sites (Tanagra, course materials, e-books, tutorials) have been visited 255,386 times this year, 700 visits per day.<br /><br />Since February 1st, 2008, when I installed the Google Analytics counter, there have been 1,848,033 visits (639 visits per day).<br /><br />Who are you? The majority of visits come from France and the Maghreb. They are followed by a large proportion of French-speaking countries, notably because some pages are exclusively in French. Among the non-francophone countries, we observe mainly the United States, India, UK, Germany, Brazil, ...<br /><br />Which pages are visited? The most successful pages are those that relate to documentation about Data Science: course materials, tutorials, links to other documents available online, etc. This is not really surprising. I myself spend more time writing booklets and tutorials, and studying the behavior of different tools, including Tanagra.<br /><br />Happy New Year 2016 to all.<br /><br />Ricco.<br /><span style="font-weight: bold;">Slideshow</span>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Frequentation_2015.pdf" target="_blank">Website statistics for 2015</a> Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-76620897833434822032015-12-31T07:59:00.003+01:002015-12-31T07:59:25.954+01:00R online with R-FiddleR-Fiddle is a programming environment for R available online.
It allows us to write and run programs in R.<br /><br />Although R is free and good free programming environments for R already exist (e.g. R-Studio desktop, Tinn-R), this type of tool has several advantages. It is suitable for mobile users who frequently change machines. As long as we have an Internet connection, we can work on a project without having to worry about the R installation on the PCs. Collaborative work is another context in which this tool can be particularly advantageous. It allows us to avoid the transfer of files and the management of versions. Last, the solution allows us to work on a lightweight front-end, a laptop for example, and offload the computations to a powerful remote server (in the cloud, as we would say today).<br /><br />In this tutorial, we briefly review the features of R-Fiddle.<br /><br /><b>Keywords</b>: R software, R programming, cloud computing, linear discriminant analysis, logistic regression, classification tree, klaR package, rpart package, feature selection<br /><b>Tutorial</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_R_Fiddle.pdf" target="_blank">en_Tanagra_R_Fiddle.pdf</a><br /><b>Files</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_r_fiddle.zip" target="_blank">en_r_fiddle.zip</a><br /><b>References</b>:<br />R-Fiddle - <a href="http://www.r-fiddle.org/#/" target="_blank">http://www.r-fiddle.org/#/</a> Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-26148071246478727802015-12-30T10:39:00.003+01:002015-12-30T10:40:48.996+01:00Random Forest - Boosting with R and PythonThis tutorial follows the slideshow devoted to "<a href="http://data-mining-tutorials.blogspot.fr/2015/12/bagging-random-forest-boosting-slides.html" target="_blank">Bagging, Random Forest and Boosting</a>". We show the implementation of these methods on a data file. We will follow the same steps as the slideshow, i.e.
we first describe the construction of a decision tree, measure its prediction performance, and then see how ensemble methods can improve the results. Various aspects of these methods are highlighted: the measure of the variable importance, the influence of the parameters, the influence of the characteristics of the underlying classifier (e.g. controlling the tree size), etc.<br /><br />As a first step, we focus on R (rpart, adabag and randomforest packages) and Python (scikit-learn package). By programming, we can multiply the analyses; among others, we can evaluate the influence of the parameters on the performance. As a second step, we explore the capabilities of software (<b><span style="color: #38761d;">Tanagra</span></b> and <span style="color: #38761d;"><b>Knime</b></span>) providing turnkey solutions, very simple to implement and more accessible for people who do not like programming.<br /><br /><b>Keywords</b>: R software, R programming, decision tree, classification tree, adabag package, rpart package, randomforest package, Python, scikit-learn package, bagging, boosting, random forest<br /><b>Components</b>: BAGGING, RND TREE, BOOSTING, C4.5, DISCRETE SELECT EXAMPLES<br /><b>Tutorial</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_RandomForest_Boosting.pdf" target="_blank">Bagging, Random Forest and Boosting</a><br /><b>Files</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/boosting_randomforest_en.zip" target="_blank">randomforest_boosting_en.zip</a><br /><b>References</b>:<br />R.
Rakotomalala, "<a href="http://data-mining-tutorials.blogspot.fr/2015/12/bagging-random-forest-boosting-slides.html" target="_blank">Bagging, Random Forest, Boosting (slides)</a>", December 2015.Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-43894264119707385622015-12-23T18:54:00.000+01:002015-12-23T18:54:33.180+01:00Bagging, Random Forest, Boosting (slides)This course material presents ensemble methods: bagging, random forest and boosting. These approaches are based on the same guiding idea: a set of base classifiers learned with the same learning algorithm is fitted to different versions of the dataset.<br /><br />For bagging and random forest, the models are fitted independently on bootstrap samples. Random forest incorporates an additional mechanism in order to “decorrelate” the models, which are necessarily decision trees.<br /><br />Boosting works in a sequential fashion. The model at step (t) is fitted to a weighted version of the sample in order to correct the errors of the model learned at the preceding step (t-1).<br /><br /><b>Keywords</b>: bagging, boosting, random forest, decision tree, rpart package, adabag package, randomforest package, R software<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/en/bagging_boosting.pdf" target="_blank">Bagging - Random Forest - Boosting</a><br /><b>References</b>:<br />Breiman L., "Bagging Predictors", Machine Learning, 26, p. 123-140, 1996.<br />Breiman L., "Random Forests", Machine Learning, 45, p. 5-32, 2001.<br />Freund Y., Schapire R., "Experiments with the new boosting algorithm", International Conference on Machine Learning, p. 148-156, 1996.<br />Zhu J., Zou H., Rosset S., Hastie T., "Multi-class AdaBoost", Statistics and Its Interface, 2, p. 349-360, 2009.
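The bootstrap-and-vote idea behind bagging can be sketched in pure Python with toy one-variable decision stumps — an illustration of the principle only, not the rpart/adabag machinery used in the slides:

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Exhaustively pick the threshold/orientation minimizing training error
    for a one-variable decision stump (classes are 0/1)."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(1 for x, y in zip(xs, ys)
                      if (1 if sign * (x - t) > 0 else 0) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda x: 1 if sign * (x - t) > 0 else 0

def bagging(xs, ys, n_models=25, seed=0):
    """Fit n_models stumps on bootstrap samples; predict by majority vote."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap: n draws with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    def predict(x):
        votes = Counter(m(x) for m in models)        # aggregate by majority vote
        return votes.most_common(1)[0][0]
    return predict
```

Random forest adds the "decorrelation" step on top of this scheme (random feature subsets at each split), while boosting replaces the independent fits with the sequential reweighting described above.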
Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-81694893150391181402015-12-20T17:37:00.003+01:002015-12-20T17:38:43.651+01:00Python - Machine Learning with scikit-learn (slides)This course material presents some modules and classes of scikit-learn, a library for machine learning in Python.<br /><br />As a first step, we focus on a typical classification process: the subdivision of the dataset into training and test sets; the learning of the logistic regression on the training sample; the application of the model to the test set in order to obtain the predicted class values; the evaluation of the classifier using the confusion matrix and the calculation of the performance measurements.<br /><br />As a second step, we study other important aspects of the classification task: the cross-validation error evaluation when we deal with a small dataset; the scoring process for direct marketing; the grid search for detecting the optimal parameters of the algorithms for a given dataset; the feature selection issue.<br /><br /><b>Keywords</b>: python, numpy, pandas, scikit-learn, logistic regression, predictive analytics<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/PJ%20-%20en%20-%20machine%20learning%20avec%20scikit-learn.pdf" target="_blank">Machine Learning with scikit-learn</a><br /><b>Dataset and programs:</b> <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/fichiers/Python%20-%20en%20-%20J.zip" target="_blank">scikit-learn - Programs and dataset</a><br /><b>References</b>:<br />"scikit-learn -- Machine Learning in Python" on <a href="http://scikit-learn.org/stable/" target="_blank">scikit-learn.org</a><br /><a href="https://www.python.org/" target="_blank">Python - Official Site</a> Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-66652245315243799652015-12-08T21:21:00.004+01:002015-12-08T21:21:42.001+01:00Python - Statistics with SciPy (slides)This course material presents the use of
some modules of SciPy, a library for scientific computing in Python. We especially study the stats package, which allows us to perform statistical tests such as the comparison of means for independent and related samples, the comparison of variances, or the measurement of the association between two variables. We also study the cluster package, especially the k-means and the hierarchical agglomerative clustering algorithms.<br /><br />SciPy handles the NumPy <a href="http://data-mining-tutorials.blogspot.fr/2015/10/python-handling-vectors-with-numpy.html" target="_blank">vectors</a> and <a href="http://data-mining-tutorials.blogspot.fr/2015/10/python-handling-matrices-with-numpy.html" target="_blank">matrices</a> which were presented previously.<br /><br /><b>Keywords</b>: python, numpy, scipy, descriptive statistics, cumulative distribution functions, sampling, random number generator, normality test, test for comparing populations, pearson correlation, spearman correlation, cluster analysis, k-means, hac, dendrogram<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/PI%20-%20en%20-%20statistiques%20avec%20scipy.pdf" target="_blank">scipy.stats and scipy.cluster</a><br /><b>Dataset and programs:</b> <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/fichiers/Python%20-%20en%20-%20I.zip" target="_blank">SciPy - Programs and dataset</a><br /><b>References</b>:<br /><a href="http://docs.scipy.org/doc/scipy/reference/" target="_blank">SciPy Reference Guide</a> on SciPy.org<br /><a href="https://www.python.org/" target="_blank">Python - Official Site</a> Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-87958124805384136432015-10-28T08:02:00.002+01:002015-10-28T08:02:30.055+01:00Python - Handling matrices with NumPy (slides)This course material presents the manipulation of matrices using NumPy. The array type is common to vectors and matrices.
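As a minimal sketch (assuming NumPy is installed), a two-dimensional array behaves as a matrix and supports the usual linear algebra operators:

```python
import numpy as np

# A 2x2 matrix as a two-dimensional array (rows x columns)
A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

A_inv = np.linalg.inv(A)            # matrix inversion
product = A @ A_inv                 # matrix product: A times its inverse gives I
eigenvalues = np.linalg.eigvals(A)  # eigenvalues of A
```

The sum of the eigenvalues equals the trace of A (here 10) and their product equals its determinant (4×6 − 7×2 = 10), which gives a quick sanity check on the result.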
The distinctive feature is the addition of a second dimension, so that the values are organized within a rows x columns structure.<br /><br />Matrices pave the way for operators which play a fundamental role in statistical modeling and exploratory data analysis (e.g. matrix inversion, solving systems of equations, calculation of eigenvalues and eigenvectors, singular value decomposition, etc.).<br /><br /><b>Keywords</b>: python language, numpy, vector, matrix, array, creation, extraction<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/PH%20-%20en%20-%20matrices%20avec%20numpy.pdf" target="_blank">NumPy Matrices</a><br /><b>Datasets and programs:</b> <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/fichiers/Python%20-%20en%20-%20H.zip" target="_blank">Matrices</a><br /><b>References</b>:<br /><a href="http://docs.scipy.org/doc/numpy/reference/" target="_blank">NumPy Reference</a> on SciPy.org<br />Haenel, Gouillart, Varoquaux, "<a href="https://scipy-lectures.github.io/" target="_blank">Python Scientific Lecture Notes</a>".<br /><a href="https://www.python.org/" target="_blank">Python - Official Site</a> Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-10900977540492425952015-10-08T11:05:00.003+02:002015-10-08T11:05:20.732+02:00Python - Handling vectors with NumPy (slides)Python is becoming more and more popular in the eyes of Data Scientists. I decided to introduce Statistical Programming in Python among my teachings at the University (<a href="http://eric.univ-lyon2.fr/~ricco/cours/cours_programmation_python.html" target="_blank">reference page in French</a>).<br /><br />This first course material describes the handling of vectors with the NumPy library.
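A minimal sketch of this vector handling (creation, vectorized arithmetic, extraction), assuming NumPy is installed:

```python
import numpy as np

v = np.array([2.0, 5.0, 1.0, 8.0])   # creation from a Python list

doubled = v * 2                      # vectorized arithmetic: no explicit loop
first_two = v[:2]                    # slicing, as with Python lists
large = v[v > 1.5]                   # extraction with a boolean condition
```

Boolean indexing such as `v[v > 1.5]` is the idiom that most closely mirrors R's `v[v > 1.5]` on its own vectors.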
Their structure and functionality show a certain similarity with vectors in R.<br /><br /><b>Keywords</b>: python language, numpy, vector, array, creation, extraction<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/PG%20-%20en%20-%20numpy%20vectors.pdf" target="_blank">NumPy Vectors</a><br /><b>Datasets and programs:</b> <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/fichiers/Python%20-%20en%20-%20G.zip" target="_blank">Vectors</a><br /><b>References</b>:<br /><a href="http://docs.scipy.org/doc/numpy/reference/" target="_blank">NumPy Reference</a> on SciPy.org<br />Haenel, Gouillart, Varoquaux, "<a href="https://scipy-lectures.github.io/" target="_blank">Python Scientific Lecture Notes</a>".<br /><a href="https://www.python.org/" target="_blank">Python - Official Site</a> Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-31348658429703728782015-06-02T19:14:00.000+02:002015-06-11T06:56:32.755+02:00Cross-validation, leave-one-out, bootstrap (slides)In supervised learning, it is commonly accepted that one should not use the same sample to build a predictive model and to estimate its error rate. The error obtained under these conditions - called the resubstitution error rate - is (very often) too optimistic, leading one to believe that the model will perform excellently in prediction.<br /><br />A typical approach is to divide the data into 2 parts (holdout approach): a first sample, called the train sample, is used to construct the model; a second sample, called the test sample, is used to measure its performance. The measured error rate honestly reflects the model's behavior in generalization. Unfortunately, on small datasets, this approach is problematic. By reducing the amount of data presented to the learning algorithm, we cannot correctly learn the underlying relation between the descriptors and the class attribute.
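The holdout partition at issue can be sketched as follows (a minimal illustration of the splitting step only; the classifier and the error-rate computation are left out):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                                   # a small dataset, where holdout hurts
idx = rng.permutation(n)                  # shuffle the observation indices
n_train = int(0.7 * n)
train_idx, test_idx = idx[:n_train], idx[n_train:]

# The learning algorithm sees only 70 observations, and the error rate
# is then estimated on the remaining 30 observations.
print(len(train_idx), len(test_idx))  # 70 30
```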
At the same time, the part devoted to testing remains limited, so the measured error has a high variance.<br /><br />In this document, I present resampling techniques (cross-validation, leave-one-out and bootstrap) for estimating the error rate of the model constructed from the totality of the available data. A study on simulated data (the "waves" dataset; Breiman et al., 1984) is used to analyze the behavior of these approaches with various learning algorithms (decision trees, linear discriminant analysis, neural networks [perceptron]).<br /><br /><b>Keywords</b>: resampling, cross-validation, leave-one-out, bootstrap, error rate estimation, holdout, resubstitution, train, test, learning sample<br /><b>Components (Tanagra)</b>: CROSS-VALIDATION, BOOTSTRAP, TEST, LEAVE-ONE-OUT<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/en/resampling_evaluation.pdf" target="_blank">Error rate estimation</a><br /><b>References</b>:<br />A. Molinaro, R. Simon, R. Pfeiffer, « <a href="http://bioinformatics.oxfordjournals.org/content/21/15/3301.full" target="_blank">Prediction error estimation: a comparison of resampling methods</a> », in Bioinformatics, 21(15), pages 3301-3307, 2005.<br />Tanagra tutorial, "<a href="http://data-mining-tutorials.blogspot.fr/2009/07/resampling-methods-for-error-estimation.html" target="_blank">Resampling methods for error estimation</a>", July 2009.Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-74376742396580338032015-04-11T08:37:00.000+02:002015-04-11T08:37:25.959+02:00R programming under HadoopThe aim of this tutorial is to show how to program the famous "word count" algorithm from a set of files stored in the HDFS file system.<br /><br />The "word count" is the standard introductory example for programming under Hadoop. It is described everywhere on the web. But, unfortunately, the tutorials which describe the task are often not reproducible. The datasets are not available.
The whole process, including the installation of the Hadoop framework, is not described. We do not know how to access the files stored in the HDFS file system. In short, we cannot run the programs and understand in detail how they work.<br /><br />In this tutorial, we describe the whole process. We first detail the installation of a virtual machine which contains a single-node Hadoop cluster. Then we show how to install R and RStudio Server, which allow us to write and run a program. Last, we write some programs based on the MapReduce scheme.<br /><br />The steps, and therefore the sources of errors, are numerous. We use many screenshots so that each operation can actually be understood. This is the reason for the unusual presentation format of this tutorial.<br /><br /><span style="font-weight: bold;">Keywords</span>: big data, big data analytics, mapreduce, package rmr2, package rhdfs, hadoop, rhadoop, R software, rstudio, rstudio server, cloudera, R language<br /><span style="font-weight: bold;">Tutorial</span>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Hadoop_with_R.pdf" target="_blank">en_Tanagra_Hadoop_with_R.pdf</a><span style="text-decoration: underline;"></span><br /><span style="font-weight: bold;">Files</span>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/hadoop_with_r.zip" target="_blank">hadoop_with_r.zip</a> <br /><b>References</b>:<br />Tanagra Tutorial, "<a href="http://data-mining-tutorials.blogspot.fr/2015/02/mapreduce-with-r.html" target="_blank">MapReduce with R</a>", Feb. 2015. <br />Hugh Devlin, "<a href="https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md" target="_blank">Mapreduce in R</a>", Jan. 2014.Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-68367624136988178322015-02-22T09:46:00.003+01:002015-02-26T07:32:38.301+01:00MapReduce with RBig Data has been a very popular topic in recent years.
Big data analytics refers to the process of discovering useful information or knowledge from big data. That is an important issue for organizations. In concrete terms, the aim is to extend, adapt or even create novel exploratory data analysis or data mining approaches for new data sources whose main characteristics are “volume”, “variety” and “velocity”.<br /><br />Distributed computing is essential in the big data context. It is illusory to try to increase the power of servers indefinitely to keep up with the exponential growth of the information to process. The solution relies on the efficient cooperation of a myriad of networked computers, ensuring both volume management and computing power. Hadoop is a solution commonly cited for this requirement. It is a set of algorithms (an open-source software framework written in Java) for distributed storage and distributed processing of very large data sets (Big Data) on computer clusters built from commodity hardware. For the implementation of distributed programs, the MapReduce programming model plays an important role. The processing of large datasets can be implemented with parallel algorithms on a cluster of connected computers (nodes).<br /><br />In this tutorial, we are interested in MapReduce programming in R. We use the RHadoop technology from Revolution Analytics. The "rmr2" package in particular allows one to learn MapReduce programming without having to install the Hadoop environment, which is complicated enough in itself. There are some tutorials about this subject on the web. The one by Hugh Devlin (January 2014) is undoubtedly one of the most interesting. But it is perhaps too sophisticated for students who are not very familiar with programming in R. So I decided to start afresh with very simple examples.
Then we progress by programming a simple data mining algorithm, multiple linear regression.<br /><br /><b>Keywords</b>: big data, big data analytics, mapreduce, rmr2 package, hadoop, rhadoop, one-way anova, linear regression<br /><b>Tutorial</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_MapReduce.pdf" target="_blank">en_Tanagra_MapReduce.pdf</a><br /><b>Dataset</b>: <a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_mapreduce_with_r.zip" target="_blank">en_mapreduce_with_r.zip</a><br /><b>References</b>:<br />Hugh Devlin, "<a href="https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md" target="_blank">Mapreduce in R</a>", Jan. 2014. <br />Tanagra Tutorial, "<a href="http://data-mining-tutorials.blogspot.fr/2013/10/parallel-programming-in-r.html" target="_blank">Parallel programming in R</a>", October 2013. Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-77727219996010166022014-12-12T11:01:00.003+01:002014-12-12T11:01:48.543+01:00Correlation analysis (slides)The aim of correlation analysis is to characterize the existence, the nature and the strength of the relationship between two quantitative variables. The visual inspection of scatter plots is a prime instrument in a first step, when we have no idea about the form of the underlying relationship between the variables. But, in a second step, we need statistical tools to measure the strength of the relationship and to assess its significance.<br /><br />In these slides, we present the Pearson product-moment correlation. We show how to estimate its value from a sample. We present the inferential tools which enable hypothesis testing and confidence interval estimation.<br /><br />But the Pearson correlation is appropriate only for characterizing linear relationships.
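This limitation is easy to see with scipy.stats (a small hypothetical sketch):

```python
import numpy as np
from scipy import stats

x = np.arange(1.0, 11.0)

# On an exactly linear relationship, Pearson's r reaches 1
r_linear, _ = stats.pearsonr(x, 2 * x + 1)

# On a perfectly monotonic but nonlinear relationship, it falls below 1
r_nonlinear, _ = stats.pearsonr(x, np.exp(x))

print(round(r_linear, 6))   # 1.0
print(r_nonlinear < 0.95)   # True: the strength of the link is understated
```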
We study possible solutions for problematic situations with, among others, the Spearman rank correlation coefficient (Spearman's rho).<br /><br />Last, the partial correlation coefficient and the related inferential tools are described.<br /><br /><b>Keywords</b>: correlation, partial correlation, pearson, spearman, hypothesis testing, significance, confidence interval<br /><b>Components (Tanagra)</b>: LINEAR CORRELATION<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/cours/english/Analyse_de_Correlation.pdf" target="_blank">Correlation analysis</a><br /><b>References</b>:<br />M. Plonsky, “<a href="http://www4.uwsp.edu/psych/stat/7/correlat.htm" target="_blank">Correlation</a>”, Psychological Statistics, 2014.Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-21585521162457904332014-12-02T12:05:00.002+01:002014-12-02T12:05:45.264+01:00Clustering of categorical variables (slides)The aim of clustering categorical variables is to group variables according to their relationships. The variables in the same cluster are highly related; variables in different clusters are weakly related. In these slides, we describe an approach based on Cramér’s V measure of association. We observe that the approach can highlight subsets of variables, which is useful - for instance - in a variable selection process for a subsequent supervised learning task. But, on the other hand, we have no indication about the nature of these associations. The interpretation of the groups is not obvious.<br /><br />This leads us to deepen the analysis and to take an interest in the clustering of the categories of nominal variables. An approach based on a measure of similarity between categories using indicator variables (dummy variables) is described. Other approaches are also reviewed.
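The Cramér's V measure underlying the first approach can be sketched from a contingency table (hypothetical data; scipy's chi2_contingency stands in for Tanagra's internal computation):

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table crossing two categorical variables
table = np.array([[30, 10],
                  [10, 30]])

chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
n = table.sum()
k = min(table.shape) - 1       # min(rows, columns) - 1
v = np.sqrt(chi2 / (n * k))    # Cramer's V, between 0 and 1
print(v)  # 0.5 for this table
```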
The main advantage of this kind of analysis (clustering of categories) is that we can easily interpret the underlying nature of the groups.<br /><br /><b>Keywords</b>: categorical variables, qualitative variables, categories, clustering, clustering variables, latent variable, cramer's v, dice's index, clusters, groups, bottom-up, hierarchical agglomerative clustering, hac, top down, mca, multiple correspondence analysis<br /><b>Components (Tanagra)</b>: CATVARHCA<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/en/classif_variables_quali.pdf" target="_blank">Clustering of categorical variables</a><br /><b>References</b>:<br />H. Abdallah, G. Saporta, « Classification d’un ensemble de variables qualitatives » (Clustering of a set of categorical variables), in Revue de Statistique Appliquée, Tome 46, N°4, pp. 5-26, 1998.<br />F. Harrell Jr, « <a href="http://cran.r-project.org/web/packages/Hmisc/index.html" target="_blank">Hmisc: Harrell Miscellaneous</a> », version 3.14-5.Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-80755158326705118592014-11-19T17:31:00.003+01:002014-11-19T17:38:41.417+01:00Discretization of continuous attributes (slides)Discretization consists in transforming a continuous attribute into a discrete (ordinal) attribute. The process determines a finite number of intervals from the available values, to which discrete numerical values are assigned.
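For instance, the unsupervised equal-width variant can be sketched with NumPy (an illustration, not the slides' implementation):

```python
import numpy as np

x = np.array([2.0, 3.5, 5.0, 7.5, 8.0, 9.5])
k = 3                                          # chosen number of intervals
edges = np.linspace(x.min(), x.max(), k + 1)   # equally spaced cut points
codes = np.digitize(x, edges[1:-1])            # ordinal codes 0 .. k-1

print(edges)   # [2.  4.5 7.  9.5]
print(codes)   # [0 0 1 2 2 2]
```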
The two main issues of the process are how to determine the number of intervals and how to determine the cut points.<br /><br />In these slides, we present some discretization methods for the unsupervised and supervised contexts.<br /><br /><b>Keywords</b>: discretization, data preprocessing, chi-merge, mdlpc, equal-frequency, equal-width, clustering, top-down, bottom-up, feature construction<br /><b>Components (Tanagra)</b>: EQFREQ DISC, EQWIDTH DISC, MDLPC, BINARY BINNING, CONT TO DISC<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/en/discretisation.pdf" target="_blank">Discretization</a><br /><b>Tutorials</b>:<br />Tanagra Tutorials, "<a href="http://data-mining-tutorials.blogspot.fr/2010/05/discretization-of-continuous-features.html" target="_blank">Discretization of continuous features</a>", May 2010. Tanagranoreply@blogger.comtag:blogger.com,1999:blog-5496815755861370799.post-25048382436072428772014-09-24T18:11:00.001+02:002014-09-24T18:11:11.701+02:00Clustering variables (slides)The aim of clustering variables is to divide a set of numeric variables into disjoint clusters (subsets of variables). In these slides, we present an approach based on the concept of a latent component. A subset of variables is summarized by a latent component, which is the first factor from a principal component analysis. This is a kind of "centroid" variable which maximizes the sum of the squared correlations with the existing variables.
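The latent component idea can be sketched with NumPy on synthetic data (an illustration under assumed data, not the dedicated VARCLUS-style implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
signal = rng.normal(size=200)
# Three noisy copies of the same signal: a homogeneous cluster of variables
X = np.column_stack([signal + 0.3 * rng.normal(size=200) for _ in range(3)])

Z = (X - X.mean(axis=0)) / X.std(axis=0)     # standardized variables
corr = np.corrcoef(Z, rowvar=False)          # correlation matrix
eigvals, eigvecs = np.linalg.eigh(corr)
latent = Z @ eigvecs[:, -1]                  # first principal component

# Sum of squared correlations between the latent component and the variables:
# this equals the leading eigenvalue, and approaches 3 for a tight cluster
r2 = sum(np.corrcoef(latent, Z[:, j])[0, 1] ** 2 for j in range(3))
print(round(r2, 3))  # close to 3: the cluster is homogeneous
```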
Various clustering algorithms based on this idea are described: a hierarchical agglomerative algorithm; a top-down approach; and an approach inspired by the k-means method.<br /><br /><b>Keywords</b>: clustering, clustering variables, latent variable, latent component, clusters, groups, bottom-up, hierarchical agglomerative clustering, top down, varclus, k-means, pca, principal component analysis<br /><b>Components (Tanagra)</b>: VARHCA, VARKMEANS, VARCLUS<br /><b>Slides</b>: <a href="http://eric.univ-lyon2.fr/~ricco/cours/slides/en/classification_de_variables.pdf" target="_blank">Clustering variables</a><br /><b>Tutorials</b>:<br />Tanagra tutorials, "<a href="http://data-mining-tutorials.blogspot.fr/2008/11/variable-clustering-varclus.html" target="_blank">Variable clustering (VARCLUS)</a>", 2008. Tanagranoreply@blogger.com
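A toy version of the bottom-up (agglomerative) variant can be sketched with scipy's generic hierarchical clustering, using 1 - r² as the distance between variables (hypothetical data; the VARHCA component implements its own criterion):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
n = 300
a, b = rng.normal(size=n), rng.normal(size=n)
# Four variables built around two independent signals -> two natural clusters
X = np.column_stack([a + 0.2 * rng.normal(size=n),
                     a + 0.2 * rng.normal(size=n),
                     b + 0.2 * rng.normal(size=n),
                     b + 0.2 * rng.normal(size=n)])

corr = np.corrcoef(X, rowvar=False)
dist = squareform(1.0 - corr ** 2, checks=False)  # condensed distance matrix
tree = linkage(dist, method="average")            # agglomerative clustering
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels[0] == labels[1], labels[2] == labels[3], labels[0] != labels[2])
```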