<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-5496815755861370799</id><updated>2012-02-10T04:30:59.895-08:00</updated><category term='Diagram management'/><category term='Feature Construction'/><category term='Association rules'/><category term='Supervised Learning'/><category term='Data file handling'/><category term='Sipina'/><category term='PLS Regression'/><category term='Software Comparison'/><category term='Statistical methods'/><category term='Feature Selection'/><category term='Clustering'/><category term='Exploratory Data Analysis'/><category term='Tanagra'/><category term='Decision tree'/><category term='Regression analysis'/><title type='text'>Tanagra - Data Mining Tutorials</title><subtitle type='html'>This Web log maintains an alternative layout of the tutorials about Tanagra. Each entry describes shortly the subject, it is followed by the link to the tutorial (pdf) and the dataset. The technical references (book, papers, website,...) are also provided. 
In some tutorials, we compare the results of Tanagra with other free software such as Knime, Orange, R software, RapidMiner (Yale), Sipina or Weka.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://data-mining-tutorials.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default?start-index=101&amp;max-results=100'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>157</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-3524244788060854256</id><published>2012-02-10T04:30:00.000-08:00</published><updated>2012-02-10T04:30:59.909-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Feature Selection'/><title type='text'>Logistic regression on large dataset</title><content type='html'>The programming of fast and reliable tools is a constant challenge for a computer scientist. In the data mining context, this leads to a better capacity to handle large datasets. When we build the final model that we want to deploy, the quickness is not really important. 
But in the exploratory phase, where we search for the best model, it is decisive. It improves our chances of obtaining the best model simply because we can try more configurations.&lt;br /&gt;&lt;br /&gt;I have tried many solutions to improve the calculation time of the logistic regression. In fact, I think the performance rests heavily on the optimization algorithm used. The source code of Tanagra shows that I have hesitated a great deal. Some studies have helped me make the right choice.&lt;br /&gt;&lt;br /&gt;Several tools offer logistic regression. It is interesting to compare their calculation times and memory occupation. I have already carried out this kind of comparison in the past. The novelty here is that I use a new operating system (64-bit version of Windows 7), and some tools are especially intended for this system. The computing capabilities are greatly improved for these tools. For this reason, I have increased the dataset size. Moreover, to make the variable selection process more difficult, I added predictive attributes that are correlated to the original descriptors, but not to the class attribute. 
They should not be selected in the final model.&lt;br /&gt;&lt;br /&gt;In this paper, in addition to &lt;b&gt;&lt;span style="color: #6aa84f;"&gt;Tanagra 1.4.14&lt;/span&gt;&lt;/b&gt; (32 bit), we use &lt;b&gt;&lt;span style="color: #6aa84f;"&gt;R 2.13.2&lt;/span&gt;&lt;/b&gt; (64 bit), &lt;b style="color: #6aa84f;"&gt;Knime 2.4.2&lt;/b&gt; (64 bit), &lt;b style="color: #6aa84f;"&gt;Orange 2.0b&lt;/b&gt; (build 15 oct2011, 32 bit) and &lt;b&gt;&lt;span style="color: #6aa84f;"&gt;Weka 3.7.5&lt;/span&gt;&lt;/b&gt; (64 bit).&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Keywords&lt;/b&gt;: logistic regression, software comparison, glm, stepAIC, R software, knime, orange, weka&lt;br /&gt;&lt;b&gt;Components&lt;/b&gt;: BINARY LOGISTIC REGRESSION, FORWARD LOGIT&lt;br /&gt;&lt;b&gt;Tutorial&lt;/b&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Perfs_Bis_Logistic_Reg.pdf" target="_blank"&gt;en_Tanagra_Perfs_Bis_Logistic_Reg.pdf&lt;/a&gt;&lt;br /&gt;&lt;b&gt;Dataset&lt;/b&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/perfs_bis_logistic_reg.zip" target="_blank"&gt;perfs_bis_logistic_reg.zip&lt;/a&gt;&lt;br /&gt;&lt;b&gt;References&lt;/b&gt;:&lt;br /&gt;Tanagra, "&lt;a href="http://data-mining-tutorials.blogspot.com/2008/12/logistic-regression-software-comparison.html"&gt;Logistic regression - Software comparison&lt;/a&gt;", December 2008.&lt;br /&gt;T.P. 
Minka, « &lt;a href="http://research.microsoft.com/en-us/um/people/minka/papers/logreg/minka-logreg.pdf" target="_blank"&gt;A comparison of numerical optimizers for logistic regression&lt;/a&gt; », 2007.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-3524244788060854256?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3524244788060854256'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3524244788060854256'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2012/02/logistic-regression-on-large-dataset.html' title='Logistic regression on large dataset'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-642080682076645612</id><published>2012-02-04T01:32:00.000-08:00</published><updated>2012-02-04T01:33:35.043-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Tanagra'/><category scheme='http://www.blogger.com/atom/ns#' term='Data file handling'/><title type='text'>Tanagra - Version 1.4.42</title><content type='html'>The &lt;span style="color: #cc0000;"&gt;Tanagra.xla add-in&lt;/span&gt; for Excel can work now for both the&lt;span style="color: #93c47d;"&gt; &lt;span style="color: #cc0000;"&gt;32 and 64-bit versions of EXCEL&lt;/span&gt;&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;With the FastMM memory manager, &lt;span style="color: #cc0000;"&gt;Tanagra can address up to 3 GB under 32-bit Windows and 4 GB under 64-bit Windows&lt;/span&gt;. 
The processing capabilities, especially for handling large datasets, are improved.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #cc0000;"&gt;The import of the tab-delimited text file format and xls file format (Excel 97-2003) is made safer&lt;/span&gt;. Previously, the import was interrupted and the dataset was truncated when an invalid line was read (with missing or inconsistent values). Now, Tanagra skips the line and continues with the next rows. The number of skipped lines is given in the import report.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Download page&lt;/span&gt; : &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/en/contenu_telechargement_logiciel_tanagra.html" target="_blank"&gt;setup&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-642080682076645612?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/642080682076645612'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/642080682076645612'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2012/02/tanagra-version-1442.html' title='Tanagra - Version 1.4.42'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-5498871432178695292</id><published>2012-01-18T07:15:00.000-08:00</published><updated>2012-01-19T12:28:42.962-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Association rules'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><title type='text'>ARS into the 
SIPINA package</title><content type='html'>&lt;span style="color: #6aa84f;"&gt;Association Rule Software&lt;/span&gt; (ARS) is a basic tool which extracts association rules from attribute-value datasets (categorical or binary attributes). It is distributed with the SIPINA package, which includes: a tool for supervised learning, especially decision tree induction (SIPINA RESEARCH); a tool for linear regression (REGRESS); and ARS for association rule mining.&lt;br /&gt;&lt;br /&gt;ARS automatically encodes the categorical attributes as dummy variables. If you want to use continuous attributes, you must discretize them beforehand.&lt;br /&gt;&lt;br /&gt;This tutorial briefly describes the use of the Association Rule Software (ARS). Compared with the previous version, the GUI of the one&lt;span style="color: lime;"&gt; &lt;span style="color: #38761d;"&gt;incorporated into the SIPINA 3.8&lt;/span&gt;&lt;/span&gt; package is simplified.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Keywords&lt;/b&gt;: association rule mining, support, confidence, lift, conviction &lt;br /&gt;&lt;b&gt;Download&lt;/b&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/sipina_download.html" target="_blank"&gt;Sipina setup file&lt;/a&gt;&lt;br /&gt;&lt;b&gt;Tutorial&lt;/b&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/doc/english_how_to_use_assoc_rule_soft.pdf" target="_blank"&gt;How to use ARS&lt;/a&gt;&lt;br /&gt;&lt;b&gt;References&lt;/b&gt;:&lt;br /&gt;Wikipedia - &lt;a href="http://en.wikipedia.org/wiki/Association_rule_learning" target="_blank"&gt;Association rule learning&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-5498871432178695292?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5498871432178695292'/><link rel='self' 
type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5498871432178695292'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2012/01/ars-into-sipina-package.html' title='ARS into the SIPINA package'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-838180549957383950</id><published>2012-01-18T07:08:00.000-08:00</published><updated>2012-01-19T12:28:27.801-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Association rules'/><category scheme='http://www.blogger.com/atom/ns#' term='Regression analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><title type='text'>Sipina - Version 3.8</title><content type='html'>The tools (SIPINA RESEARCH, REGRESS and ASSOCIATION RULE SOFTWARE) included in the SIPINA distribution have been updated with some improvements.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #990000;"&gt;SIPINA.XLA&lt;/span&gt;. The add-in for Excel now works with both the 32-bit and 64-bit versions of Excel.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #990000;"&gt;Importation of text data files&lt;/span&gt;. Processing time has been improved. This improvement also reduces the transfer time when we use the SIPINA.XLA add-in for Excel (which uses a temporary file in the text file format).&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #990000;"&gt;Association rule software&lt;/span&gt;. 
The GUI has been simplified; the display of the rules is made more readable.&lt;br /&gt;&lt;br /&gt;Because they are internally based on the FastMM memory management, &lt;span style="color: #93c47d;"&gt;these tools can address up to 3 GB under 32-bit Windows and 4 GB under 64-bit Windows&lt;/span&gt;. The processing capabilities are improved.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Keywords&lt;/b&gt;: sipina, decision tree induction, association rule, multiple linear regression&lt;br /&gt;&lt;b&gt;Sipina website&lt;/b&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/sipina.html" target="_blank"&gt;Sipina&lt;/a&gt;&lt;br /&gt;&lt;b&gt;Download&lt;/b&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/sipina_download.html" target="_blank"&gt;Setup file&lt;/a&gt;&lt;br /&gt;&lt;b&gt;References&lt;/b&gt;:&lt;br /&gt;Tanagra - &lt;a href="http://data-mining-tutorials.blogspot.com/2010/08/sipina-add-in-for-excel.html"&gt;SIPINA add-in for Excel&lt;/a&gt;&lt;br /&gt;Tanagra - &lt;a href="http://data-mining-tutorials.blogspot.com/2010/08/tanagra-add-in-for-office-2007-and.html"&gt;Tanagra add-in for Excel 2007 and 2010&lt;/a&gt;&lt;br /&gt;Delphi Programming Resource - &lt;a href="http://delpres.blogspot.com/2008/07/fastmm-fast-memory-manager-replacement.html" target="_blank"&gt;FastMM, a Fast Memory Manager &lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-838180549957383950?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/838180549957383950'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/838180549957383950'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2012/01/sipina-version-38.html' title='Sipina - Version 
3.8'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-4312137056488766017</id><published>2012-01-02T07:28:00.000-08:00</published><updated>2012-01-02T07:28:23.298-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Tanagra'/><title type='text'>Tanagra website statistics for 2011</title><content type='html'>The year 2011 ends, 2012 begins. I wish you all a very happy year 2012.&lt;br /&gt;&lt;br /&gt;Here is a short report on the website statistics for the past year. All the sites together (Tanagra, course materials, e-books, tutorials) were visited 281,352 times this year, i.e. 770 visits per day. For comparison, we had 662 daily visits in 2010, 520 in 2009 and 349 in 2008.&lt;br /&gt;&lt;br /&gt;Who are you? The majority of visits come from France and the Maghreb, followed by other French-speaking countries. Among non-francophone countries, we mainly observe the United States, India, the UK, Italy, Brazil, Germany,...&lt;br /&gt;&lt;br /&gt;Which pages are visited? The most successful pages are those that relate to documentation about data mining: course materials, tutorials, links to other documents available online, etc. This is not really surprising. 
I myself spend more time writing booklets and tutorials and studying the behavior of different software, including Tanagra.&lt;br /&gt;&lt;br /&gt;Happy New Year 2012 to all.&lt;br /&gt;&lt;br /&gt;Ricco.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Slideshow&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Frequentation_2011.pdf" target="_blank"&gt;Website statistics for 2011&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-4312137056488766017?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4312137056488766017'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4312137056488766017'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2012/01/tanagra-website-statistics-for-2011.html' title='Tanagra website statistics for 2011'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-6490662296817463764</id><published>2011-12-30T23:07:00.000-08:00</published><updated>2011-12-30T23:07:52.569-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data file handling'/><title type='text'>Tanagra add-in for Excel 2010 - 64-bit version</title><content type='html'>The current Tanagra.xla add-in works with the 32-bit version of Excel (up to Excel 2010), even under a 64-bit version of Windows. On the other hand, it does not work if we want to connect the 64-bit version of Excel to Tanagra. We must modify the add-in source code. 
&lt;span style="color: #6aa84f;"&gt;These modifications are needed up to version 1.4.41 of Tanagra. They will be included automatically in upcoming versions&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;In this tutorial, we show the procedure to be followed for this upgrade. The screenshots were taken with a French version of Excel 2007, but I think (I hope) that the adaptation to other versions (Excel 2010 and/or other languages) is easy.&lt;br /&gt;&lt;br /&gt;Many thanks to Mrs. Nathalie Jourdan-Salloum, who pointed out this problem and suggested the right solution to me.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Keywords&lt;/b&gt;: data importation, xls, xlsx, excel file format, macro-complémentaire, add-in, addin, add-on&lt;br /&gt;&lt;b&gt;Tutorial&lt;/b&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Addin_Excel_64_bit.pdf" target="_blank"&gt;en_Tanagra_Addin_Excel_64_bit.pdf&lt;/a&gt;&lt;br /&gt;&lt;b&gt;References&lt;/b&gt;: &lt;br /&gt;Tanagra, "&lt;a href="http://data-mining-tutorials.blogspot.com/2010/08/tanagra-add-in-for-office-2007-and.html"&gt;Tanagra add-in for Office 2007 and Office 2010&lt;/a&gt;".&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-6490662296817463764?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6490662296817463764'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6490662296817463764'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2011/12/tanagra-add-in-for-excel-2010-64-bit.html' title='Tanagra add-in for Excel 2010 - 64-bit version'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-446923616892985981</id><published>2011-12-11T08:24:00.001-08:00</published><updated>2011-12-11T08:30:06.496-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><title type='text'>Dealing with very large dataset (continuation)</title><content type='html'>Because I have recently updated my operating system (OS), I am wondering how the 64-bit versions of &lt;b style="color: #990000;"&gt;Knime 2.4.2&lt;/b&gt; and &lt;b style="color: #990000;"&gt;RapidMiner 5.1.011&lt;/b&gt; could handle a very large dataset, which cannot be loaded into main memory on a 32-bit OS. This article complements a previous study where we dealt with a moderately sized dataset with 500,000 instances and 22 variables. Here, we handle a dataset with &lt;b style="color: #6aa84f;"&gt;9,634,198 instances&lt;/b&gt; and &lt;b style="color: #6aa84f;"&gt;41 variables&lt;/b&gt;. We have already used this dataset in another tutorial. We showed that, on a 32-bit OS, we cannot perform decision tree induction on this kind of database without a swapping system such as the one implemented in SIPINA. We note that Tanagra can handle the dataset, but this is because it encodes the values of the categorical attributes with a byte. The memory occupation remains moderate.&lt;br /&gt;&lt;br /&gt;In this tutorial, I analyze the behavior of the 64-bit Knime and RapidMiner on this database. 
I use 64-bit OS and tools, but I have "only" 4 GB of available memory on my personal computer.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Keywords&lt;/b&gt;: very large dataset, decision tree, sampling, sipina, knime, rapidminer&lt;br /&gt;&lt;b&gt;Components&lt;/b&gt;: ID3&lt;br /&gt;&lt;b&gt;Tutorial&lt;/b&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Tree_Very_Large_Dataset.pdf" target="_blank"&gt;en_Tanagra_Tree_Very_Large_Dataset.pdf&lt;/a&gt;&lt;br /&gt;&lt;b&gt;Dataset&lt;/b&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/twice-kdd-cup-discretized-descriptors.zip" target="_blank"&gt;twice-kdd-cup-discretized-descriptors.zip&lt;/a&gt;&lt;br /&gt;&lt;b&gt;References&lt;/b&gt;:&lt;br /&gt;Tanagra, "&lt;a href="http://data-mining-tutorials.blogspot.com/2010/01/dealing-with-very-large-dataset-in.html"&gt;Dealing with very large dataset in Sipina&lt;/a&gt;".&lt;br /&gt;Tanagra, "&lt;a href="http://data-mining-tutorials.blogspot.com/2011/10/decision-tree-and-large-dataset-follow.html"&gt;Decision tree and large dataset (continuation)&lt;/a&gt;".&lt;br /&gt;Tanagra, "&lt;a href="http://data-mining-tutorials.blogspot.com/2008/11/decision-tree-and-large-dataset.html"&gt;Decision tree and large dataset&lt;/a&gt;".&lt;br /&gt;Tanagra, "L&lt;a href="http://data-mining-tutorials.blogspot.com/2009/10/local-sampling-approach-for-decision.html"&gt;ocal sampling for decision tree learning&lt;/a&gt;".&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-446923616892985981?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/446923616892985981'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/446923616892985981'/><link rel='alternate' type='text/html' 
href='http://data-mining-tutorials.blogspot.com/2011/12/dealing-with-very-large-dataset.html' title='Dealing with very large dataset (continuation)'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-6587924256753614730</id><published>2011-10-29T08:11:00.000-07:00</published><updated>2012-01-09T22:10:13.809-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><title type='text'>Decision tree and large dataset (continuation)</title><content type='html'>One of the exciting aspects of computing is that things change very quickly. Machines are ever more efficient, operating systems improve, and so does the software. Since writing an earlier tutorial about &lt;a href="http://data-mining-tutorials.blogspot.com/2008/11/decision-tree-and-large-dataset.html"&gt;the induction of decision tree on a large dataset&lt;/a&gt;, I have acquired a new computer and I now use a 64-bit OS (Windows 7). Some of the tools studied offer a 64-bit version (Knime, RapidMiner, R). I wonder how the various tools behave in this new context. To find out, I repeated the same experiment. &lt;br /&gt;&lt;br /&gt;We note that a more efficient computer improves the computation time (by about 20%). The specific gain for a 64-bit version is relatively low, but it is real (about 10%). Some tools have also clearly improved their implementation of decision tree induction (Knime, RapidMiner). 
On the other hand, we observe that the memory occupation remains stable for most of the tools in the new context.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Keywords&lt;/b&gt;&lt;b&gt;:&lt;/b&gt; c4.5, decision tree, large dataset, wave dataset, knime 2.4.2, orange 2.0b, r 2.13.2, rapidminer 5.1.011, sipina 3.7, tanagra 1.4.41, weka 3.7.4, windows 7 - 64 bits&lt;br /&gt;&lt;b&gt;Components:&lt;/b&gt; SUPERVISED LEARNING, C4.5&lt;br /&gt;&lt;b&gt;Tutorial:&lt;/b&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Perfs_Comp_Decision_Tree_Suite.pdf" target="_blank"&gt;en_Tanagra_Perfs_Comp_Decision_Tree_Suite.pdf&lt;/a&gt; &lt;br /&gt;&lt;b&gt;Screenshots :&lt;/b&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/copie_ecran_tree_on_large_dataset_continued.pdf" target="_blank"&gt;Experiment screenshots&lt;/a&gt;.&lt;br /&gt;&lt;b&gt;Dataset:&lt;/b&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/wave500k.zip" target="_blank"&gt;wave500k.zip&lt;/a&gt;&lt;b&gt;&amp;nbsp;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;References:&lt;/b&gt;&lt;br /&gt;Tanagra, "&lt;a href="http://data-mining-tutorials.blogspot.com/2008/11/decision-tree-and-large-dataset.html"&gt;Decision tree and large dataset&lt;/a&gt;". &lt;br /&gt;R. 
Quinlan, « C4.5 : Programs for Machine Learning », Morgan Kaufmann, 1993.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-6587924256753614730?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6587924256753614730'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6587924256753614730'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2011/10/decision-tree-and-large-dataset-follow.html' title='Decision tree and large dataset (continuation)'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-6057790099445204562</id><published>2011-09-24T23:45:00.000-07:00</published><updated>2011-09-24T23:49:35.372-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Association rules'/><title type='text'>A PRIORI PT updated</title><content type='html'>A PRIORI PT is a tool dedicated to the extraction of association rules. It is one of the few components of Tanagra based on an external library. We use Borgelt's "apriori.exe" program. Up to version 1.4.40 of Tanagra, we used version 4.31 of "apriori.exe". &lt;span style="color: rgb(153, 0, 0);"&gt;From Tanagra &lt;span style="font-weight: bold;"&gt;1.4.41&lt;/span&gt;&lt;/span&gt; onward, we use the latest update, 5.57 (2011/09/02). 
Even though the settings of the tool are slightly modified, we observe that the extracted rules and the way the results are read are identical.&lt;br /&gt;&lt;br /&gt;We revisit a former tutorial to describe the behavior of this component (&lt;a href="http://data-mining-tutorials.blogspot.com/2008/11/association-rule-learning-using-prefix.html"&gt;Association Rule Learning using A PRIORI PT&lt;/a&gt;). Thus, we do not detail the construction of the diagram here. Above all, we try to highlight the improvement of the library, especially in computation time. We observe that this improvement is really impressive.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: association rule, large dataset&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: A priori PT&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_AprioriPT_Updated.pdf" target="_blank"&gt;en_Tanagra_AprioriPT_Updated.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/assoc_census.zip" target="_blank"&gt;assoc_census.zip&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Reference&lt;/strong&gt;: C. 
Borgelt, "&lt;a href="http://www.borgelt.net/apriori.html" target="_blank"&gt;A priori - Association Rule Induction / Frequent Item Set Mining&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-6057790099445204562?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6057790099445204562'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6057790099445204562'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2011/09/priori-pt-updated.html' title='A PRIORI PT updated'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-6024904027606357220</id><published>2011-09-22T00:12:00.000-07:00</published><updated>2011-09-22T00:17:03.841-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Association rules'/><category scheme='http://www.blogger.com/atom/ns#' term='Tanagra'/><title type='text'>Tanagra - Version 1.4.41</title><content type='html'>&lt;span style="color: rgb(51, 204, 0);"&gt;A PRIORI PT&lt;/span&gt;. This component generates association rules. It is based on the Borgelt's  &lt;a href="http://www.borgelt.net/apriori.html" target="_blank"&gt;apriori.exe&lt;/a&gt; program which has been recently updated (2011/09/02 - 5.57 version). The improvement of this new version, in terms of calculation time, is impressive.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 204, 0);"&gt;FREQUENT ITEMSETS&lt;/span&gt;. 
Also based on Borgelt's apriori.exe program (version 5.57), this component generates frequent (or closed, maximal, generator) itemsets.&lt;br /&gt;&lt;br /&gt;Some tutorials are coming soon to describe the use of these new tools.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Download page&lt;/span&gt; : &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/en/contenu_telechargement_logiciel_tanagra.html" target="_blank"&gt;setup&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-6024904027606357220?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6024904027606357220'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6024904027606357220'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2011/09/tanagra-version-1441.html' title='Tanagra - Version 1.4.41'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-7210427184341805629</id><published>2011-09-20T07:54:00.000-07:00</published><updated>2011-09-20T07:59:18.813-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>New GUI for RapidMiner 5.0</title><content type='html'>RapidMiner is a very popular data mining tool. It is one of the tools most used by data miners according to the annual Kdnuggets polls (2011, 2010, 2009, 2008, 2007). 
There are two versions. We describe here the Community Edition, which is freely downloadable from the publisher's website.&lt;br /&gt;&lt;br /&gt;RapidMiner 5.0 has a new graphical user interface which is very similar to that of Knime. The organization of the workspace is the same. The sequence of data processing steps (using operators) is described with a diagram called a "process" in the RapidMiner documentation. In fact, version 5.0 adopts the presentation used by the vast majority of data mining software. Some features are shared with many tools, among others: the connection to the R software; the meta-nodes which implement a loop or a standard succession of operations; the description of the methods underlying the operators, which is permanently displayed in the right part of the main window.&lt;br /&gt;&lt;br /&gt;RapidMiner 5.0 has evolved substantially compared with previous versions (e.g. version 4.6, described in &lt;a href="http://data-mining-tutorials.blogspot.com/2010/04/wrapper-for-feature-selection.html"&gt;one of our tutorials&lt;/a&gt;). I thought it was appropriate to study it in detail, evaluating its behavior in the context of a standard data mining analysis. We want to implement the following process: (1) creating a decision tree from a labeled dataset; (2) exporting the model (the classification tree) into an external file (PMML format) for later deployment; (3) assessing the model performance using a cross-validation resampling scheme; (4) applying the model to a set of unlabeled instances and exporting the results, i.e. the values of the descriptors and the assigned class, into a CSV file. These are standard data mining tasks. We have described them in many tutorials. We want to check whether they are easy to implement with this new version of RapidMiner. Indeed, with the previous version, defining some sequences of operations was complicated. 
Implementing a &lt;a href="http://data-mining-tutorials.blogspot.com/2008/11/decision-tree-and-cross-validation.html"&gt;cross-validation&lt;/a&gt; for instance was not really intuitive.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Keywords&lt;/span&gt;: rapidminer, knime, cross-validation, decision tree, classification tree, deployment&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tutorial&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_RapidMiner_5.pdf" target="_blank"&gt;en_Tanagra_RapidMiner_5.pdf&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Dataset&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/adult_rapidminer.zip" target="_blank"&gt;adult_rapidminer.zip&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;References&lt;/span&gt;:&lt;br /&gt;Rapid-I, "&lt;a href="http://rapid-i.com/content/view/181/190/lang,en/" target="_blank"&gt;RapidMiner&lt;/a&gt;"&lt;br /&gt;Knime, "&lt;a href="http://www.knime.org/knime-desktop" target="_blank"&gt;Knime Desktop&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-7210427184341805629?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/7210427184341805629'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/7210427184341805629'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2011/09/new-gui-for-rapidminer-50.html' title='New GUI for RapidMiner 5.0'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-6573437796943113794</id><published>2011-09-19T06:00:00.000-07:00</published><updated>2011-09-19T06:07:09.731-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='PLS Regression'/><category scheme='http://www.blogger.com/atom/ns#' term='Regression analysis'/><title type='text'>Regression model deployment</title><content type='html'>Model deployment is one of the main objectives of the data mining process. We want to apply a model learned on a training set to unseen cases, i.e. to any individual from the population. In the classification framework, the aim is to assign a class value to each instance from its description [e.g. &lt;a href="http://data-mining-tutorials.blogspot.com/2008/11/apply-classifier-on-new-dataset.html"&gt;Apply a classifier on a new dataset (Deployment)&lt;/a&gt;]. In the clustering framework, we try to detect the group which is as similar as possible to the instance according to its characteristics (e.g. &lt;a href="http://data-mining-tutorials.blogspot.com/2008/12/k-means-classification-of-new-instance.html"&gt;K-Means - Classification of a new instance&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;We are concerned with the regression framework here. The aim is to predict the values of the dependent variable for unseen instances (or unlabeled instances) from the observed values of the independent variables. The process is rather basic if we handle a linear regression model: we apply the computed parameters to the unseen instances. But it becomes difficult when we want to handle more complex models such as support vector regression with nonlinear kernels, or models built from a combination of techniques (e.g. regression from the factors of a principal component analysis). 
In this context, it is essential that the deployment process is handled directly by the data mining tool.&lt;br /&gt;&lt;br /&gt;With Tanagra, it is possible to easily deploy regression models, even when they are the result of a combination of techniques. We simply have to prepare the data file in a particular way. In this tutorial, we describe how to organize the data file in order to deploy various models in a unified framework: a linear regression model, a PLS regression model, a support vector regression model with an RBF (radial basis function) kernel, a regression tree model, and a regression model from the factors of a principal component analysis. Then, we export the results (the predicted values for the dependent variable) into a new data file. Last, we check whether the predicted values are similar across the various models.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: model deployment, linear regression,  pls regression, support vector regression, SVR, regression tree,  cart, principal component analysis, pca, regression of factor scores&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: MULTIPLE LINEAR REGRESSION, PLS REGRESSION, PLS SELECTION, C-RT  REGRESSION TREE, EPSILON SVR, PRINCIPAL COMPONENT ANALYSIS, RECOVER  EXAMPLES, EXPORT DATASET, LINEAR CORRELATION&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Multiple_Regression_Deployment.pdf" target="_blank"&gt;en_Tanagra_Multiple_Regression_Deployment.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/housing.xls" target="_blank"&gt;housing.xls&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt; :&lt;br /&gt;R. 
Rakotomalala, &lt;a href="http://tutoriels-data-mining.blogspot.com/2011/01/regression-lineaire-multiple-diaporama.html"&gt;Régression linéaire multiple - Diaporama&lt;/a&gt; (in French)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-6573437796943113794?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6573437796943113794'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6573437796943113794'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2011/09/regression-model-deployment.html' title='Regression model deployment'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-7176875015659908960</id><published>2011-08-26T23:13:00.000-07:00</published><updated>2011-08-26T23:21:12.227-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>Data Mining with R - The Rattle Package</title><content type='html'>R (&lt;a href="http://www.r-project.org/" target="_blank"&gt;http://www.r-project.org/&lt;/a&gt;) is one of the most exciting free data mining software projects of these last years. Its popularity is absolutely justified (see Kdnuggets Polls - Data Mining/ Analytic Tools Used - &lt;a href="http://www.kdnuggets.com/polls/2011/tools-analytics-data-mining.html" target="_blank"&gt;2011&lt;/a&gt;). 
Among the reasons which explain this success, two characteristics stand out: (1) we can extend the features of the tool almost indefinitely with packages; (2) we have a programming language which allows us to easily perform sequences of complex operations.&lt;br /&gt;&lt;br /&gt;But this second property can also be a drawback. Indeed, some users do not want to learn a new programming language before being able to carry out projects. For this reason, tools which allow the user to define the sequence of commands with diagrams (such as Tanagra, Knime, RapidMiner, etc.) remain a valuable alternative for data miners.&lt;br /&gt;&lt;br /&gt;I&lt;span style="color: rgb(0, 153, 0);"&gt;n this tutorial, we present the "Rattle" package, which allows data miners to use R without needing to know the associated programming language&lt;/span&gt;. All the operations are performed with simple clicks, as in any menu-driven software. But, in addition, all the commands are stored. We can save them in a file. Then, in a new working session, we can easily repeat all the operations. Thus, we obtain one of the important properties that menu-driven tools lack.&lt;br /&gt;&lt;br /&gt;To describe the use of the rattle package, we perform an analysis similar to the one suggested by rattle's author in his presentation paper (G.J. Williams, "Rattle: A Data Mining GUI for R", in The R Journal, volume 1/2, pages 45-55, December 2009). We perform the following steps: loading the data file; partitioning the instances into learning and test samples; specifying the types of the variables (target or input); computing some descriptive statistics; learning the predictive models from the learning sample; assessing the models on the test sample (confusion matrix, error rate, some curves).
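&lt;br /&gt;&lt;br /&gt;To make the sequence of steps above concrete, here is a minimal sketch written in Python with the standard library only. It is not the R code generated by Rattle and it does not come from the tutorial: the synthetic two-class data, the 70/30 split, the nearest-centroid classifier and all function names are assumptions chosen for illustration.&lt;pre&gt;
```python
import random
from collections import Counter

def make_synthetic_data(n, seed):
    # Hypothetical two-class data with two numeric descriptors per instance.
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = "presence" if i % 2 == 0 else "absence"
        center = 3.0 if label == "presence" else 0.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows

def partition(rows, train_fraction, seed):
    # Shuffle a copy, then cut it into learning and test samples.
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def learn_centroids(train):
    # Nearest-centroid model: the mean descriptor vector of each class.
    sums, counts = {}, Counter()
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] += 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(model, x):
    # Assign the class whose centroid is closest (squared Euclidean distance).
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda y: sq_dist(model[y]))

def assess(model, test):
    # Confusion matrix as counts of (observed, predicted) pairs, plus error rate.
    confusion = Counter((y, predict(model, x)) for x, y in test)
    errors = sum(n for (obs, pred), n in confusion.items() if obs != pred)
    return confusion, errors / len(test)

data = make_synthetic_data(200, seed=1)
train, test = partition(data, 0.7, seed=2)
model = learn_centroids(train)
confusion, error_rate = assess(model, test)
print(dict(confusion), round(error_rate, 3))
```
&lt;/pre&gt;The model is learned on the learning sample only and assessed on the test sample only, which is the point of the partitioning step. Any other classifier could replace the nearest-centroid rule without changing the rest of the sketch.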
&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: R software, R project, rpart, random forest, glm, decision tree, classification tree, logistic regression&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tutorial&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Rattle_Package_for_R.pdf" target="_blank"&gt;en_Tanagra_Rattle_Package_for_R.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/heart_for_rattle.txt" target="_blank"&gt;heart_for_rattle.txt&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;Togaware, "&lt;a href="http://rattle.togaware.com/" target="_blank"&gt;Rattle&lt;/a&gt;"&lt;br /&gt;CRAN, "&lt;a href="http://cran.r-project.org/web/packages/rattle/index.html" target="_blank"&gt;Package rattle - Graphical user interface for data mining in R&lt;/a&gt;"&lt;br /&gt;G.J. Williams, "&lt;a href="http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf" target="_blank"&gt;Rattle: A Data Mining GUI for R&lt;/a&gt;", in The R Journal, Vol. 1/2, pages 45--55, december 2009.
&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-7176875015659908960?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/7176875015659908960'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/7176875015659908960'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2011/08/data-mining-with-r-rattle-package.html' title='Data Mining with R - The Rattle Package'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-6356475213443561375</id><published>2011-08-22T08:57:00.000-07:00</published><updated>2011-08-22T20:20:51.523-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>Predictive model deployment with R (filehash)</title><content type='html'>Model deployment is the last task of the data mining steps. It corresponds to several aspects e.g. generating a report about the data exploration process, highlighting the useful results; applying models within an organization's decision making process; etc .&lt;br /&gt;&lt;br /&gt;In this tutorial, we look at the context of predictive data mining. 
We are concerned with: the construction of the model from a labeled dataset; the storage of the model; the distribution of the model, without the dataset used for its construction; the application of the model to new instances in order to assign them a class label from their description (the values of the descriptors).&lt;br /&gt;&lt;br /&gt;We describe the &lt;span style="color: rgb(0, 153, 0); font-weight: bold;"&gt;filehash&lt;/span&gt; package for R, which makes it easy to deploy a model. The main advantage of this solution is that R runs under various operating systems. Thus, we can create a model with R under Windows and apply the model in another environment, for instance with R under Linux. The solution can be easily generalized on a large scale because it is possible to launch R in batch mode. In the future, updating the system will only require replacing the model file.&lt;br /&gt;&lt;br /&gt;We write three R programs to distinguish the steps of the deployment process. The first one constructs a model from the dataset and stores it in a binary file (filehash format). The second one loads the model in another R session and uses it to label new instances from a second data file. The predictions are stored in a data file (CSV file format). Last, the third program loads the predictions and another data file containing the observed labels for these instances, and calculates the confusion matrix and the generalization error rate.&lt;br /&gt;&lt;br /&gt;We use various predictive models in order to check the flexibility of the solution. We tried the following ones: decision tree (rpart); logistic regression (glm); linear discriminant analysis (lda); linear discriminant analysis from the factors of a principal component analysis (lda + pca). This last one allowed us to check whether the system remains operational when we manipulate a combination of models.
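&lt;br /&gt;&lt;br /&gt;The three-program organization described above can be illustrated with a short sketch. It is written in Python, not R, and it uses the standard pickle module as a stand-in for filehash, so it only mirrors the structure of the process. The toy data, the file names, the function names and the nearest-centroid model are all assumptions made for illustration; none of them come from the tutorial.&lt;pre&gt;
```python
import csv
import os
import pickle
import tempfile
from collections import Counter

def build_and_store(train, model_path):
    # Program 1: learn a nearest-centroid model and store it in a binary file.
    sums, counts = {}, Counter()
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] += 1
    model = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

def load_and_predict(model_path, new_instances, out_csv):
    # Program 2: reload the model (the training data is not needed) and
    # write the descriptors plus the assigned class to a CSV file.
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["x1", "x2", "predicted"])
        for x in new_instances:
            label = min(
                model,
                key=lambda y: sum((a - b) ** 2 for a, b in zip(x, model[y])),
            )
            writer.writerow(list(x) + [label])

def evaluate(pred_csv, observed):
    # Program 3: compare the stored predictions with the observed labels.
    with open(pred_csv, newline="") as f:
        predicted = [row["predicted"] for row in csv.DictReader(f)]
    confusion = Counter(zip(observed, predicted))
    errors = sum(n for (obs, pred), n in confusion.items() if obs != pred)
    return confusion, errors / len(observed)

# Tiny made-up data, for illustration only.
train = [([0.0, 0.1], "negative"), ([0.2, 0.0], "negative"),
         ([3.0, 2.9], "positive"), ([2.8, 3.1], "positive")]
new_instances = [[0.1, 0.2], [3.1, 3.0], [2.9, 2.7], [1.0, 1.0]]
observed = ["negative", "positive", "positive", "positive"]

workdir = tempfile.mkdtemp()
model_path = os.path.join(workdir, "model.bin")
pred_path = os.path.join(workdir, "predictions.csv")
build_and_store(train, model_path)
load_and_predict(model_path, new_instances, pred_path)
confusion, error_rate = evaluate(pred_path, observed)
print(dict(confusion), error_rate)
```
&lt;/pre&gt;Only the model file travels from the first step to the second, which is the property the deployment process relies on. A pickle file should only be loaded from a trusted source, so a real deployment would need to control where the model file comes from.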
&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: R software, filehash package, deployment, predictive model,  rpart, lda, pca, glm, decision tree, linear discriminant analysis,  logistic regression, principal component analysis, linear discriminant analysis on latent variables&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tutorial&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Deploying_Predictive_Models_with_R.pdf" target="_blank"&gt;en_Tanagra_Deploying_Predictive_Models_with_R.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/pima-model-deployment.zip" target="_blank"&gt;pima-model-deployment.zip&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;R  package, "&lt;a href="http://cran.r-project.org/web/packages/filehash/index.html" target="_blank"&gt;Filehash : Simple key-value database&lt;/a&gt;"&lt;br /&gt;Kdnuggets, "&lt;a href="http://www.kdnuggets.com/polls/2009/deployment-data-mining-models.htm" target="_blank"&gt;Data mining deployment Poll&lt;/a&gt;"&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-6356475213443561375?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6356475213443561375'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6356475213443561375'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2011/08/predictive-model-deployment-with-r.html' title='Predictive model deployment with R (filehash)'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-6733979158907687783</id><published>2011-08-18T04:41:00.000-07:00</published><updated>2011-08-18T04:46:44.768-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Regression analysis'/><title type='text'>REGRESS into the SIPINA package</title><content type='html'>Few people know it, but several tools are installed when we launch the SETUP file of SIPINA (setup_stat_package.exe). One of them is REGRESS, which is dedicated to multiple linear regression.&lt;br /&gt;&lt;br /&gt;Even if a multiple linear regression procedure is incorporated into Tanagra, REGRESS can be useful, essentially because it is very easy to use while remaining consistent with a degree course in Econometrics. As such, it may be useful for anyone wishing to learn about regression without spending too much time learning new software.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Keywords&lt;/span&gt;: regress, econometrics, multiple linear regression, outliers, influential points, normality tests, residuals, Jarque-Bera test, normal probability plot, sipina.xla, add-in&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tutorial&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/doc/en_sipina_regress.pdf" target="_blank"&gt;en_sipina_regress.pdf&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Dataset&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/ventes-regression.xls" target="_blank"&gt;ventes-regression.xls&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;References&lt;/span&gt;:&lt;br /&gt;R. Rakotomalala, "&lt;a href="http://tutoriels-data-mining.blogspot.com/2011/05/regression-lineaire-simple-et-multiple.html" target="_blank"&gt;Econométrie - Régression Linéaire Simple et Multiple&lt;/a&gt;".
&lt;br /&gt;D. Garson, "&lt;a href="http://faculty.chass.ncsu.edu/garson/PA765/regress.htm" target="_blank"&gt;Multiple regression&lt;/a&gt;".&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-6733979158907687783?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6733979158907687783'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6733979158907687783'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2011/08/regress-into-sipina-package.html' title='REGRESS into the SIPINA package'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-5538024845921083630</id><published>2011-08-14T01:09:00.000-07:00</published><updated>2011-08-14T08:08:50.703-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='PLS Regression'/><title type='text'>PLS Regression - Software comparison</title><content type='html'>Comparing the behavior of tools is always a good way to improve them.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0);"&gt;To check and validate the implementation of methods&lt;/span&gt;. The validation of the implemented algorithms is an essential point for data mining tools. Even if two programmers use the same references (books, articles), the programming choice can modify the behavior of the approach (behaviors according to the interpretation of the convergence conditions for instance). 
The analysis of the source code is a possible solution. But, while it is often available for free software, this is not the case for commercial tools. Thus, the only way to check them is to compare the results provided by the tools on a benchmark dataset. If there are divergences, we must explain them by analyzing the formulas used.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0);"&gt;To improve the presentation of results&lt;/span&gt;. There are certain standards to observe in the production of reports, a consensus initiated by reference books and/or leading tools in the field. Some ratios should be presented in a certain way. Users need reference points.&lt;br /&gt;&lt;br /&gt;Our programming of the PLS approach is based on the Tenenhaus book (1998), which itself refers to the SIMCA-P tool. Using a limited version of this software (version 11), we have checked the results provided by Tanagra on various datasets. We show here the results of the study on the CARS dataset. We extend the comparison to other data mining tools.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords:&lt;/strong&gt; pls regression, software comparison, simca-p, spad, sas, r software, pls package&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: PLSR, VIEW DATASET, CORRELATION SCATTERPLOT, SCATTERPLOT WITH LABEL&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_PLSR_Software_Comparison.pdf" target="_blank"&gt;en_Tanagra_PLSR_Software_Comparison.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/cars_pls_regression.xls" target="_blank"&gt;cars_pls_regression.xls &lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt; :&lt;br /&gt;M. Tenenhaus, « La régression PLS – Théorie et pratique », Technip, 1998.&lt;br /&gt;D. 
Garson, « &lt;a href="http://www2.chass.ncsu.edu/garson/pa765/statnote.htm" target="_blank"&gt;Partial Least Squares Regression &lt;/a&gt;», from Statnotes: Topics in Multivariate Analysis.&lt;br /&gt;UMETRICS. &lt;a href="http://www.umetrics.com/default.asp/pagename/software_simcap/c/3" target="_blank"&gt;SIMCA-P&lt;/a&gt; for Multivariate Data Analysis.&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-5538024845921083630?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5538024845921083630'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5538024845921083630'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2011/08/pls-regression-software-comparison.html' title='PLS Regression - Software comparison'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-2318214853926374974</id><published>2011-08-05T23:04:00.000-07:00</published><updated>2011-08-05T23:12:34.643-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>The CART method under Tanagra and R (rpart)</title><content type='html'>CART (Breiman et al., 1984) is a very popular classification tree (also called decision tree) learning algorithm. Rightly so. 
CART incorporates all the ingredients of a well-controlled learning process: the post-pruning process handles the trade-off between bias and variance; the cost complexity mechanism "smooths" the exploration of the space of solutions; we can control the preference for simplicity with the standard error rule (SE-rule); etc. Thus, the data miner can adjust the settings according to the goal of the study and the data characteristics.&lt;br /&gt;&lt;br /&gt;Breiman's algorithm is provided under different names in free data mining tools. Tanagra uses the "C-RT" name. R, through a specific package, provides the "rpart" function.&lt;br /&gt;&lt;br /&gt;In this tutorial, we describe these implementations of the CART approach according to the original book (Breiman et al., 1984; chapters 3, 10 and 11). The main difference between them is the implementation of the post-pruning process. Tanagra uses a specific sample called the "pruning set" (section 11.4), whereas rpart is based on the cross-validation principle (section 11.5).&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords:&lt;/strong&gt; decision tree, classification tree, recursive partitioning, cart, R software, rpart package&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: DISCRETE SELECT EXAMPLES, C-RT, SUPERVISED LEARNING, TEST&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_R_CART_algorithm.pdf" target="_blank"&gt;en_Tanagra_R_CART_algorithm.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/wave5300.xls" target="_blank"&gt;wave5300.xls&lt;br /&gt;&lt;/a&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;L. Breiman, J. Friedman, R. Olshen, C. 
Stone, Classification and Regression Trees, Chapman &amp;amp; Hall, 1984.&lt;br /&gt;"The R project for Statistical Computing" - &lt;a href="http://www.r-project.org/" target="_blank"&gt;http://www.r-project.org/&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-2318214853926374974?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/2318214853926374974'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/2318214853926374974'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2011/08/cart-method-under-tanagra-and-r-rpart.html' title='The CART method under Tanagra and R (rpart)'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-1853924429082354461</id><published>2011-07-23T23:10:00.000-07:00</published><updated>2011-07-23T23:20:09.465-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='PLS Regression'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>PLS Discriminant Analysis - A comparative study</title><content type='html'>PLS regression is a regression technique usually designed to predict the values taken by a group of Y variables (target variables, dependent variables) from a set of variables X (descriptors, independent variables). Initially defined for the prediction of continuous target variable, the PLS regression can be adapted to the prediction of one discrete variable - i.e. adapted to the supervised learning framework - in different ways . 
The approach is called "PLS Discriminant Analysis" in this context. It brings the usual valuable qualities of PLS regression into this new framework: the ability to process a representation space with very high dimensionality and a large number of noisy and/or redundant descriptors.&lt;br /&gt;&lt;br /&gt;This tutorial is the continuation of a previous paper dedicated to &lt;a href="http://data-mining-tutorials.blogspot.com/2008/11/pls-regression-for-classification-task.html"&gt;the presentation of some variants of the PLS-DA&lt;/a&gt;. We describe the behavior of one of them (PLS-LDA - PLS Linear Discriminant Analysis) on a learning set where the number of descriptors is moderately high (278 descriptors) in relation to the number of instances (232 instances). Even if the number of descriptors is not really very high, we note in our experiment a valuable characteristic of the PLS approach: we can control the variance of the classifier by adjusting the number of latent variables.&lt;br /&gt;&lt;br /&gt;To assess this idea, we compare the behavior of PLS-LDA with state-of-the-art supervised learning methods such as K-nearest neighbors, SVM (Support Vector Machine from the LIBSVM library), Breiman's Random Forest approach, or Fisher's Linear Discriminant Analysis.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords:&lt;/strong&gt; pls regression, linear discriminant analysis, supervised learning, support vector machine, SVM, random  forest, nearest  neighbor&lt;br /&gt;&lt;strong&gt;Components:&lt;/strong&gt; K-NN, PLS-LDA, BAGGING, RND TREE, C-SVC, TEST, DISCRETE SELECT EXAMPLES, REMOVE CONSTANT&lt;br /&gt;&lt;strong&gt;Tutorial:&lt;/strong&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_PLS_DA_Comparaison.pdf" target="_blank"&gt;en_Tanagra_PLS_DA_Comparaison.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset:&lt;/strong&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/arrhytmia.bdm" 
target="_blank"&gt;arrhytmia.bdm&lt;br /&gt;&lt;/a&gt;&lt;strong&gt;References :&lt;/strong&gt;&lt;br /&gt;S.  Chevallier, D. Bertrand, A. Kohler, P. Courcoux, « Application of  PLS-DA in multivariate image analysis », in J. Chemometrics, 20 :  221-229, 2006.&lt;br /&gt;Garson, « Partial Least Squares Regression (PLS) », &lt;a href="http://www2.chass.ncsu.edu/garson/PA765/pls.htm" target="_blank"&gt;http://www2.chass.ncsu.edu/garson/PA765/pls.htm&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-1853924429082354461?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1853924429082354461'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1853924429082354461'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2011/07/pls-discriminant-analysis-comparative.html' title='PLS Discriminant Analysis - A comparative study'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-3230155024695403642</id><published>2011-07-17T10:53:00.001-07:00</published><updated>2011-07-19T06:34:02.404-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data file handling'/><title type='text'>Tanagra add-on for OpenOffice Calc 3.3</title><content type='html'>Tanagra add-on for &lt;span style="color: rgb(51, 204, 0); font-weight: bold;"&gt;OpenOffice 3.3&lt;/span&gt; and &lt;span style="color: rgb(51, 204, 0); font-weight: bold;"&gt;LibreOffice 3.4&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;The connection with spreadsheet applications is 
certainly a factor in the success of Tanagra. It is easy to manipulate a dataset in OpenOffice Calc (up to version 3.2) and send it to Tanagra for further analysis using the TanagraLibrary.zip extension.&lt;br /&gt;&lt;br /&gt;Recently, users have reported to me that the mechanism did not work with recent versions of OpenOffice (version 3.3) and LibreOffice (version 3.4). I realized that, rather than patching it, it was more appropriate to develop a new module that meets the standard for managing extensions in these tools. The new library "&lt;span style="color: rgb(51, 204, 0); font-weight: bold;"&gt;TanagraModule.oxt&lt;/span&gt;" is now incorporated into the distribution.&lt;br /&gt;&lt;br /&gt;This tutorial describes how to install and use this add-on under OpenOffice Calc 3.3. The adaptation to LibreOffice 3.4 is very easy.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Keywords&lt;/span&gt; : data importation, spreadsheet application, openoffice, libreoffice, add-in, add-on, excel&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Component &lt;/span&gt;: View Dataset&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tutorial&lt;/span&gt; : &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Addon_OpenOffice_LibreOffice.pdf" target="_blank"&gt;en_Tanagra_Addon_OpenOffice_LibreOffice.pdf&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Dataset&lt;/span&gt; : &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/breast.ods" target="_blank"&gt;breast.ods&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;References&lt;/span&gt; :&lt;br /&gt;Tanagra Tutorials, "&lt;a href="http://data-mining-tutorials.blogspot.com/2008/10/ooocalc-file-handling-using-add-in.html"&gt;OOo Calc file handling using an add-in&lt;/a&gt;"&lt;br /&gt;Tanagra Tutorials, "&lt;a href="http://data-mining-tutorials.blogspot.com/2009/04/launching-tanagra-from-oocalc-under.html"&gt;Launching Tanagra from OOo Calc under 
Linux&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-3230155024695403642?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3230155024695403642'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3230155024695403642'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2011/07/tanagra-add-on-for-openoffice-calc-33.html' title='Tanagra add-on for OpenOffice Calc 3.3'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-3324900599297262143</id><published>2011-07-05T06:04:00.000-07:00</published><updated>2011-07-05T07:09:16.432-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Tanagra'/><category scheme='http://www.blogger.com/atom/ns#' term='Data file handling'/><title type='text'>Tanagra - Version 1.4.40</title><content type='html'>Few improvements for this new version.&lt;br /&gt;&lt;br /&gt;A new addon for the connection between Tanagra and the recent version of OpenOffice Calc spreadsheet has been created. The old one did not work for recent versions - &lt;span style="color: rgb(51, 204, 0);"&gt;OpenOffice 3.3&lt;/span&gt; and &lt;span style="color: rgb(51, 204, 0);"&gt;LibreOffice 3.4&lt;/span&gt;. 
During the installation process, a separate library is added ("&lt;span style="color: rgb(51, 204, 0);"&gt;TanagraModule.oxt&lt;/span&gt;") so as not to interfere with the old one, which still works for &lt;a href="http://data-mining-tutorials.blogspot.com/2009/04/launching-tanagra-from-oocalc-under.html"&gt;previous versions of Open Office&lt;/a&gt; (3.2 and earlier). A tutorial describing its installation and use will be put online soon. I take this opportunity to highlight again how convenient a close connection between a spreadsheet and a specialized data mining tool is. The annual poll organized by the &lt;a href="http://www.kdnuggets.com/" target="_blank"&gt;kdnuggets.com&lt;/a&gt; website shows the interest in this connection (&lt;a href="http://www.kdnuggets.com/polls/2011/tools-analytics-data-mining.html" target="_blank"&gt;2011&lt;/a&gt;, &lt;a href="http://www.kdnuggets.com/polls/2010/data-mining-analytics-tools.html" target="_blank"&gt;2010&lt;/a&gt;, &lt;a href="http://www.kdnuggets.com/polls/2009/data-mining-tools-used.htm" target="_blank"&gt;2009&lt;/a&gt;,...). We note that there is a similar addon for the R software (&lt;a href="http://wiki.services.openoffice.org/wiki/R_and_Calc" target="_blank"&gt;R4Calc&lt;/a&gt;). This change was suggested by Jérémy Roos (&lt;a href="http://www.openoffice.org/" target="_blank"&gt;OpenOffice&lt;/a&gt;) and Franck Thomas (&lt;a href="http://www.libreoffice.org/" target="_blank"&gt;LibreOffice&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;Non-standardized PCA is now available. It can be obtained by unchecking the data standardization option in the Principal Component Analysis component. Change suggested by Elvire Antanjan.&lt;br /&gt;&lt;br /&gt;Simultaneous regression was introduced. It is very similar to the method programmed into LazStats, which is unfortunately no longer freely available. 
The approach is described in a free online booklet "&lt;a href="http://eric.univ-lyon2.fr/%7Ericco/cours/cours/La_regression_dans_la_pratique.pdf" target="_blank"&gt;Practice of linear regression analysis&lt;/a&gt;" (in French) (section 3.6).&lt;br /&gt;&lt;br /&gt;Color codes based on the p-value have been introduced for the Linear Correlation component. Change suggested by Samuel KL.&lt;br /&gt;&lt;br /&gt;Once again, thank you very much to all those who help me to improve this work by their comments or suggestions.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Download page&lt;/span&gt; : &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/en/contenu_telechargement_logiciel_tanagra.html" target="_blank"&gt;setup&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-3324900599297262143?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3324900599297262143'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3324900599297262143'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2011/07/tanagra-version-1440.html' title='Tanagra - Version 1.4.40'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-6899548697628617002</id><published>2011-05-26T05:57:00.000-07:00</published><updated>2011-07-05T06:04:07.650-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Tanagra'/><title type='text'>Tanagra - Version 1.4.39</title><content type='html'>Some minor corrections for the Tanagra 
1.4.39 version.&lt;br /&gt;&lt;br /&gt;For the &lt;span style="color: rgb(51, 204, 0);"&gt;PCA&lt;/span&gt; (principal component analysis) component, when we asked for all the factors, none were generated. Reported by Jérémy Roos.&lt;br /&gt;&lt;br /&gt;In the previous 1.4.38 version, the results of &lt;span style="color: rgb(51, 204, 0);"&gt;Multinomial Logistic Regression&lt;/span&gt; were not consistent with the tutorial on the website. The calculations were wrong. Reported by Nicole Jurado.&lt;br /&gt;&lt;br /&gt;It is now possible to obtain the scores from the &lt;span style="color: rgb(51, 204, 0);"&gt;PLS-DA&lt;/span&gt; component (Partial Least Squares Regression - Discriminant Analysis). Reported by Carlos Serrano.&lt;br /&gt;&lt;br /&gt;All these bugs are corrected in the 1.4.39 version. Once again, thank you very much to all those who help me to improve this work by their comments or suggestions.&lt;br /&gt;&lt;br /&gt;Download page : &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/en/contenu_telechargement_logiciel_tanagra.html" target="_blank"&gt;setup&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-6899548697628617002?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6899548697628617002'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6899548697628617002'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2011/05/tanagra-version-1439.html' title='Tanagra - Version 1.4.39'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-1419133677054019536</id><published>2011-04-05T00:45:00.000-07:00</published><updated>2011-04-05T00:53:38.047-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Association rules'/><title type='text'>Mining Association Rule from Transactions File</title><content type='html'>Association rule learning is a popular method for discovering interesting relations between variables in large databases. It is often used in the market basket analysis domain. But in fact, it can be applied in various areas where we want to discover associations between variables. The association is described by an "IF THEN" rule. The IF part is called the "antecedent" of the rule; the THEN part corresponds to the "consequent", e.g. IF onions AND potatoes THEN burger (&lt;a href="http://en.wikipedia.org/wiki/Association_rule_learning" target="_blank"&gt;http://en.wikipedia.org/wiki/Association_rule_learning&lt;/a&gt;), i.e. if a customer buys onions and potatoes then he also buys a burger.&lt;br /&gt;&lt;br /&gt;It is possible to find co-occurrences in the standard attribute-value tables that are handled by most data mining tools. In this context, the rows correspond to the baskets (transactions); the columns correspond to the list of all possible products (items); at the intersection of the row and the column, we have an indicator (true/false or 1/0) which indicates if the item belongs to the transaction. But this kind of representation is too naive. Only a few products are included in each basket, so each row of the table contains a few 1s and many 0s. The size of the data file is unnecessarily large. Therefore, another data representation, called the "transactions file", is often used to minimize the data file size. 
In this tutorial, we treat a special case of the transactions file. The principle is based on the enumeration of the items included in each transaction. But in our case, we have only two values for each row of the data file: the transaction identifier, and the item identifier. Thus, each transaction can be listed on several rows of the data file.&lt;br /&gt;&lt;br /&gt;This data representation is quite natural considering the problem we want to treat. It also has the advantage of being more compact since only the items really present in each transaction are enumerated. However, it appears that many tools cannot directly handle this kind of data representation. Curiously, we observe a distinction between professional tools and academic ones. The former can handle this kind of data file directly, without special data preparation. This is the case for &lt;span style="font-weight: bold; color: rgb(51, 204, 255);"&gt;SPAD 7.3&lt;/span&gt; and &lt;span style="color: rgb(51, 204, 255); font-weight: bold;"&gt;SAS Enterprise Miner 4.3&lt;/span&gt;, which we study in this tutorial. On the other hand, the academic tools need a data transformation prior to importing the dataset. We use a small program written in VBA (Visual Basic for Applications) under Excel to prepare the dataset. Thereafter, we perform the analysis with &lt;span style="color: rgb(51, 255, 51); font-weight: bold;"&gt;Tanagra 1.4.37&lt;/span&gt; and &lt;span style="color: rgb(51, 255, 51); font-weight: bold;"&gt;Knime 2.2.2 &lt;/span&gt;(&lt;span style="font-weight: bold; font-style: italic;"&gt;Note&lt;/span&gt;&lt;span style="font-style: italic;"&gt;: a reader told me that we can transform the dataset with Knime without using an external program. This is true. I will describe this approach in a separate section at the end of this tutorial&lt;/span&gt;).&lt;br /&gt;&lt;br /&gt;Note that we must respect the original specifications, i.e. 
focus only on rules indicating the simultaneous presence of items in transactions. We must not, because of a bad "presence - absence" coding scheme, generate rules outlining the simultaneous absence of some items. This may be interesting in some cases, but it is not the purpose of our analysis.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: association rules, a priori algorithm&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: A priori&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Assoc_Rule_Transactions.pdf" target="_blank"&gt;en_Tanagra_Assoc_Rule_Transactions.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/assoc_rule_transactions.zip" target="_blank"&gt;assoc_rule_transactions.zip&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;References&lt;/span&gt;:&lt;br /&gt;Tanagra Tutorials, "&lt;a href="http://data-mining-tutorials.blogspot.com/2008/11/association-rule-learning-from.html"&gt;Association rule learning from transaction dataset&lt;/a&gt;"&lt;br /&gt;P.N. Tan, M. Steinbach, V. 
Kumar, « &lt;a href="http://www-users.cs.umn.edu/%7Ekumar/dmbook/index.php" target="_blank"&gt;Introduction to Data Mining&lt;/a&gt; », Addison Wesley, 2006 ; chapter 6, « &lt;a href="http://www-users.cs.umn.edu/%7Ekumar/dmbook/ch6.pdf" target="_blank"&gt;Association analysis : Basic Concepts and Algorithms&lt;/a&gt; ».&lt;br /&gt;Wikipedia - "&lt;a href="http://en.wikipedia.org/wiki/Association_rule"&gt;Association rule learning&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-1419133677054019536?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1419133677054019536'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1419133677054019536'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2011/04/mining-association-rule-from.html' title='Mining Association Rule from Transactions File'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-6826794804495294855</id><published>2011-02-20T00:04:00.000-08:00</published><updated>2011-02-20T00:08:22.501-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Regression analysis'/><title type='text'>Multiple Regression - Reading the results</title><content type='html'>The aim of multiple regression is to predict the values of a continuous dependent variable Y from a set of continuous or binary independent variables (X1,..., Xp).&lt;br /&gt;&lt;br /&gt;In this tutorial, we want to 
model the relationship between the consumption of cars and their weight, engine size and horsepower. We describe the outputs of Tanagra by associating them with the formulas used. We highlight the importance of the unscaled covariance matrix of the estimated coefficients [(X'X)^(-1)] (&lt;span style="color: rgb(0, 153, 0);"&gt;Tanagra 1.4.38 and later&lt;/span&gt;). It is used for the subsequent analyses: individual significance of coefficients, simultaneous significance of several coefficients, testing linear combinations of coefficients, computation of the standard error for the prediction interval. These analyses are performed in an Excel spreadsheet.&lt;br /&gt;&lt;br /&gt;Thereafter, we perform the same analyses with the &lt;span style="color: rgb(0, 153, 0);"&gt;R software&lt;/span&gt;. We identify the objects provided by the lm(.) procedure that we can use in the same context.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: linear regression, multiple regression, R software, lm, summary.lm, testing significance, prediction interval&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: MULTIPLE LINEAR REGRESSION&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Multiple_Regression_Results.pdf" target="_blank"&gt;en_Tanagra_Multiple_Regression_Results.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/cars_consumption.zip" target="_blank"&gt;cars_consumption.zip&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt; :&lt;br /&gt;D. 
Garson, "&lt;a href="http://faculty.chass.ncsu.edu/garson/PA765/regress.htm" target="_blank"&gt;Multiple Regression&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-6826794804495294855?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6826794804495294855'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6826794804495294855'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2011/02/multiple-regression-reading-results.html' title='Multiple Regression - Reading the results'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-1474569535027730219</id><published>2011-02-04T06:49:00.001-08:00</published><updated>2011-02-04T06:51:37.528-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Tanagra'/><title type='text'>Tanagra - Version 1.4.38</title><content type='html'>Some minor corrections for the Tanagra 1.4.38 version.&lt;br /&gt;&lt;br /&gt;The color codes for the normality tests have been harmonized (&lt;span style="color: rgb(0, 153, 0);"&gt;Normality Test&lt;/span&gt;). In some configurations, the colors associated with p-values were not consistent, which could mislead users. This problem has been reported by Lawrence M. Garmendia.&lt;br /&gt;&lt;br /&gt;Following indications from Mr. 
Oanh Chau, I realized that the standardization of variables for the &lt;span style="color: rgb(0, 153, 0);"&gt;HAC&lt;/span&gt; (hierarchical agglomerative clustering) was based on the sample standard deviation. This is not an error in itself. But the sum of the level indices in the dendrogram was then not consistent with the TSS (total sum of squares). This is unwelcome. The difference is especially noticeable on small datasets; it disappears as the dataset size increases. The correction has been introduced. Now the BSS ratio is equal to 1 when we have the trivial partition i.e. one individual per group.&lt;br /&gt;&lt;br /&gt;Multiple linear regression (&lt;span style="color: rgb(0, 153, 0);"&gt;MULTIPLE LINEAR REGRESSION&lt;/span&gt;) displays the matrix (X'X) ^ (-1). It can be used to deduce the variance covariance matrix of the coefficients (by multiplying the matrix by the estimated variance of the error). It can also be used in the generalized tests for the model coefficients.&lt;br /&gt;&lt;br /&gt;Last, the outputs of the descriptive discriminant analysis (&lt;span style="color: rgb(0, 153, 0);"&gt;CANONICAL DISCRIMINANT ANALYSIS&lt;/span&gt;) were improved. 
The group centroids on the factorial axes are directly provided.&lt;br /&gt;&lt;br /&gt;Thank you very much to all those who help me to improve this work by their comments or suggestions.&lt;br /&gt;&lt;br /&gt;Download page: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/en/contenu_telechargement_logiciel_tanagra.html" target="_blank"&gt;setup&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-1474569535027730219?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1474569535027730219'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1474569535027730219'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2011/02/tanagra-version-1438.html' title='Tanagra - Version 1.4.38'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-3749773494061368837</id><published>2011-01-04T07:25:00.000-08:00</published><updated>2011-01-04T07:27:35.643-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Tanagra'/><title type='text'>Tanagra website statistics for 2010</title><content type='html'>The year 2010 ends, 2011 begins. I wish you all a very happy year 2011.&lt;br /&gt;&lt;br /&gt;Here is a short report on the website statistics for the past year. All sites (Tanagra, course materials, e-books, tutorials) have been visited 241,765 times this year, i.e. 662 visits per day. For comparison, we had 520 daily visits in 2009 and 349 in 2008.&lt;br /&gt;&lt;br /&gt;Who are you? 
The majority of visits come from France and the Maghreb (62%), followed by a large share from other French-speaking countries. Among non-francophone countries, we mainly observe the United States, India, the UK, Germany, Brazil,...&lt;br /&gt;&lt;br /&gt;Which pages are visited? The most successful pages are those that relate to documentation about Data Mining: course materials, tutorials, links to other documents available online, etc. This is hardly surprising. I myself spend more time writing booklets and tutorials and studying the behavior of different software, including Tanagra.&lt;br /&gt;&lt;br /&gt;Happy New Year 2011 to all.&lt;br /&gt;&lt;br /&gt;Ricco.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Slideshow&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Frequentation_2010.pdf" target="_blank"&gt;Website statistics for 2010&lt;br /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-3749773494061368837?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3749773494061368837'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3749773494061368837'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2011/01/tanagra-website-statistics-for-2010.html' title='Tanagra website statistics for 2010'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-4866690558847292074</id><published>2010-12-09T02:24:00.000-08:00</published><updated>2010-12-09T02:28:45.290-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>Creating reports with Tanagra</title><content type='html'>The ability to automatically create reports from the results of an analysis is a valuable functionality for Data Mining. But it is mostly a strength of professional tools. The programming of this kind of functionality is not really promoted in the academic domain. I do not think that I can publish a paper in a journal where I describe the ability of Tanagra to create attractive reports. This is why the output of academic tools, such as R or Weka, is mainly formatted text.&lt;br /&gt;&lt;br /&gt;Tanagra, which is an academic tool, also provides text outputs. The programming remains simple, as a glance at the source code shows. But, in order to make the presentation more attractive, it uses HTML to format the results. I take advantage of this special feature to generate reports without making a particular programming effort. Tanagra is one of the few academic tools able to produce reports that can easily be displayed in office automation software. For instance, the tables can be copied into Excel spreadsheets for further calculations. 
More generally, the results can be viewed in a browser, regardless of data mining software.&lt;br /&gt;&lt;br /&gt;These are the reporting features of Tanagra that we present in this tutorial.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Keywords&lt;/span&gt;: reporting, decision tree, c4.5, logistic regression, binary coding, roc curve, learning sample, test sample, forward, feature selection&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Components&lt;/span&gt;: GROUP CHARACTERIZATION, SAMPLING, C4.5, TEST, O_1_BINARIZE, FORWARD-LOGIT, BINARY LOGISTIC REGRESSION, SCORING, ROC CURVE&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tutorial&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Reporting.pdf" target="_blank"&gt;en_Tanagra_Reporting.pdf&lt;br /&gt;&lt;/a&gt;&lt;span style="font-weight: bold;"&gt;Dataset&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/heart_disease_male_for_reporting.xls" target="_blank"&gt;heart disease&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-4866690558847292074?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4866690558847292074'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4866690558847292074'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/12/creating-reports-with-tanagra.html' title='Creating reports with Tanagra'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-3378022511446387135</id><published>2010-11-24T07:58:00.000-08:00</published><updated>2010-11-24T08:03:51.912-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><title type='text'>Multithreading for decision tree induction</title><content type='html'>Nowadays, most modern personal computers (PCs) have multicore processors. The computer operates as if it had multiple processors. Software and data mining algorithms must be modified in order to benefit from this new feature.&lt;br /&gt;&lt;br /&gt;Currently, few free tools exploit this opportunity because it is impossible to define a generic approach that would be valid regardless of the learning method used. We must modify each existing learning algorithm. For a given technique, decomposing an algorithm into elementary tasks that can execute in parallel is a research field in itself. In a second step, we must adopt a programming technology which is easy to implement.&lt;br /&gt;&lt;br /&gt;In this tutorial, I propose a technology based on threads for the induction of decision trees. It is well suited to our context for various reasons. (1) It is easy to program with modern programming languages. (2) Threads can share information; they can also modify common objects. Efficient synchronization tools make it possible to avoid data corruption. (3) We can launch multiple threads on a single-core, single-processor system. It is not really advantageous, but at least the system does not crash. (4) On a multiprocessor or multi-core system, the threads will actually run at the same time, with each processor or core running a particular thread. 
However, because the threads must be synchronized, the computation time is not divided by the number of cores in this case.&lt;br /&gt;&lt;br /&gt;First, we briefly present the modification of the decision tree learning algorithm in order to benefit from the multithreading technology. Then, we show how to implement the approach with &lt;span style="color: rgb(153, 0, 0);"&gt;SIPINA&lt;/span&gt; (version &lt;span style="color: rgb(153, 0, 0);"&gt;3.5&lt;/span&gt; and later). We also show that multithreaded decision tree learners are available in various tools such as &lt;span style="color: rgb(0, 153, 0);"&gt;Knime 2.2.2&lt;/span&gt; or &lt;span style="color: rgb(0, 153, 0);"&gt;RapidMiner 5.0.011&lt;/span&gt;. Finally, we study the behavior of the multithreaded algorithms according to the dataset characteristics.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Keywords&lt;/span&gt;: multithreading, thread, threads, decision tree, chaid, sipina 3.5, knime 2.2.2, rapidminer 5.0.011&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tutorial&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_sipina_multithreading.pdf" target="_blank"&gt;en_sipina_multithreading.pdf&lt;/a&gt;&lt;span style="text-decoration: underline;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Dataset&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/covtype.arff.zip" target="_blank"&gt;covtype.arff.zip&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;References&lt;/span&gt; :&lt;br /&gt;Wikipedia, "&lt;a href="http://en.wikipedia.org/wiki/Decision_tree_learning"&gt;Decision tree learning&lt;/a&gt;"&lt;br /&gt;Wikipedia, "&lt;a href="http://en.wikipedia.org/wiki/Thread_%28computer_science%29"&gt;Thread (Computer science)&lt;/a&gt;"&lt;br /&gt;Aldinucci, Ruggieri, Torquati, " &lt;a href="http://www.di.unipi.it/%7Eruggieri/Papers/pkdd2010.pdf" target="_blank"&gt;Porting Decision Tree 
Algorithms to Multicore using FastFlow&lt;/a&gt; ", Pkdd-2010.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-3378022511446387135?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3378022511446387135'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3378022511446387135'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/11/multithreading-for-decision-tree.html' title='Multithreading for decision tree induction'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-3893972798363934517</id><published>2010-11-10T21:25:00.000-08:00</published><updated>2010-11-10T21:33:39.764-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>Naive bayes classifier for continuous predictors</title><content type='html'>The naive bayes classifier is a very popular approach even if it is (apparently) based on an unrealistic assumption: the distributions of the predictors are mutually independent conditionally on the values of the target attribute. The main reason for this popularity is that the method has proved to be as accurate as other well-known approaches such as linear discriminant analysis or logistic regression on the majority of real datasets.&lt;br /&gt;&lt;br /&gt;But an obstacle to the use of the naive bayes classifier remains when we deal with a real problem. 
It seems that we cannot provide an explicit model for its deployment. The proposed representation by the &lt;a href="http://www.dmg.org/v4-0-1/NaiveBayes.html" target="_blank"&gt;PMML standard&lt;/a&gt; for instance is particularly unattractive. The interpretation of the model, especially the detection of the influence of each descriptor on the prediction of the classes, seems impossible.&lt;br /&gt;&lt;br /&gt;This assertion is not entirely true. We have shown in a previous tutorial that we can extract an explicit model from the naive bayes classifier in the case of discrete predictors (see references). We obtain a linear combination of the binarized predictors. In this document, we show that the same mechanism can be implemented for continuous descriptors. We use the standard Gaussian assumption for the conditional distribution of the descriptors. Depending on whether we adopt the heteroscedastic or the homoscedastic assumption, we obtain a quadratic model or a linear model, respectively. The latter is especially interesting because we obtain a model that we can directly compare to the other linear classifiers (the sign and the values of the coefficients of the linear combination).&lt;br /&gt;&lt;br /&gt;This tutorial is organized as follows. In the next section, we describe the approach. In section 3, we show how to implement the method with &lt;span style="color: rgb(153, 0, 0);"&gt;Tanagra 1.4.37&lt;/span&gt; (and later). We compare the results to those of the other linear methods. In section 4, we compare the results provided by various data mining tools. We note that none of them proposes an explicit model that would be easy to deploy. They give only the estimated parameters of the conditional Gaussian distribution (mean and standard deviation). Last, in section 5, we show the advantage of the naive bayes classifier over the other linear methods when we handle a large dataset (the "mutant" dataset - 16,592 instances and 5,408 predictors). 
The computation time and the memory occupancy are clearly advantageous.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Keywords&lt;/span&gt;: naive bayes classifier, &lt;span style="color: rgb(0, 153, 0);"&gt;rapidminer 5.0.10&lt;/span&gt;, &lt;span style="color: rgb(0, 153, 0);"&gt;weka  3.7.2&lt;/span&gt;, &lt;span style="color: rgb(0, 153, 0);"&gt;knime 2.2.2&lt;/span&gt;, &lt;span style="color: rgb(0, 153, 0);"&gt;R&lt;/span&gt; software, &lt;span style="color: rgb(0, 153, 0);"&gt;package e1071&lt;/span&gt;, linear discriminant analysis, pls discriminant analysis, linear svm, logistic regression&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Components&lt;/span&gt; : NAIVE BAYES CONTINUOUS, BINARY LOGISTIC REGRESSION, SVM, C-PLS, LINEAR DISCRIMINANT ANALYSIS&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tutorial:&lt;/span&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Naive_Bayes_Continuous_Predictors.pdf" target="_blank"&gt;en_Tanagra_Naive_Bayes_Continuous_Predictors.pdf&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Dataset&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/breast.txt" target="_blank"&gt;breast&lt;/a&gt; ; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/low_birth_weight_nbc.arff" target="_blank"&gt;low birth weight&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;References&lt;/span&gt; :&lt;br /&gt;Wikipedia, "&lt;a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier" target="_blank"&gt;Naive bayes classifier&lt;/a&gt;"&lt;br /&gt;Tanagra, "&lt;a href="http://data-mining-tutorials.blogspot.com/2010/07/naive-bayes-classifier-for-discrete.html"&gt;Naive bayes classifier for discrete predictors&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-3893972798363934517?l=data-mining-tutorials.blogspot.com' alt='' 
/&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3893972798363934517'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3893972798363934517'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/11/naive-bayes-classifier-for-continuous.html' title='Naive bayes classifier for continuous predictors'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-5157036174110104915</id><published>2010-10-19T09:54:00.000-07:00</published><updated>2010-10-19T09:55:48.911-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Tanagra'/><title type='text'>Tanagra - Version 1.4.37</title><content type='html'>&lt;span style="font-weight: bold; color: rgb(0, 153, 0);"&gt;Naive Bayes Continuous&lt;/span&gt; is a supervised learning component. It implements the naive bayes principle for continuous predictors (gaussian assumption, heteroscedasticity or homoscedasticity). 
The main originality is that it provides an explicit model corresponding to a linear combination of predictors and, possibly, their squares.&lt;br /&gt;&lt;br /&gt;Enhancement of the &lt;span style="font-weight: bold; color: rgb(0, 153, 0);"&gt;reporting &lt;/span&gt;module.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-5157036174110104915?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5157036174110104915'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5157036174110104915'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/10/tanagra-version-1437.html' title='Tanagra - Version 1.4.37'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-3774951973658351277</id><published>2010-10-14T01:06:00.000-07:00</published><updated>2010-10-14T01:12:21.471-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Feature Selection'/><title type='text'>Filter methods for feature selection</title><content type='html'>The nature of the predictors' selection process has changed considerably. Previously, work in machine learning concentrated on the search for the best subset of features for a learning classifier, in a context where the number of candidate features was rather small and the computing time was not a major constraint. 
Today, it is common to deal with datasets comprising thousands of descriptors. Consequently, the problem of feature selection still consists in finding the most relevant subset of predictors, but with a new strong constraint: the computing time must remain reasonable.&lt;br /&gt;&lt;br /&gt;In this tutorial, we are interested in correlation based filter approaches for &lt;span style="color: rgb(0, 153, 0); font-weight: bold;"&gt;discrete predictors&lt;/span&gt;. The goal is to highlight the most relevant subset of predictors which are highly correlated with the target attribute and, at the same time, weakly correlated with each other, i.e. which are not redundant. To evaluate the behavior of the various methods, we use an artificial dataset where we add irrelevant and redundant candidate variables. Then, we perform a feature selection based on the approaches analyzed. We compare the generalization error rate of the naive bayes classifier learned from the various subsets of selected variables. We first conduct the experiment with Tanagra. 
Then, we show how to perform the same analysis with other tools (&lt;span style="color: rgb(51, 102, 255);"&gt;Weka 3.6.0&lt;/span&gt;, &lt;span style="color: rgb(51, 51, 255);"&gt;Orange 2.0b&lt;/span&gt;, &lt;span style="color: rgb(51, 51, 255);"&gt;RapidMiner 4.6.0&lt;/span&gt;, &lt;span style="color: rgb(51, 51, 255);"&gt;R 2.9.2&lt;/span&gt; - package &lt;span style="color: rgb(51, 51, 255);"&gt;FSelector&lt;/span&gt;).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Keywords&lt;/span&gt;: filter, feature selection, correlation based measure, discrete predictors, naive bayes classifier, bootstrap&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Components&lt;/span&gt;: FEATURE RANKING, CFS FILTERING, MIFS FILTERING, FCBF FILTERING, MODTREE FILTERING, NAIVE BAYES, BOOTSTRAP&lt;br /&gt;&lt;b&gt;Tutorial:&lt;/b&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Filter_Method_Discrete_Predictors.pdf" target="_blank"&gt;en_Tanagra_Filter_Method_Discrete_Predictors.pdf&lt;/a&gt;&lt;br /&gt;&lt;b&gt;Dataset:&lt;/b&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/vote_filter_approach.zip" target="_blank"&gt;vote_filter_approach.zip&lt;/a&gt;&lt;br /&gt;&lt;b&gt;References: &lt;/b&gt;&lt;br /&gt;Tanagra, "&lt;a href="http://data-mining-tutorials.blogspot.com/search/label/Feature%20Selection"&gt;Feature Selection&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-3774951973658351277?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3774951973658351277'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3774951973658351277'/><link rel='alternate' type='text/html' 
href='http://data-mining-tutorials.blogspot.com/2010/10/filter-methods-for-feature-selection.html' title='Filter methods for feature selection'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-7903721472103544510</id><published>2010-08-30T06:11:00.000-07:00</published><updated>2010-08-30T06:17:40.960-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><category scheme='http://www.blogger.com/atom/ns#' term='Data file handling'/><title type='text'>Connecting Sipina and Excel using OLE</title><content type='html'>The connection between a data mining tool and Excel (and, more generally, spreadsheets) is a very important issue. We have addressed this topic many times in our tutorials. With hindsight, I think the solution based on add-ins for Excel is the best one, both for &lt;a href="http://data-mining-tutorials.blogspot.com/2010/08/sipina-add-in-for-excel.html"&gt;SIPINA&lt;/a&gt; and for &lt;a href="http://data-mining-tutorials.blogspot.com/2010/08/tanagra-add-in-for-office-2007-and.html"&gt;TANAGRA&lt;/a&gt;. It is simple, reliable and highly efficient. It does not require developing specific versions. The connection with Excel is a simple additional functionality of the standard distribution.&lt;br /&gt;&lt;br /&gt;Prior to reaching this solution, we had explored different trails. In this tutorial, we present the XL-SIPINA software based on Microsoft's OLE technology. In contrast to the add-in solution, this version of SIPINA chooses to embed Excel into the Data Mining tool. The system works rather well. 
Nevertheless, it was finally dropped for two reasons: (1) we were forced to compile special versions that work only if Excel is installed on the user's machine; (2) the transfer time between Excel and Sipina using OLE becomes prohibitive when the database size grows.&lt;br /&gt;&lt;br /&gt;Thus, XL-SIPINA was essentially a short-lived attempt. There is always a bit of nostalgia when I come back to solutions that I explored and finally abandoned. It may also be that I have not completely explored this solution.&lt;br /&gt;&lt;br /&gt;Last, the application was initially developed for Office 97. I note that it is still up to date today: it works fine with Office 2010.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Keywords&lt;/span&gt;: excel, spreadsheet, sipina, xls, xlsx, xl-sipina, decision tree induction&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Download XL-SIPINA&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/softs/setup_xl_sipina.exe" target="_blank"&gt;XL-SIPINA&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tutorial&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/softs/en_xls_sipina.pdf" target="_blank"&gt;en_xls_sipina.pdf&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Dataset&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/auto_for_decision_tree_analysis.xls" target="_blank"&gt;autos&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-7903721472103544510?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/7903721472103544510'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/7903721472103544510'/><link rel='alternate' type='text/html' 
href='http://data-mining-tutorials.blogspot.com/2010/08/connecting-sipina-and-excel-using-ole.html' title='Connecting Sipina and Excel using OLE'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-1588652364488914774</id><published>2010-08-27T02:38:00.000-07:00</published><updated>2010-08-27T03:00:51.748-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><category scheme='http://www.blogger.com/atom/ns#' term='Data file handling'/><title type='text'>Sipina add-in for Excel</title><content type='html'>Data import is a bottleneck for data mining tools. The majority of users work with a spreadsheet tool such as Excel, mainly in combination with specialized data mining software (see &lt;a href="http://www.kdnuggets.com/polls/2010/data-mining-analytics-tools.html" target="_blank"&gt;KDnuggets polls&lt;/a&gt;). Therefore, a recurring issue for users is "how to send my data from Excel to SIPINA?"&lt;br /&gt;&lt;br /&gt;It is possible to import different types of formats into SIPINA. For Excel workbooks, a dedicated mechanism has been implemented.&lt;br /&gt;&lt;br /&gt;An add-in is automatically copied to the computer during the installation process. It must be integrated into Excel. The add-in incorporates a new menu into Excel. 
After selecting the data range, the user only has to activate it. This leads to the following: (1) SIPINA starts automatically; (2) the data are transferred via the clipboard; (3) SIPINA considers that the first row of the range of cells contains the names of the variables; (4) columns with numerical values are treated as quantitative variables; (5) columns with alphanumeric values are treated as categorical variables.&lt;br /&gt;&lt;br /&gt;Unlike the other tutorials, the sequence of manipulations is described in a video. The description is valid only for versions up to Excel 2003. Another tutorial describes &lt;a href="http://data-mining-tutorials.blogspot.com/2010/08/tanagra-add-in-for-office-2007-and.html"&gt;the use of the add-in under Office 2007 and Office 2010&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Keywords&lt;/span&gt;: excel file format, add-in, decision tree&lt;br /&gt;&lt;strong&gt;Installing the add-in :&lt;/strong&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/doc/sipina_xla_installation.htm" target="_blank"&gt;sipina_xla_installation.htm&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Using the add-in:&lt;/strong&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/doc/sipina_xla_processing.htm" target="_blank"&gt;sipina_xla_processing.htm&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-1588652364488914774?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1588652364488914774'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1588652364488914774'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/08/sipina-add-in-for-excel.html' title='Sipina add-in for 
Excel'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-2343715851998873339</id><published>2010-08-27T01:36:00.000-07:00</published><updated>2010-08-27T02:49:54.006-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><category scheme='http://www.blogger.com/atom/ns#' term='Data file handling'/><title type='text'>Tanagra add-in for Office 2007 and Office 2010</title><content type='html'>The "tanagra.xla" add-in for Excel contributes to the wide diffusion of Tanagra. The principle is simple: it embeds a Tanagra menu in Excel. Thus the user can run statistical calculations without having to leave the spreadsheet. It seems simplistic, but this feature greatly facilitates the work of the data miner. Indeed, the spreadsheet is one of the most used tools for preparing datasets (see KDNuggets Polls: &lt;a href="http://www.kdnuggets.com/polls/2008/tools-languages-used-data-cleaning.htm" target="_blank"&gt;Tools / Languages for Data Cleaning&lt;/a&gt; - 2008). By embedding the data mining tool in the spreadsheet environment, it spares the practitioner tedious and repetitive manipulations: importing the dataset, exporting the dataset, checking the compatibilities between data file formats, etc.&lt;br /&gt;&lt;br /&gt;The installation and the use of the "tanagra.xla" add-in under the previous versions of Office are described &lt;a href="http://data-mining-tutorials.blogspot.com/2008/10/excel-file-handling-using-add-in.html"&gt;elsewhere&lt;/a&gt; (Office 1997 to Office 2003). This description is obsolete for the latest versions of Office because the organization of the menus has been modified in these versions, i.e. Office 2007 and Office 2010. And yet, the add-in is still operational. 
In this tutorial, we show how to install and to use the Tanagra add-in under Office 2007 and 2010.&lt;br /&gt;&lt;br /&gt;This transition to recent versions of Excel is not without consequences. Indeed, compared to the previous Excel versions, Excel 2007 (and 2010) can handle many more rows and columns. &lt;span style="color: rgb(153, 51, 0);"&gt;We can process a dataset up to 1,048,575 observations&lt;/span&gt; (the first line corresponds to the variable names) &lt;span style="color: rgb(153, 0, 0);"&gt;and 16,384 variables&lt;/span&gt;. In this tutorial, we will treat a database with 100,000 observations and 22 variables (wave100k.xlsx). This is a version of the famous &lt;a href="http://archive.ics.uci.edu/ml/datasets/Waveform+Database+Generator+%28Version+1%29" target="_blank"&gt;waveform&lt;/a&gt; database. Note that this file, because of the number of rows, cannot be manipulated by earlier versions of Excel.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0);"&gt;The process described in this document is also valid for the SIPINA add-in&lt;/span&gt; (&lt;a href="http://data-mining-tutorials.blogspot.com/2010/08/sipina-add-in-for-excel.html"&gt;sipina.xla&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Keywords&lt;/span&gt;: data importation, excel, add-in&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Components&lt;/span&gt;: VIEW DATASET&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tutorial&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Add_In_Excel_2007_2010.pdf" target="_blank"&gt;en_Tanagra_Add_In_Excel_2007_2010.pdf&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Dataset&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/wave100k.xlsx" target="_blank"&gt;wave100k.xlsx&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;References&lt;/span&gt;:&lt;br /&gt;Tanagra, "&lt;a 
href="http://data-mining-tutorials.blogspot.com/2008/10/excel-file-handling-using-add-in.html"&gt;Excel file handling using an add-in&lt;/a&gt;".&lt;br /&gt;Tanagra, "&lt;a href="http://data-mining-tutorials.blogspot.com/2008/10/ooocalc-file-handling-using-add-in.html"&gt;OOo Calc file handling using an add-in&lt;/a&gt;".&lt;br /&gt;Tanagra, "&lt;a href="http://data-mining-tutorials.blogspot.com/2009/04/launching-tanagra-from-oocalc-under.html"&gt;Launching Tanagra from OOo Calc under Linux&lt;/a&gt;".&lt;br /&gt;Tanagra, "&lt;a href="http://data-mining-tutorials.blogspot.com/2010/08/sipina-add-in-for-excel.html"&gt;Sipina add-in for Excel&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-2343715851998873339?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/2343715851998873339'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/2343715851998873339'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/08/tanagra-add-in-for-office-2007-and.html' title='Tanagra add-in for Office 2007 and Office 2010'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-8671610544546668437</id><published>2010-07-24T06:42:00.001-07:00</published><updated>2010-07-24T07:41:22.788-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>Naive bayes classifier for discrete predictors</title><content 
type='html'>The naive bayes approach is a supervised learning method which is based on a simplistic hypothesis: it assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. Yet, despite this, it appears robust and efficient. Its performance is comparable to other supervised learning techniques.&lt;br /&gt;&lt;br /&gt;We introduce in &lt;span style="color: rgb(153, 0, 0);"&gt;Tanagra&lt;/span&gt; (&lt;span style="color: rgb(153, 0, 0);"&gt;version 1.4.36&lt;/span&gt; and later) a new presentation of the results of the learning process. The classifier is easier to understand, and its deployment is also made easier.&lt;br /&gt;&lt;br /&gt;In the first part of this tutorial, we present some theoretical aspects of the naive bayes classifier. Then, we implement the approach on a dataset with Tanagra. We compare the obtained results (the parameters of the model) to those obtained with other linear approaches such as the logistic regression, the linear discriminant analysis and the linear SVM. We note that the results are highly consistent. This largely explains the good performance of the method in comparison to others.&lt;br /&gt;&lt;br /&gt;In the second part, we use various tools on the same dataset (&lt;span style="color: rgb(0, 153, 0);"&gt;Weka 3.6.0&lt;/span&gt;, &lt;span style="color: rgb(0, 153, 0);"&gt;R 2.9.2&lt;/span&gt;, &lt;span style="color: rgb(0, 153, 0);"&gt;Knime 2.1.1&lt;/span&gt;, &lt;span style="color: rgb(0, 153, 0);"&gt;Orange 2.0b&lt;/span&gt; and &lt;span style="color: rgb(0, 153, 0);"&gt;RapidMiner 4.6.0&lt;/span&gt;). 
We try above all to understand the obtained results.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: naive bayes, linear classifier, linear discriminant analysis, logistic regression, linear support vector machine, svm&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Components&lt;/span&gt;: NAIVE BAYES, LINEAR DISCRIMINANT ANALYSIS, BINARY LOGISTIC REGRESSION, SVM, 0_1_BINARIZE&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Naive_Bayes_Classifier_Explained.pdf" target="_blank"&gt;en_Tanagra_Naive_Bayes_Classifier_Explained.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/heart_for_naive_bayes.zip" target="_blank"&gt;heart_for_naive_bayes.zip&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;   :&lt;br /&gt;Wikipedia, "&lt;a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier" target="_blank"&gt;Naive bayes  classifier&lt;/a&gt;".&lt;br /&gt;T. 
Mitchell, "&lt;a href="http://www.cs.cmu.edu/%7Etom/mlbook/NBayesLogReg.pdf" target="_blank"&gt;Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression&lt;/a&gt;", in Machine Learning, Chapter 1, 2005.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-8671610544546668437?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8671610544546668437'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8671610544546668437'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/07/naive-bayes-classifier-for-discrete.html' title='Naive bayes classifier for discrete predictors'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-5071610824231660345</id><published>2010-07-20T20:42:00.000-07:00</published><updated>2010-07-20T20:47:55.275-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>Interactive decision tree learning with Spad</title><content type='html'>In this tutorial, we are interested in SPAD. This is a French software package specialized in exploratory data analysis, which has evolved considerably in recent years. 
We perform a sequence of analyses on a dataset spread across 3 worksheets of an Excel data file: (1) we create a classification tree from the learning sample in the first worksheet, we analyze some nodes of the tree in depth to highlight the characteristics of the covered instances, and we also try to modify interactively (manually) the properties of some splitting operations; (2) we apply the classifier to the unseen cases of the second worksheet; (3) we compare the predictions of the model with the actual values of the target attribute contained in the third worksheet.&lt;br /&gt;&lt;br /&gt;Of course, we can perform this process using free tools such as SIPINA (the interactive construction of the tree) or R (the programming of the sequence of operations, in particular applying the model to an unlabeled dataset). But with Spad or other commercial tools (e.g. SPSS Modeler, SAS Enterprise Miner, STATISTICA Data Miner…), we can very easily specify the whole sequence, even if we are not especially familiar with data mining tools.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Keywords&lt;/span&gt;: decision tree, classification tree, interactive decision tree, spad, sipina, r software&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tutorial&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Arbres_IDT_Spad.pdf" target="_blank"&gt;en_Tanagra_Arbres_IDT_Spad.pdf&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Dataset:&lt;/span&gt;&lt;span&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/pima-arbre-spad.zip" target="_blank"&gt;pima-arbre-spad.zip&lt;/a&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;References&lt;/span&gt; :&lt;br /&gt;SPAD, &lt;a href="http://www.spad.eu/" target="_blank"&gt;http://www.spad.eu/&lt;/a&gt;&lt;br /&gt;SIPINA, &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/sipina.html" 
target="_blank"&gt;http://eric.univ-lyon2.fr/~ricco/sipina.html&lt;/a&gt;&lt;br /&gt;R Project, &lt;a href="http://www.r-project.org/" target="_blank"&gt;http://www.r-project.org/&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-5071610824231660345?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5071610824231660345'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5071610824231660345'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/07/interactive-decision-tree-learning-with.html' title='Interactive decision tree learning with Spad'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-7702541561563780532</id><published>2010-07-12T10:27:00.000-07:00</published><updated>2010-07-12T10:33:10.223-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>Supervised learning from imbalanced dataset</title><content type='html'>In real problems, the classes are often not equally represented in the dataset. The instances of the positive class, the one that we want to detect, are often few. For instance, in a fraud detection problem, there are very few cases of fraud compared with the large number of honest connections; in a medical problem, the ill persons are fortunately rare; etc. In these situations, using the standard learning process and assessing the classifier with the confusion matrix and the misclassification rate are not appropriate. 
We observe that the default classifier, which assigns all the instances to the majority class, is the one which minimizes the error rate.&lt;br /&gt;&lt;br /&gt;For the dataset that we analyze in this tutorial, 1.77% of all the examples belong to the positive class. If we assign all the instances to the negative class - this is the default classifier - the misclassification rate is 1.77%. It is difficult to find a classifier which is able to do better. Yet we know that this is not a good classifier, especially because it does not supply a usable degree of membership to the classes (Note: in fact, it assigns the same degree of membership to all the instances).&lt;br /&gt;&lt;br /&gt;One strategy to improve the behavior of learning algorithms facing the imbalance problem is to artificially balance the dataset. We can do this by eliminating some instances of the over-represented class (downsizing) or by duplicating some instances of the small class (over sampling). But few people analyze the consequences of this solution on the performance of the classifier.&lt;br /&gt;&lt;br /&gt;In this tutorial, we highlight the consequences of the downsizing on the behavior of the &lt;span style="color: rgb(0, 153, 0);"&gt;logistic regression&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: imbalanced dataset, logistic regression, over sampling, under sampling&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Components&lt;/span&gt;: BINARY LOGISTIC REGRESSION, DISCRETE SELECT EXAMPLES, SCORING, RECOVER EXAMPLES, ROC CURVE, TEST&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tutorial&lt;/span&gt; : &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Imbalanced_Dataset.pdf" target="_blank"&gt;en_Tanagra_Imbalanced_Dataset.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;   : &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/imbalanced_dataset.xls" 
target="_blank"&gt;imbalanced_dataset.xls&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;     :&lt;br /&gt;D. Hosmer, S. Lemeshow, « Applied Logistic Regression », John Wiley &amp;amp; Sons, Inc, Second Edition, 2000.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-7702541561563780532?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/7702541561563780532'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/7702541561563780532'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/07/supervised-learning-from-imbalanced.html' title='Supervised learning from imbalanced dataset'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-4731981155829981389</id><published>2010-06-09T07:22:00.000-07:00</published><updated>2010-06-09T07:33:32.401-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><title type='text'>Handling large dataset in R - The "filehash" package</title><content type='html'>The processing of very large datasets is a crucial problem in data mining. To handle them, we must avoid loading the whole dataset into memory. The idea is quite simple: (1) we write all or part of the dataset to disk in a binary file format that allows direct access; (2) the machine learning algorithms must be modified to efficiently access the values stored on the disk. 
Thus, the characteristics of the computer are no longer a bottleneck for the handling of a large dataset.&lt;br /&gt;&lt;br /&gt;In this tutorial, we describe the great "filehash" package for R. It allows us to copy (to dump) any kind of R object into a file. We can handle these objects without loading them into main memory. This is especially useful for data frame objects. Indeed, we can perform a statistical analysis with the usual functions directly from a database on the disk. The processing capacities are vastly improved and, at the same time, we will note that the increase in computation time remains moderate.&lt;br /&gt;&lt;br /&gt;To evaluate the "filehash" solution, we analyze the memory occupation and the computation time, with and without the package, when learning a decision tree with rpart (rpart package) and performing a linear discriminant analysis with lda (MASS package). We perform the same experiments using SIPINA. Indeed, it also provides a swapping system (the data is dumped from the main memory to temporary files) for the handling of very large datasets. 
We can then compare the performance of the various solutions.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: very large dataset, filehash, decision tree, linear discriminant analysis, sipina,  C4.5, rpart, lda&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tutorial&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Dealing_Very_Large_Dataset_With_R.pdf" target="_blank"&gt;en_Tanagra_Dealing_Very_Large_Dataset_With_R.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;   : &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/wave2M.txt.zip" target="_blank"&gt;wave2M.txt.zip&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;  :&lt;br /&gt;R  package, "&lt;a href="http://cran.r-project.org/web/packages/filehash/index.html" target="_blank"&gt;Filehash : Simple key-value database&lt;/a&gt;"&lt;br /&gt;Yu-Sung Su's Blog, "&lt;a href="http://yusung.blogspot.com/2007/09/dealing-with-large-data-set-in-r.html" target="_blank"&gt;Dealing with large dataset in R&lt;/a&gt;"&lt;br /&gt;Tanagra Tutorial, "&lt;a href="http://data-mining-tutorials.blogspot.com/2010/01/dealing-with-very-large-dataset-in.html"&gt;Dealing with very large dataset in Sipina&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-4731981155829981389?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4731981155829981389'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4731981155829981389'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/06/handling-large-dataset-in-r.html' title='Handling large dataset in R - The &quot;filehash&quot; 
package'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-4326915598201712548</id><published>2010-05-27T05:17:00.000-07:00</published><updated>2010-05-27T05:23:40.742-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>Logistic Regression Diagnostics</title><content type='html'>This tutorial describes tools for the diagnosis and the assessment of a logistic regression. These tools are available in Tanagra version 1.4.33 (and later).&lt;br /&gt;&lt;br /&gt;We deal with a credit scoring problem. Using logistic regression, we try to determine the factors underlying the approval or refusal of credit to customers. We perform the following steps:&lt;br /&gt;- Estimating the parameters of the classifier;&lt;br /&gt;- Retrieving the covariance matrix of coefficients;&lt;br /&gt;- Assessment using the Hosmer and Lemeshow goodness of fit test;&lt;br /&gt;- Assessment using the reliability diagram;&lt;br /&gt;- Assessment using the ROC curve;&lt;br /&gt;- Analysis of residuals, detection of outliers and influential points.&lt;br /&gt;&lt;br /&gt;First, we use &lt;span style="color: rgb(51, 102, 255);"&gt;Tanagra 1.4.33&lt;/span&gt;. Then, we perform the same analysis using the &lt;span style="color: rgb(0, 153, 0);"&gt;R 2.9.2 software [glm(.) 
procedure]&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: logistic regression, residual analysis, outliers, influential points, pearson residual, deviance residual, leverage, cook's distance, dfbeta, dfbetas, hosmer-lemeshow goodness of fit test, reliability diagram, calibration plot, glm()&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;:  BINARY LOGISTIC REGRESSION, HOSMER LEMESHOW TEST, RELIABILITY DIAGRAM,  LOGISTIC REGRESSION RESIDUALS&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Logistic_Regression_Diagnostics.pdf" target="_blank"&gt;en_Tanagra_Logistic_Regression_Diagnostics.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/logistic_regression_diagnostics.zip" target="_blank"&gt;logistic_regression_diagnostics.zip&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;  :&lt;br /&gt;D. Garson, "&lt;a href="http://faculty.chass.ncsu.edu/garson/PA765/logistic.htm" target="_blank"&gt;Logistic Regression&lt;/a&gt;"&lt;br /&gt;D. Hosmer, S. 
Lemeshow, « Applied Logistic Regression », John Wiley &amp;amp; Sons, Inc, Second Edition, 2000.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-4326915598201712548?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4326915598201712548'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4326915598201712548'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/05/logistic-regression-diagnostics.html' title='Logistic Regression Diagnostics'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-6578202565102628458</id><published>2010-05-20T15:02:00.000-07:00</published><updated>2010-05-20T22:16:26.084-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Feature Construction'/><title type='text'>Discretization of continuous features</title><content type='html'>The discretization transforms a continuous attribute into a discrete one. To do that, it partitions the range into a set of intervals by defining a set of cut points. Thus we must answer two questions to carry out this data transformation: (1) how to determine the right number of intervals; (2) how to compute the cut points. The two questions are not necessarily resolved in that order.&lt;br /&gt;&lt;br /&gt;The best discretization is the one performed by a domain expert. 
Indeed, the expert takes into account information other than that provided by the available dataset. Unfortunately, this kind of approach is not always feasible because: often, the domain knowledge is not available or it does not allow us to determine the appropriate discretization; the process cannot be automated to handle a large number of attributes. So, we are often forced to base the determination of the best discretization on a numerical process.&lt;br /&gt;&lt;br /&gt;Discretization of continuous features as a preprocessing step for supervised learning. First, we must define the context in which we perform the transformation. Depending on the circumstances, it is clear that the process and criteria used will not be the same. In this tutorial, we are in the supervised learning framework. We perform the discretization prior to the learning process, i.e. we transform the continuous predictive attributes into discrete ones before presenting them to a supervised learning algorithm. In this context, it is desirable to construct intervals in which one and only one of the values of the target attribute is the most represented. The relevance of the computed solution is often evaluated through an impurity-based or an entropy-based function.&lt;br /&gt;&lt;br /&gt;In this tutorial, we use only the univariate approaches. We compare the behavior of the supervised and the unsupervised algorithms on an artificial dataset. We use several tools for that: &lt;span style="color: rgb(51, 51, 255);"&gt;Tanagra 1.4.35&lt;/span&gt;, &lt;span style="color: rgb(0, 153, 0);"&gt;Sipina 3.3&lt;/span&gt;, &lt;span style="color: rgb(0, 153, 0);"&gt;R 2.9.2&lt;/span&gt; (package dprep), &lt;span style="color: rgb(0, 153, 0);"&gt;Weka 3.6.0&lt;/span&gt;, &lt;span style="color: rgb(0, 153, 0);"&gt;Knime 2.1.1&lt;/span&gt;, &lt;span style="color: rgb(0, 153, 0);"&gt;Orange 2.0b&lt;/span&gt; and &lt;span style="color: rgb(0, 153, 0);"&gt;RapidMiner 4.6.0&lt;/span&gt;. 
We highlight the settings of the algorithms and the reading of the results.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Keywords:&lt;/b&gt; mdlpc, discretization, supervised learning, equal frequency intervals, equal width intervals&lt;br /&gt;&lt;b&gt;Components:&lt;/b&gt; MDLPC, Supervised Learning, Decision List&lt;br /&gt;&lt;b&gt;Tutorial:&lt;/b&gt; &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Discretization_for_Supervised_Learning.pdf" target="_blank"&gt;en_Tanagra_Discretization_for_Supervised_Learning.pdf&lt;/a&gt;&lt;br /&gt;&lt;b&gt;Dataset:&lt;/b&gt; &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/data-discretization.arff" target="_blank"&gt;data-discretization.arff&lt;/a&gt;&lt;br /&gt;&lt;b&gt;References : &lt;o:p&gt;&lt;/o:p&gt;&lt;/b&gt;&lt;br /&gt;F.  Muhlenbach, R. Rakotomalala, « Discretization of Continuous Attributes  », in Encyclopedia of Data Warehousing and Mining, John Wang (Ed.), pp.  397-402, 2005 (&lt;a href="http://hal.archives-ouvertes.fr/hal-00383757/fr/" target="_blank"&gt;http://hal.archives-ouvertes.fr/hal-00383757/fr/&lt;/a&gt;).&lt;br /&gt;Tanagra Tutorial, "&lt;a href="http://data-mining-tutorials.blogspot.com/2008/11/discretization-and-naive-bayes.html"&gt;Discretization and Naive Bayes Classifier&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-6578202565102628458?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6578202565102628458'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6578202565102628458'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/05/discretization-of-continuous-features.html' title='Discretization of continuous 
features'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-1006811416263408525</id><published>2010-05-16T11:58:00.000-07:00</published><updated>2010-05-16T12:12:12.042-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><title type='text'>Sipina Decision Graph Algorithm (case study)</title><content type='html'>SIPINA is a data mining tool. But it is also a machine learning method. It corresponds to an algorithm for the induction of decision graphs (see References, section 9). A decision graph is a generalization of a decision tree where we can merge any two terminal nodes of the graph, and not only the leaves issued from the same node.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0);"&gt;The SIPINA method is &lt;/span&gt;&lt;span style="color: rgb(0, 153, 0);"&gt;only &lt;/span&gt;&lt;span style="color: rgb(0, 153, 0);"&gt;available in version 2.5 of the SIPINA data mining tool&lt;/span&gt;. This version has some drawbacks. Among others, it cannot handle large datasets (more than 16,383 instances). But it is the only tool which implements the decision graphs algorithm. This is the main reason why this version is still available online. If we want to implement a decision tree algorithm such as C4.5 or CHAID, or if we want to create a decision tree interactively, it is more advantageous to use the research version (also named version 3.0). 
The research version is more powerful and it supplies many functionalities for data exploration.&lt;br /&gt;&lt;br /&gt;In this tutorial, we show how to implement the Sipina decision graph algorithm with the Sipina software version 2.5. We want to predict the low birth weight of newborns from the characteristics of their mothers. Above all, we want to show how to use this 2.5 version, which is not well documented. We also want to point out the benefit of decision graphs when we treat a small dataset, i.e. when data fragmentation becomes a crucial problem.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Keywords&lt;/span&gt;: decision graphs, decision trees, sipina version 2.5&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/doc/en_sipina_method.pdf" target="_blank"&gt;en_sipina_method.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/dataset/low_birth_weight_v4.xls" target="_blank"&gt;low_birth_weight_v4.xls&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;References&lt;/span&gt;:&lt;br /&gt;Wikipedia, "&lt;a href="http://en.wikipedia.org/wiki/Decision_tree_learning#Extending_decision_trees_with_decision_graphs" target="_blank"&gt;Decision tree learning&lt;/a&gt;"&lt;br /&gt;J. Oliver, Decision Graphs: An extension of Decision Trees, in Proc. of Int. Conf. on Artificial Intelligence and Statistics, 1993.&lt;br /&gt;R. Rakotomalala, Graphes d'induction, PhD Dissertation, University Lyon 1, 1997 (URL: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/publications.html" target="_blank"&gt;http://eric.univ-lyon2.fr/~ricco/publications.html&lt;/a&gt;; in French).&lt;br /&gt;D. Zighed, R. 
Rakotomalala, Graphes d'induction : Apprentissage et Data Mining, Hermes, 2000 (in French).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-1006811416263408525?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1006811416263408525'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1006811416263408525'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/05/sipina-decision-graph-algorithm-case.html' title='Sipina Decision Graph Algorithm (case study)'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-2804163829007233425</id><published>2010-05-13T23:15:00.000-07:00</published><updated>2010-05-13T23:27:12.169-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><title type='text'>User's guide for the old Sipina 2.5 version</title><content type='html'>SIPINA has a long history. Before the current version (version 3.3, May 2010), we distributed a data mining tool dedicated exclusively to the induction of decision graphs, a generalization of decision trees. Of course, the state-of-the-art decision tree algorithms (such as C4.5, CHAID) were also included.&lt;br /&gt;&lt;br /&gt;This version, called 2.5, has been online since 1995. 
Its development was suspended in 1998 when I started programming version 3.0.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;"&gt;This version 2.5 is the only free tool which implements the decision graphs algorithm&lt;/span&gt;. It is a real curiosity in this respect, which is why I still distribute this version today.&lt;br /&gt;&lt;br /&gt;On the other hand, this 2.5 version has some severe limitations. Among others, it can handle only small datasets, up to 16,380 instances. If you want to implement a decision tree or if you want to handle a large dataset, it is advisable to use the current version (version 3.0 and later).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Setup of the old 2.5 version&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/softs/Setup_Sipina_V25.exe" target="_blank"&gt;Setup_Sipina_V25.exe&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;User's guide&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/softs/EnglishDocSipinaV25.pdf" target="_blank"&gt;EnglishDocSipinaV25.pdf&lt;br /&gt;&lt;/a&gt;&lt;span style="font-weight: bold;"&gt;References&lt;/span&gt;:&lt;br /&gt;J. Oliver, "Decision Graphs - An Extension of Decision Trees", in Proc. of the 4th Int. Workshop on Artificial Intelligence and Statistics, pages 343-350, 1993.&lt;br /&gt;R. Rakotomalala, "&lt;a href="http://eric.univ-lyon2.fr/%7Ericco/doc/Graphes_Induction_These_Rakotomalala_1997.pdf" target="_blank"&gt;Induction Graphs&lt;/a&gt;", PhD Thesis, University of Lyon 1, 1997 (in French).&lt;br /&gt;D. Zighed, R. 
Rakotomalala, "Graphes d'Induction - Apprentissage et Data Mining", Hermes, 2000 (in French).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-2804163829007233425?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/2804163829007233425'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/2804163829007233425'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/05/users-guide-for-old-sipina-25-version.html' title='User&apos;s guide for the old Sipina 2.5 version'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-6453773133248611080</id><published>2010-05-10T06:36:00.000-07:00</published><updated>2010-05-10T06:39:52.199-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='PLS Regression'/><category scheme='http://www.blogger.com/atom/ns#' term='Regression analysis'/><title type='text'>Solutions for multicollinearity in multiple regression</title><content type='html'>Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated. In this situation the coefficient estimates may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole; it only affects calculations regarding individual predictors. 
That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with others (Wikipedia). Sometimes the signs of the coefficients are inconsistent with the domain knowledge; sometimes, explanatory variables which seem individually significant are invalidated when we add other variables.&lt;br /&gt;&lt;br /&gt;There are two steps when we want to treat this kind of problem: (1) detecting the presence of the collinearity; (2) implementing solutions in order to obtain more consistent results.&lt;br /&gt;&lt;br /&gt;In this tutorial, we study three approaches to avoid the multicollinearity problem: the variable selection; the regression on the latent variables provided by PCA (principal component analysis); the PLS regression (partial least squares).&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords:&lt;/strong&gt; linear regression, multiple regression, collinearity, multicollinearity, principal component analysis, PCA, PLS regression&lt;br /&gt;&lt;strong&gt;Component :&lt;/strong&gt; Multiple  linear regression, Linear Correlation, Forward Entry Regression,  Principal Component Analysis, PLS Regression, PLS Selection, PLS Conf.  
Interval&lt;br /&gt;&lt;strong&gt;Tutorial:&lt;/strong&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Regression_Colinearity.pdf" target="_blank"&gt;en_Tanagra_Regression_Colinearity.pdf&lt;br /&gt;&lt;/a&gt;&lt;strong&gt;Dataset:&lt;/strong&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/car_consumption_colinearity_regression.xls" target="_blank"&gt;car_consumption_colinearity_regression.xls&lt;br /&gt;&lt;/a&gt;&lt;strong&gt;References  : &lt;/strong&gt;&lt;br /&gt;Wikipedia, "&lt;a href="http://en.wikipedia.org/wiki/Multicollinearity" target="_blank"&gt;Multicollinearity&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-6453773133248611080?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6453773133248611080'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6453773133248611080'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/05/solutions-for-multicollinearity-in.html' title='Solutions for multicollinearity in multiple regression'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-3127643883353858241</id><published>2010-04-25T21:20:00.000-07:00</published><updated>2010-04-25T21:27:11.198-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Feature Construction'/><title type='text'>Linear discriminant analysis on PCA factors</title><content type='html'>In 
this tutorial, we show that in certain circumstances, it is more convenient to use the factors computed from a principal component analysis (from the original attributes) as input features for the linear discriminant analysis algorithm.&lt;br /&gt;&lt;br /&gt;The new representation space maintains the proximity between the examples. The new features known as "factors" or "latent variables", which are a linear combination of the original descriptors, have several advantageous properties: (a) their interpretation very often allows us to detect patterns in the initial space; (b) a very small number of factors is enough to restore the information contained in the data, and we can moreover remove the noise from the dataset by using only the most relevant factors (it is a sort of regularization by smoothing the information provided by the dataset); (c) the new features form an orthogonal basis, so learning algorithms such as linear discriminant analysis behave better.&lt;br /&gt;&lt;br /&gt;This approach has a connection to the reduced-rank linear discriminant analysis. But, unlike the latter, the class information is not needed during the computation of the principal components. The computation can be very fast using an appropriate algorithm (such as NIPALS) when we deal with very high-dimensional datasets. 
But, on the other hand, it seems that the standard reduced-rank LDA tends to be better in terms of classification accuracy.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords:&lt;/strong&gt; linear discriminant analysis, principal component analysis, reduced-rank linear discriminant analysis&lt;br /&gt;&lt;strong&gt;Components:&lt;/strong&gt; Supervised Learning,  Linear discriminant analysis, Principal Component Analysis, Scatterplot,  Train-test&lt;br /&gt;&lt;strong&gt;Tutorial:&lt;/strong&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_dr_utiliser_axes_factoriels_descripteurs.pdf" target="_blank"&gt;en_dr_utiliser_axes_factoriels_descripteurs.pdf&lt;br /&gt;&lt;/a&gt;&lt;strong&gt;Dataset:&lt;/strong&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/dr_waveform.bdm" target="_blank"&gt;dr_waveform.bdm&lt;br /&gt;&lt;/a&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;br /&gt;Wikipedia, "&lt;a href="http://en.wikipedia.org/wiki/Linear_discriminant_analysis" target="_blank"&gt;Linear discriminant analysis&lt;/a&gt;".&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-3127643883353858241?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3127643883353858241'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3127643883353858241'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/04/linear-discriminant-analysis-on-pca.html' title='Linear discriminant analysis on PCA factors'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-509414641670518430</id><published>2010-04-21T22:56:00.001-07:00</published><updated>2010-04-21T23:00:35.652-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>Induction of fuzzy rules using Knime</title><content type='html'>This tutorial is the continuation of the one devoted to the induction of decision rules &lt;a href="http://data-mining-tutorials.blogspot.com/2010/02/supervised-rule-induction-software.html"&gt;(Supervised rule induction - Software comparison&lt;/a&gt;). I did not include Knime in that comparison because it implements a method that differs from those of the other tools. Knime computes fuzzy rules. It requires the target variable to be continuous. That seems rather mysterious in the supervised learning context where the class attribute is usually discrete. I thought it was more appropriate to detail the implementation of the method in a tutorial that is exclusively devoted to the &lt;span style="color: rgb(0, 153, 0);"&gt;Knime&lt;/span&gt; rule learner (version 2.1.1).&lt;br /&gt;&lt;br /&gt;In particular, it is important to explain the reasons for the data preparation and how to read the results. 
As a reference, we compare the results with those provided by the rule induction tool proposed by &lt;span style="color: rgb(51, 51, 255);"&gt;Tanagra&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Scientific papers about the method are available online.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Keywords&lt;/span&gt;: induction of rules, supervised learning, fuzzy rules&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Components&lt;/span&gt;: SAMPLING, RULE INDUCTION, TEST&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tutorial&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Induction_Regles_Floues_Knime.pdf" target="_blank"&gt;en_Tanagra_Induction_Regles_Floues_Knime.pdf&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Dataset&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/iris2D.txt" target="_blank"&gt;iris2D.txt&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;References&lt;/span&gt;:&lt;br /&gt;M.R. Berthold, « Mixed fuzzy rule formation », International Journal of Approximate Reasoning, 32, pp. 67-84, 2003.&lt;br /&gt;T.R. Gabriel, M.R. 
Berthold, « Influence of fuzzy norms and other heuristics on mixed  fuzzy rule formation », International Journal of Approximate Reasoning,  35, pp.195-202, 2004.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-509414641670518430?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/509414641670518430'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/509414641670518430'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/04/induction-of-fuzzy-rules-using-knime.html' title='Induction of fuzzy rules using Knime'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-3323366687966519762</id><published>2010-04-16T00:57:00.000-07:00</published><updated>2010-04-16T01:04:12.535-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Feature Selection'/><title type='text'>"Wrapper" for feature selection (continuation)</title><content type='html'>This tutorial is the continuation of the preceding one about the wrapper feature selection in the supervised learning context (http://data-mining-tutorials.blogspot.com/2010/03/wrapper-for-feature-selection.html). 
We analyzed the behavior of &lt;span style="color: rgb(51, 102, 255);"&gt;Sipina&lt;/span&gt;, and we described the source code for the wrapper process (forward search) under &lt;span style="color: rgb(51, 102, 255);"&gt;R&lt;/span&gt; (http://www.r-project.org/). Now, we show how to apply the same principle with &lt;span style="color: rgb(0, 153, 0);"&gt;Knime 2.1.1&lt;/span&gt;, &lt;span style="color: rgb(0, 153, 0);"&gt;Weka 3.6.0&lt;/span&gt; and &lt;span style="color: rgb(0, 153, 0);"&gt;RapidMiner 4.6&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;The approach is as follows: (1) we use the training set to select the most relevant variables for classification; (2) we learn the model on the selected descriptors; (3) we assess the performance on a test set containing all the descriptors.&lt;br /&gt;&lt;br /&gt;This third point is very important. We cannot know in advance which variables will finally be selected. We should not have to manually prepare the test file by including only those which have been selected by the wrapper procedure. This is essential for the automation of the process. Otherwise, each change of setting in the wrapper procedure leading to another subset of descriptors would require us to manually edit the test file, which is very tedious.&lt;br /&gt;&lt;br /&gt;In the light of this specification, it appeared that only Knime was able to implement the complete process. With the other tools, it is possible to select the relevant variables on the training file. 
But I could not (or did not know how to) apply the model to a test file containing all the original variables.&lt;br /&gt;&lt;br /&gt;The naive bayes classifier is the learning method used in this tutorial.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: feature selection, supervised learning, naive bayes classifier, wrapper, knime, weka, rapidminer&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Wrapper_Continued.pdf" target="_blank"&gt;en_Tanagra_Wrapper_Continued.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/mushroom_wrapper.zip" target="_blank"&gt;mushroom_wrapper.zip&lt;br /&gt;&lt;/a&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;&lt;a href="http://jmlr.csail.mit.edu/papers/special/feature03.html" target="_blank"&gt;JMLR Special Issue on Variable and Feature Selection - 2003&lt;/a&gt;&lt;br /&gt;R. Kohavi, G. 
John, « &lt;a href="http://citeseer.ist.psu.edu/cache/papers/cs/1999/http:zSzzSzrobotics.stanford.eduzSz%7EronnykzSzwrappers-chapter.pdf/kohavi98wrapper.pdf/" target="_blank"&gt;The wrapper approach&lt;/a&gt; », 1997.&lt;br /&gt;Wikipedia, "&lt;a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier" target="_blank"&gt;Naive bayes classifier&lt;/a&gt;".&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-3323366687966519762?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3323366687966519762'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3323366687966519762'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/04/wrapper-for-feature-selection.html' title='&quot;Wrapper&quot; for feature selection (continuation)'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-3099475216909191828</id><published>2010-03-30T09:57:00.001-07:00</published><updated>2010-03-30T10:00:33.756-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><category scheme='http://www.blogger.com/atom/ns#' term='Feature Selection'/><title type='text'>"Wrapper" for feature selection</title><content type='html'>The feature selection is a crucial aspect of supervised learning process. We must determine the relevant variables for the prediction of the target variable. 
Indeed, a simpler model is easier to understand and interpret; its deployment is easier because we need to collect less information for prediction; finally, a simpler model is often more robust in generalization, i.e. when we want to classify an unseen instance from the population.&lt;br /&gt;&lt;br /&gt;Three kinds of approaches are often highlighted in the literature. Among them, the WRAPPER approach explicitly uses a performance criterion during the search for the best subset of descriptors. Most often, this is the error rate. But in reality, any kind of criterion can be used. This may be the cost if we use a misclassification cost matrix. It can be the area under the curve (AUC) when we assess the classifier using ROC curves, etc. In this case, the learning method is considered as a black box. We try various subsets of predictors, and we choose the one that optimizes the criterion.&lt;br /&gt;&lt;br /&gt;In this tutorial, we implement the WRAPPER approach with SIPINA and R 2.9.2. For the latter, we give the source code for a forward search strategy. Readers can easily adapt the program to other datasets. Moreover, a careful reading of the source code for R gives a better understanding of the calculations made internally by SIPINA.&lt;br /&gt;&lt;br /&gt;The WRAPPER strategy is a priori the best since it explicitly optimizes the performance criterion. We verify this by comparing the results with those provided by the FILTER approach (FCBF method) available in TANAGRA. 
The conclusions are not as obvious as one might think.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: feature selection, supervised learning, naive bayes classifier, wrapper, fcbf, sipina, R software, RWeka package&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: DISCRETE SELECT EXAMPLES, FCBF FILTERING, NAIVE BAYES, TEST&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Sipina_Wrapper.pdf" target="_blank"&gt;en_Tanagra_Sipina_Wrapper.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/mushroom_wrapper.zip" target="_blank"&gt;mushroom_wrapper.zip&lt;br /&gt;&lt;/a&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;&lt;a href="http://jmlr.csail.mit.edu/papers/special/feature03.html" target="_blank"&gt;JMLR Special Issue on Variable and Feature Selection - 2003&lt;/a&gt;&lt;br /&gt;R. Kohavi, G. John, « &lt;a href="http://citeseer.ist.psu.edu/cache/papers/cs/1999/http:zSzzSzrobotics.stanford.eduzSz%7EronnykzSzwrappers-chapter.pdf/kohavi98wrapper.pdf/" target="_blank"&gt;The wrapper approach&lt;/a&gt; », 1997.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-3099475216909191828?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3099475216909191828'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3099475216909191828'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/03/wrapper-for-feature-selection.html' title='&quot;Wrapper&quot; for feature selection'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-7411521367586985240</id><published>2010-03-23T09:39:00.000-07:00</published><updated>2010-03-23T09:41:11.984-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Tanagra'/><title type='text'>Tanagra - Version 1.4.36</title><content type='html'>&lt;span style="font-weight: bold; color: rgb(0, 153, 0);"&gt;ReliefF&lt;/span&gt; is a component for automatic variable selection in a supervised learning task. It can handle both continuous and discrete descriptors. It can be inserted before any supervised method.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;"&gt;Naive Bayes&lt;/span&gt; was modified. It now describes the prediction model in an explicit form (as a linear combination), which is easy to understand and to deploy.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-7411521367586985240?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/7411521367586985240'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/7411521367586985240'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/03/tanagra-version-1436.html' title='Tanagra - Version 1.4.36'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-3171657054469859571</id><published>2010-02-11T02:04:00.000-08:00</published><updated>2010-02-11T02:08:31.633-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>Supervised rule induction - Software comparison</title><content type='html'>Supervised rule induction methods play an important role in the Data Mining framework. Indeed, they provide an easy-to-understand classifier. A rule uses the following representation: "IF premise THEN conclusion" (e.g. IF an account problem is reported on a client THEN the credit is not accepted).&lt;br /&gt;&lt;br /&gt;Among the rule induction methods, the "separate and conquer" approaches were very popular during the 1990s. Curiously, they are less present today in proceedings or journals. More troublesome still, they are not implemented in commercial software. They are only available in free tools from the Machine Learning community. However, they have several advantages compared to other techniques.&lt;br /&gt;&lt;br /&gt;In this tutorial, we first describe two separate-and-conquer algorithms for the rule induction process. 
Then, we show the behavior of the classification rules algorithms implemented in various tools such as &lt;span style="color: rgb(51, 204, 0);"&gt;Tanagra 1.4.34&lt;/span&gt;, &lt;span style="color: rgb(51, 204, 0);"&gt;Sipina Research 3.3&lt;/span&gt;, &lt;span style="color: rgb(51, 204, 0);"&gt;Weka 3.6.0&lt;/span&gt;, &lt;span style="color: rgb(51, 204, 0);"&gt;R 2.9.2&lt;/span&gt; with the RWeka package, &lt;span style="color: rgb(51, 204, 0);"&gt;RapidMiner 4.6&lt;/span&gt;, or &lt;span style="color: rgb(51, 204, 0);"&gt;Orange 2.0b&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Keywords&lt;/span&gt;: rule induction, separate and conquer, top-down, CN2, decision tree&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Components&lt;/span&gt;: SAMPLING, DECISION LIST, RULE INDUCTION, TEST&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tutorial&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Rule_Induction.pdf" target="_blank"&gt;en_Tanagra_Rule_Induction.pdf&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Dataset&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/life_insurance.zip" target="_blank"&gt;life_insurance.zip&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;References&lt;/span&gt;:&lt;br /&gt;J. Furnkranz, "Separate-and-conquer Rule Learning", Artificial Intelligence Review, Volume 13, Issue 1, pages 3-54, 1999.&lt;br /&gt;P. Clark, T. Niblett, "The CN2 Rule Induction Algorithm", Machine Learning, 3(4):261-283, 1989.&lt;br /&gt;P. Clark, R. 
Boswell, "Rule Induction with CN2: Some recent improvements", Machine Learning - EWSL-91, pages 151-163, Springer Verlag, 1991.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-3171657054469859571?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3171657054469859571'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3171657054469859571'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/02/supervised-rule-induction-software.html' title='Supervised rule induction - Software comparison'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-4712669683977530822</id><published>2010-01-18T23:18:00.000-08:00</published><updated>2010-01-18T23:19:39.880-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Tanagra'/><title type='text'>Tanagra - Version 1.4.35</title><content type='html'>&lt;p&gt;&lt;b&gt;CTP&lt;/b&gt;. The method for detecting the right size of the tree is modified for the "Clustering Tree" with post-pruning component (CTP). It relies both on the angle between the half-lines at each point of the decreasing WSS (within-group sum of squares) curve computed on the growing sample, and on the decrease of the same indicator computed on the pruning sample. Compared to the previous implementation, it results in a smaller number of clusters.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Regression Tree&lt;/b&gt;. 
The previous modification is incorporated into the Regression Tree component, which is a univariate version of CTP.&lt;/p&gt;&lt;p&gt;&lt;b&gt;C-RT Regression Tree&lt;/b&gt;. A new regression tree component was added. It faithfully implements the technique described in the book by Breiman et al. (1984), including the post-pruning part with the 1-SE Rule (Chapter 8, especially p. 226 about the formula for the variance of the MSE).&lt;/p&gt;&lt;p&gt;&lt;b&gt;C-RT&lt;/b&gt;. The report of the C-RT decision tree induction has been extended. Based on the last column of the post-pruning table, it becomes easier to choose the parameter x (in the x-SE Rule) to arbitrarily define the size of the pruned tree.&lt;/p&gt;&lt;p&gt;Some tutorials will describe these various changes soon.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-4712669683977530822?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4712669683977530822'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4712669683977530822'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/01/tanagra-version-1435.html' title='Tanagra - Version 1.4.35'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-40363561890475837</id><published>2010-01-04T05:39:00.000-08:00</published><updated>2010-01-11T21:13:52.251-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><title 
type='text'>Dealing with very large dataset in Sipina</title><content type='html'>The ability to handle large databases is a crucial problem in the data mining context. We want to handle a large dataset in order to detect the hidden information. Most of the free data mining tools have problems with large datasets because they load all the instances and variables into memory. Thus, the limitation of these tools is the available memory.&lt;br /&gt;&lt;br /&gt;To overcome this limitation, we should design solutions that make it possible to copy all or part of the data to disk, and perform the processing by loading into memory only what is necessary at each step of the algorithm (the instances and/or the variables). Although the solution is simple in theory, it is difficult in practice. Indeed, the processing time should remain reasonable even if we increase the disk access. It is very difficult to implement a strategy that is effective regardless of the learning algorithm used (supervised learning, clustering, factorial analysis, etc.). These algorithms handle the data in very different ways: some of them make intensive use of matrix operations; others mainly search for co-occurrences between attribute-value pairs, etc.&lt;br /&gt;&lt;br /&gt;In this tutorial, we present a specific solution in the decision tree induction context. The solution is integrated into SIPINA (as an option) because its internal data structure is designed specifically for decision tree induction. Developing an approach which takes advantage of the specificities of the learning algorithm was easy in this context. 
We show that it is then possible to handle a very large dataset (&lt;span style="color: rgb(0, 153, 0); font-weight: bold;"&gt;41 variables&lt;/span&gt; and &lt;span style="color: rgb(0, 153, 0); font-weight: bold;"&gt;9,634,198 observations&lt;/span&gt;) and to use all the functionalities of the tool (interactive construction of the tree, local descriptive statistics on nodes, etc.).&lt;br /&gt;&lt;br /&gt;To fully appreciate the solution proposed by Sipina, we compare its behavior to that of general-purpose data mining tools such as &lt;span style="color: rgb(153, 0, 0); font-weight: bold;"&gt;Tanagra 1.4.33&lt;/span&gt; or &lt;span style="color: rgb(153, 0, 0); font-weight: bold;"&gt;Knime 2.03&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: very large dataset, decision tree, sampling, sipina, knime&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: ID3&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Sipina_Large_Dataset.pdf" target="_blank"&gt;en_Sipina_Large_Dataset.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/twice-kdd-cup-discretized-descriptors.zip" target="_blank"&gt;twice-kdd-cup-discretized-descriptors.zip&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;Tanagra, « &lt;a href="http://data-mining-tutorials.blogspot.com/2008/11/decision-tree-and-large-dataset.html"&gt;Decision tree and large dataset&lt;/a&gt; ».&lt;br /&gt;Tanagra, « &lt;a href="http://data-mining-tutorials.blogspot.com/2009/10/local-sampling-approach-for-decision.html"&gt;Local sampling for decision tree learning&lt;/a&gt; »&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-40363561890475837?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/5496815755861370799/posts/default/40363561890475837'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/40363561890475837'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/01/dealing-with-very-large-dataset-in.html' title='Dealing with very large dataset in Sipina'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-2075493981494483619</id><published>2010-01-02T01:58:00.000-08:00</published><updated>2010-01-02T02:02:04.981-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><title type='text'>CART - Determining the right size of the tree</title><content type='html'>Determining the appropriate size of the tree is a crucial task in the decision tree learning process. It determines its performance during the deployment into the population (the generalization process). There are two situations to avoid: the under-sized tree, which is too small and poorly captures the relevant information in the training set; and the over-sized tree, which captures specificities of the training set that are not relevant to the population. In both cases, the prediction model performs poorly during the generalization phase.&lt;br /&gt;&lt;br /&gt;Among the many variants of decision tree learning algorithms, CART is probably the one that best detects the right size of the tree.&lt;br /&gt;&lt;br /&gt;In this tutorial, we describe the selection mechanism used by CART during the post-pruning process. 
We also show how to set the appropriate value of the parameter of the algorithm in order to obtain a specific (user-defined) tree.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords:&lt;/strong&gt; decision tree, CART, 1-SE Rule, post-pruning&lt;br /&gt;&lt;strong&gt;Components:&lt;/strong&gt; Discrete select examples, Supervised Learning, C-RT, Test&lt;br /&gt;&lt;strong&gt;Tutorial:&lt;/strong&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Tree_Post_Pruning.pdf" target="_blank"&gt;en_Tanagra_Tree_Post_Pruning.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset:&lt;/strong&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/adult_cart_decision_trees.zip" target="_blank"&gt;adult_cart_decision_trees.zip&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;br /&gt;L. Breiman, J. Friedman, R. Olshen, C. Stone, "Classification and Regression Trees", California: Wadsworth International, 1984.&lt;br /&gt;R. Rakotomalala, "Arbres de décision", Revue Modulad, 33, 163-187, 2005 (&lt;a href="http://eric.univ-lyon2.fr/%7Ericco/doc/tutoriel_arbre_revue_modulad_33.pdf" target="_blank"&gt;tutoriel_arbre_revue_modulad_33.pdf&lt;/a&gt;)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-2075493981494483619?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/2075493981494483619'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/2075493981494483619'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2010/01/cart-determining-right-size-of-tree.html' title='CART - Determining the right size of the tree'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' 
height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-8966920901492095105</id><published>2009-12-23T22:48:00.000-08:00</published><updated>2009-12-23T22:53:46.431-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Exploratory Data Analysis'/><title type='text'>VARIMAX rotation in Principal Component Analysis</title><content type='html'>A VARIMAX rotation is a change of coordinates used in principal component analysis (PCA) that maximizes the sum of the variances of the squared loadings. Thus, all the coefficients (squared correlation with factors) will be either large or near zero, with few intermediate values.&lt;br /&gt;&lt;br /&gt;The goal is to associate each variable with at most one factor, which simplifies the interpretation of the PCA results. The variables are thus split (as much as possible) into disjoint sets, each associated with one and only one factor.&lt;br /&gt;&lt;br /&gt;In this tutorial, we show how to perform this kind of rotation from the results of a standard PCA in Tanagra.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords:&lt;/strong&gt; PCA, principal component analysis, VARIMAX, QUARTIMAX&lt;br /&gt;&lt;strong&gt;Components:&lt;/strong&gt; Principal Component Analysis, Factor Rotation&lt;br /&gt;&lt;strong&gt;Tutorial:&lt;/strong&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Pca_Varimax.pdf" target="_blank"&gt;en_Tanagra_Pca_Varimax.pdf&lt;br /&gt;&lt;/a&gt;&lt;strong&gt;Dataset:&lt;/strong&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/crime_dataset_from_DASL.xls" target="_blank"&gt;crime_dataset_from_DASL.xls&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;References&lt;/span&gt;:&lt;br /&gt;Wikipedia, "&lt;a href="http://en.wikipedia.org/wiki/Varimax_rotation" target="_blank"&gt;Varimax rotation&lt;/a&gt;"&lt;br /&gt;H. 
Abdi, "&lt;a href="http://www.utd.edu/%7Eherve/Abdi-rotations-pretty.pdf" target="_blank"&gt;Factor rotations in Factor Analyses&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-8966920901492095105?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8966920901492095105'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8966920901492095105'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/12/varimax-rotation-in-principal-component.html' title='VARIMAX rotation in Principal Component Analysis'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-5630831908638594898</id><published>2009-12-19T21:28:00.000-08:00</published><updated>2009-12-19T21:32:53.357-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Statistical methods'/><title type='text'>Kruskal–Wallis one-way analysis of variance</title><content type='html'>The tests for comparison of population try to determine if K (K   2) samples come from the same underlying population according to a dependent variable (X). In other words, we try to determine if the underlying distribution of X is the same whatever the group.&lt;br /&gt;&lt;br /&gt;We talk about non parametric tests when we do not make assumption about the shape of the distribution of the dependent variable. 
They are considered "distribution free" methods, as opposed to the parametric approaches.&lt;br /&gt;&lt;br /&gt;In this tutorial, we implement various tests for differences in location. The Kruskal-Wallis test is certainly the most widely used one when we try to determine if the scores among groups are stochastically the same. But other tests exist. We compare the results obtained. We complete the analysis by conducting multiple comparisons in order to identify groups that differ significantly from each other.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: non parametric test, independent samples, Kruskal-Wallis, Van der Waerden, Fisher-Yates-Terry-Hoeffding, median test, tests for differences in location&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: KRUSKAL-WALLIS 1-WAY ANOVA, MEDIAN TEST, VAN DER WAERDEN 1-WAY ANOVA, FYTH 1-WAY ANOVA&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Nonparametric_Test_KW_and_related.pdf" target="_blank"&gt;en_Tanagra_Nonparametric_Test_KW_and_related.pdf&lt;br /&gt;&lt;/a&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/wine_evaluation_nonparametric.xls" target="_blank"&gt;wine_evaluation_nonparametric.xls&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;R. Lowry, « Concepts and Applications of Inferential Statistics », &lt;a href="http://faculty.vassar.edu/lowry/ch14a.html" target="_blank"&gt;SubChapter 14a&lt;/a&gt;. The Kruskal-Wallis Test for 3 or More Independent Samples.&lt;br /&gt;Wikipedia. 
&lt;a href="http://en.wikipedia.org/wiki/Kruskal%E2%80%93Wallis_one-way_analysis_of_variance" target="_blank"&gt;Kruskal–Wallis one-way analysis of variance&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-5630831908638594898?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5630831908638594898'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5630831908638594898'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/12/kruskalwallis-one-way-analysis-of.html' title='Kruskal–Wallis one-way analysis of variance'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-3519676778641821964</id><published>2009-12-17T01:21:00.001-08:00</published><updated>2009-12-17T01:28:43.390-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Statistical methods'/><title type='text'>Tests for differences in scale</title><content type='html'>&lt;span style="color: rgb(153, 0, 0);"&gt;Parametric and non parametric tests for differences in scale.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The tests of equal variability (or dispersion, or scale, or simply variance) are often presented as a preliminary test before the comparison of means, in order to verify the homoscedasticity assumption. But this is not their only purpose. Comparing dispersions can be an end in itself. For example, we wish to compare the performance of two heating systems. 
The average temperature at the center of the room is the same; however, one may wish to compare how heat diffuses in different parts of the room.&lt;br /&gt;&lt;br /&gt;The parametric tests are based primarily on the Gaussian distribution. The test becomes a test for homogeneity of variance. We highlight the Levene test in this tutorial. Other tests exist (the Bartlett test, for instance); we mention them in this tutorial.&lt;br /&gt;&lt;br /&gt;When the normality assumption is questionable, when the sample size is small, or when the variable is ordinal rather than continuous, it is more appropriate to use non parametric tests. These are called tests for equality of scales or dispersions. In fact the procedures are not based on estimated variances. We use well-known techniques such as the Ansari-Bradley test, the Mood test and the Klotz test. Being non parametric, they have a broader scope. Some of these tests have a drawback: they are not applicable when the conditional distributions do not share the same central tendency parameter (the median in general, although we can adjust the values by centering them on the median).&lt;br /&gt;&lt;br /&gt;In this tutorial, we show how to implement these various tests with Tanagra.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: parametric test, non parametric test, independent samples, Levene test, Bartlett test, Brown-Forsythe test, Mood test, Klotz test, Ansari-Bradley test&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: LEVENE’S TEST, ANSARI-BRADLEY SCALE TEST, MOOD SCALE TEST, KLOTZ SCALE TEST&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Nonparametric_Test_for_Scale_Differences.pdf" target="_blank"&gt;en_Tanagra_Nonparametric_Test_for_Scale_Differences.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/tests_for_scale_differences.xls" 
target="_blank"&gt;tests_for_scale_differences.xls&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;NIST, "Quantitative techniques", section 1.3.5 - http://www.itl.nist.gov/div898/handbook/eda/section3/eda35.htm&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-3519676778641821964?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3519676778641821964'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3519676778641821964'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/12/parametric-and-non-parametric-tests-for.html' title='Tests for differences in scale'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-8315216096781488919</id><published>2009-12-09T05:21:00.000-08:00</published><updated>2009-12-09T05:32:18.454-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Regression analysis'/><title type='text'>Outliers and influential points in regression</title><content type='html'>The analysis of outliers and influential points is an important step in regression diagnostics. The goal is to detect (1) the points which are very different from the others (outliers), i.e. they do not seem to belong to the analyzed population; or (2) the points which, if removed, lead to a different model (influential points). 
The distinction between these kinds of points is not always obvious.&lt;br /&gt;&lt;br /&gt;In this tutorial, we implement several indicators for the analysis of outliers and influential points. To avoid confusion about the definitions of indicators (some indicators are calculated differently from one tool to another), we compare our results with state-of-the-art tools such as SAS and R. First, we give the results described in the SAS documentation. Then, we describe the process and the results under Tanagra and R. In conclusion, we note that these tools give the same results.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords:&lt;/strong&gt; linear regression, outliers, influential points, standardized residuals, studentized residuals, leverage, dffits, cook's distance, covratio, dfbetas, R software&lt;br /&gt;&lt;strong&gt;Components:&lt;/strong&gt; Multiple linear regression, Outlier detection, DfBetas&lt;br /&gt;&lt;strong&gt;Tutorial:&lt;/strong&gt; &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Outlier_Influential_Points_for_Regression.pdf" target="_blank"&gt;en_Tanagra_Outlier_Influential_Points_for_Regression.pdf&lt;br /&gt;&lt;/a&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/USPopulation.xls" target="_blank"&gt;USPopulation.xls&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References: &lt;/strong&gt;&lt;br /&gt;SAS STAT User’s Guide, « The REG Procedure – &lt;a href="http://v8doc.sas.com/sashtml/stat/chap55/sect33.htm#regprv" target="_blank"&gt;Predicted and Residual Values&lt;/a&gt; »&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-8315216096781488919?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8315216096781488919'/><link rel='self' 
type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8315216096781488919'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/12/outliers-and-influential-points-in.html' title='Outliers and influential points in regression'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-3172181552173506727</id><published>2009-12-07T07:41:00.000-08:00</published><updated>2009-12-07T07:44:00.907-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Statistical methods'/><title type='text'>Tests for comparing two related samples</title><content type='html'>Dependent samples, also called related samples or correlated samples, occur when the response of the nth person in the second sample is partly a function of the response of the nth person in the first sample. There are several common forms of sample dependency. (1) Before-after and other studies in which the same people are surveyed at different points in time, including panel studies. (2) Matched-pairs studies in which each subject of the study is paired with a subject of a comparison group on the basis of matching factors (e.g. age, sex, income, etc.). (3) The pairs can simply be inherent in the situation we are trying to analyze. For instance, one tries to compare the time spent watching television by the man and the woman within a couple. The blocks are naturally households. Men and women should not be considered as independent observations.&lt;br /&gt;&lt;br /&gt;The aim of tests for related samples is to exclude the within-group variation from the analysis. The differences are computed within each pair of subjects. 
In this tutorial, we show how to implement 3 tests for two related samples. Two of them are non-parametric (sign test and Wilcoxon matched-pairs ranks test), the last one is the parametric t-test for related samples.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: parametric test, non-parametric test, paired samples, sign test, wilcoxon signed rank test, paired samples t-test, normality test&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: SIGN TEST, WILCOXON SIGNED RANK TEST, PAIRED T-TEST, FORMULA, NORMALITY TEST&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Nonparametric_Test_for_Two_Related_Samples.pdf" target="_blank"&gt;en_Tanagra_Nonparametric_Test_for_Two_Related_Samples.pdf&lt;br /&gt;&lt;/a&gt;&lt;strong&gt;Dataset&lt;/strong&gt; : &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/comparison_2_related_samples.xls" target="_blank"&gt;comparison_2_related_samples.xls&lt;br /&gt;&lt;/a&gt;&lt;strong&gt;References&lt;/strong&gt; :&lt;br /&gt;R. Lowry, « Concepts and Applications of Inferential Statistics », &lt;a href="http://faculty.vassar.edu/lowry/ch12a.html" target="_blank"&gt;SubChapter 12a&lt;/a&gt;. 
The Wilcoxon Signed-Rank Test.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-3172181552173506727?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3172181552173506727'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3172181552173506727'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/12/tests-for-comparing-two-related-samples.html' title='Tests for comparing two related samples'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-8727733864228159845</id><published>2009-12-02T07:53:00.000-08:00</published><updated>2009-12-02T07:59:59.937-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Statistical methods'/><title type='text'>Multivariate tests for comparing populations</title><content type='html'>&lt;div&gt;&lt;/div&gt;Multivariate parametric hypothesis testing for comparing populations.&lt;br /&gt;&lt;br /&gt;A multivariate test for comparing populations tries to determine whether K (K ≥ 2) samples come from the same underlying population with respect to a set of variables of interest (X1,…,Xp).&lt;br /&gt;&lt;br /&gt;We speak of a parametric test when we assume that the data come from a given type of probability distribution. Thus, the inference relies on the parameters of the distribution. 
For instance, if we assume that the data are drawn from a multivariate Gaussian distribution, the hypothesis testing relies on the mean vector or on the covariance matrix.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: Hotelling's T2, Wilks' Lambda, Box’s M test, Bartlett's test, mean vector, covariance matrix, MANOVA&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: UNIVARIATE CONTINUOUS STAT, HOTELLING’S T2, HOTELLING’S T2 HETEROSCEDASTIC, BOX’S M TEST, ONE-WAY MANOVA&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Multivariate_Parametric_Tests.pdf" target="_blank"&gt;en_Tanagra_Multivariate_Parametric_Tests.pdf&lt;br /&gt;&lt;/a&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/credit_approval.xls" target="_blank"&gt;credit_approval.xls&lt;br /&gt;&lt;/a&gt;&lt;strong&gt;References&lt;/strong&gt; :&lt;br /&gt;S. Rathburn, A. Wiesner, "&lt;a href="http://www.stat.psu.edu/online/development/stat505/" target="_blank"&gt;STAT 505: Applied Multivariate Statistical Analysis&lt;/a&gt;", The Pennsylvania State University.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-8727733864228159845?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8727733864228159845'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8727733864228159845'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/12/multivariate-tests-for-comparing.html' title='Multivariate tests for comparing populations'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' 
height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-8929581468242970928</id><published>2009-11-30T06:05:00.000-08:00</published><updated>2009-11-30T06:13:58.617-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Statistical methods'/><title type='text'>Parametric tests for comparing populations</title><content type='html'>&lt;div&gt;&lt;/div&gt;Parametric hypothesis testing for comparison of two or more populations. Independent and dependent samples.&lt;br /&gt;&lt;br /&gt;Tests for comparing populations try to determine whether K (K &gt;= 2) samples come from the same underlying population with respect to a variable of interest (X). We speak of a parametric test when we assume that the data come from a given type of probability distribution. Thus, the inference relies on the parameters of the distribution. For instance, if we assume that the distribution of the data is Gaussian, the hypothesis testing relies on the mean or on the variance.&lt;br /&gt;&lt;br /&gt;We handle univariate tests in this tutorial, i.e. we have only one variable of interest. 
When we want to analyze simultaneously several variables, we talk about multivariate test.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Keywords&lt;/span&gt;: t-test, F-Test, Bartlett's test, Levene's test, Brown-Forsythe's test, independent samples, dependent samples, paired samples, matched-pairs samples, anova, welch's anova, randomized complete blocks&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Components&lt;/span&gt;: MORE UNIVARIATE CONT STAT, NORMALITY TEST, T-TEST, T-TEST UNEQUAL VARIANCE, ONE-WAY ANOVA, WELCH ANOVA, FISHER’S TEST, BARTLETT’S TEST, LEVENE’S TEST, BROWN-FORSYTHE TEST, PAIRED T-TEST, PAIRED V-TEST, ANOVA RANDOMIZED BLOCKS&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tutorial&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Univariate_Parametric_Tests.pdf" target="_blank"&gt;en_Tanagra_Univariate_Parametric_Tests.pdf&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Dataset&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/credit_approval.xls" target="_blank"&gt;credit_approval.xls&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;References&lt;/span&gt;:&lt;br /&gt;NIST/SEMATECH e-Handbook of Statistical Methods, &lt;a href="http://www.itl.nist.gov/div898/handbook/" target="_blank"&gt;http://www.itl.nist.gov/div898/handbook/&lt;/a&gt; (Chapter 7, Product and Process Comparisons)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-8929581468242970928?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8929581468242970928'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8929581468242970928'/><link rel='alternate' type='text/html' 
href='http://data-mining-tutorials.blogspot.com/2009/11/parametric-tests-for-comparing.html' title='Parametric tests for comparing populations'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-7075276611794896607</id><published>2009-11-26T10:46:00.000-08:00</published><updated>2009-11-26T10:54:20.593-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>Three curves for classifier assessment</title><content type='html'>&lt;div&gt;Evaluation of classifiers is an important step of the supervised learning process. We want to measure the performance of the classifier. On the one hand, we have the confusion matrix and its associated indicators, very popular in academic publications. On the other hand, in real applications, users prefer curves which may seem mysterious to people outside the domain (e.g. &lt;span class="Apple-style-span"  style="color:#33CC00;"&gt;ROC curve&lt;/span&gt; for the epidemiologists, &lt;span class="Apple-style-span"  style="color:#33CC00;"&gt;gain chart or cumulative lift curve&lt;/span&gt; in the marketing domain, &lt;span class="Apple-style-span"  style="color:#33CC00;"&gt;precision recall&lt;/span&gt; curve in the information retrieval domain, etc.).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In this tutorial, we first give the details of the calculation of these curves by creating them "by hand" in a spreadsheet. 
Then, we use &lt;span class="Apple-style-span"  style="color:#3366FF;"&gt;&lt;b&gt;Tanagra 1.4.33&lt;/b&gt;&lt;/span&gt; and &lt;b&gt;&lt;span class="Apple-style-span"  style="color:#3366FF;"&gt;R 2.9.2&lt;/span&gt;&lt;/b&gt; to obtain them. We use these curves to compare the performances of logistic regression and a support vector machine (Radial Basis Function kernel).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Keywords&lt;/b&gt;: roc curve, gain chart, precision recall curve, lift curve, logistic regression, support vector machine, svm, radial basis function kernel, rbf kernel, e1071 package, R software, glm&lt;/div&gt;&lt;div&gt;&lt;b&gt;Components&lt;/b&gt;: DISCRETE SELECT EXAMPLES, BINARY LOGISTIC REGRESSION, SCORING, C-SVC, ROC CURVE, LIFT CURVE, PRECISION-RECALL CURVE&lt;br /&gt;&lt;b&gt;Tutorial&lt;/b&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Spv_Learning_Curves.pdf" target="_blank"&gt;en_Tanagra_Spv_Learning_Curves.pdf&lt;/a&gt;&lt;br /&gt;&lt;b&gt;Dataset&lt;/b&gt; : &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/heart_disease_for_curves.zip" target="_blank"&gt;heart_disease_for_curves.zip&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-7075276611794896607?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/7075276611794896607'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/7075276611794896607'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/11/three-curves-for-classifier-assessment.html' title='Three curves for classifier 
assessment'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-8604658843726329564</id><published>2009-11-21T20:08:00.000-08:00</published><updated>2009-11-21T20:09:21.128-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Tanagra'/><title type='text'>Tanagra - Version 1.4.34</title><content type='html'>A component for the induction of predictive rules (RULE INDUCTION) was added under the "Supervised Learning" tab. Its use is described in a tutorial available online (will be translated soon).&lt;br /&gt;&lt;br /&gt;The DECISION LIST component has been improved: we changed the test done during the pre-pruning process. The formula is described in the tutorial above.&lt;br /&gt;&lt;br /&gt;The SAMPLING and STRATIFIED SAMPLING components (Instance Selection tab) have been slightly modified. It is now possible to set the seed of the pseudorandom number generator ourselves.&lt;br /&gt;&lt;br /&gt;Following a report from Anne Viallefont, the calculation of degrees of freedom in tests on contingency tables is now more generic. Indeed, the calculation was wrong when the database was filtered and some margins (row or column) contained a number equal to zero. Anne, thank you for this information. More generally, thank you to everyone who sent me comments. Programming has always been for me a kind of leisure. The real work starts when it is necessary to check the results, compare them with the available references, cross them with other data mining tools, free or not, understand the possible differences, etc. 
At this step, your help is really valuable.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-8604658843726329564?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8604658843726329564'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8604658843726329564'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/11/tanagra-version-1434.html' title='Tanagra - Version 1.4.34'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-4094907759355781758</id><published>2009-11-09T06:23:00.001-08:00</published><updated>2009-11-09T06:33:35.709-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><title type='text'>Handling Missing values in SIPINA</title><content type='html'>Dealing with missing values is a difficult problem. The programming in itself is not a problem; we just flag the missing value with a specific code. In contrast, the treatment before or during data analysis is very complicated.&lt;br /&gt;&lt;br /&gt;Various techniques are available to handle missing values in SIPINA. 
In this tutorial, we show how to implement them and what their consequences are in the decision tree learning context (C4.5 algorithm; Quinlan, 1993).&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Keywords&lt;/b&gt;: missing value, missing data, listwise deletion, casewise deletion, data imputation, C4.5, decision tree&lt;br /&gt;&lt;b&gt;Tutorial&lt;/b&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/doc/en_Sipina_Missing_Data.pdf" target="_blank"&gt;en_Sipina_Missing_Data.pdf&lt;/a&gt;&lt;br /&gt;&lt;b&gt;Dataset&lt;/b&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/dataset/ronflement_missing_data.zip" target="_blank"&gt;ronflement_missing_data.zip&lt;/a&gt;&lt;br /&gt;&lt;b&gt;References&lt;/b&gt;:&lt;br /&gt;P.D. Allison, « Missing Data », in Quantitative Applications in the Social Sciences Series n°136, Sage University Paper, 2002.&lt;br /&gt;J. Bernier, D. Haziza, K. Nobrega, P. Whitridge, « &lt;a href="http://www.ssc.ca/documents/case_studies/2002/missing_e.html" target="_blank"&gt;Handling Missing Data – Case Study &lt;/a&gt;», Statistical Society of Canada.&lt;br /&gt;D. 
Garson, "&lt;a href="http://faculty.chass.ncsu.edu/garson/PA765/missing.htm" target="_blank"&gt;Data Imputation for Missing Values&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-4094907759355781758?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4094907759355781758'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4094907759355781758'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/11/handling-missing-values-in-sipina.html' title='Handling Missing values in SIPINA'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-2118397077776989554</id><published>2009-11-04T06:15:00.001-08:00</published><updated>2009-11-04T06:24:52.360-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><title type='text'>Model deployment with Sipina</title><content type='html'>Model deployment is the last step of the Data Mining process. In its simplest form in a supervised learning task, it consists in applying a predictive model to unlabeled cases. &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Applying the model to unseen cases is a very useful functionality. But it would be even more interesting if we could state its accuracy. Indeed, a misclassification can have dramatic consequences. 
We must measure the risk we take when we make decisions from a predictive model. An indication of the performance of a classifier is important when we decide whether or not to deploy it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In this tutorial, we show how to apply a classifier to an unlabeled sample with Sipina. We also show how to estimate the generalization error rate using a resampling scheme such as the bootstrap. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Keywords&lt;/b&gt;: model deployment, unseen cases, unlabeled instances, decision tree, sipina, linear discriminant analysis&lt;/div&gt;&lt;div&gt;&lt;b&gt;Tutorial&lt;/b&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/doc/en_sipina_deployment.pdf" target="_blank"&gt;en_sipina_deployment.pdf&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Dataset&lt;/b&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/dataset/wine_deployment.xls" target="_blank"&gt;wine_deployment.xls&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;References&lt;/b&gt;:&lt;/div&gt;&lt;div&gt;Tanagra Tutorials, "&lt;a href="http://data-mining-tutorials.blogspot.com/2008/11/apply-classifier-on-new-dataset.html"&gt;Applying a classifier on a new dataset (Deployment)&lt;/a&gt;"&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-2118397077776989554?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/2118397077776989554'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/2118397077776989554'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/11/model-deployment-with-sipina.html' title='Model deployment with Sipina'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-6890770492290342043</id><published>2009-11-02T22:49:00.000-08:00</published><updated>2009-11-02T22:58:31.312-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><category scheme='http://www.blogger.com/atom/ns#' term='Data file handling'/><title type='text'>Sipina - Supported file format</title><content type='html'>Data access is the first step of the data mining process. It is a crucial step. It is one of the main criteria used when we want to assess the quality of a tool. If we are not able to load a dataset, we cannot perform any kind of analysis. The software is not usable. If data access is not easy and requires complicated operations, we will devote less time to the other steps of the data exploration.&lt;br /&gt;&lt;br /&gt;The first goal of this tutorial is to describe the various file formats that are supported in Sipina. Some of the solutions are described in more depth in other tutorials; we indicate the appropriate reference in these cases. 
The second goal is to describe the behavior of these formats when we handle a large dataset with &lt;span style="color: rgb(51, 51, 255); font-weight: bold;"&gt;4,817,099 instances&lt;/span&gt; and &lt;span style="color: rgb(51, 51, 255); font-weight: bold;"&gt;42 variables&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Last, we learn a decision tree on this dataset in order to evaluate the behavior of Sipina when we process a large data file.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Keywords&lt;/span&gt;: file format, data file importation, decision tree, large dataset, csv, arff, fdm, fdz, zdm&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tutorial&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/doc/en_Sipina_File_Format.pdf" target="_blank"&gt;en_Sipina_File_Format.pdf&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Dataset&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/weather.txt" target="_blank"&gt;weather.txt&lt;/a&gt; and &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/kdd-cup-discretized-descriptors.txt.zip" target="_blank"&gt;kdd-cup-discretized-descriptors.txt.zip&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-6890770492290342043?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6890770492290342043'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6890770492290342043'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/11/sipina-supported-file-format.html' title='Sipina - Supported file format'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' 
height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-7169390756642272102</id><published>2009-10-30T22:16:00.001-07:00</published><updated>2009-10-30T22:28:00.756-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><title type='text'>Importing Weka file (.arff) into Sipina</title><content type='html'>&lt;div&gt;&lt;/div&gt;WEKA is a very popular Data Mining tool. It supplies a very large number of machine learning methods. WEKA can handle various file formats, but it has a native format (.ARFF), which is a text file with additional specifications.&lt;br /&gt;&lt;br /&gt;The text file format is very simple and very easy to manipulate. But, on the other hand, the processing of this kind of file is often slower than that of a binary file format. When we deal with a moderately sized file, the text file is efficient enough. The differences in processing time are not discernible.&lt;br /&gt;&lt;br /&gt;In this tutorial, we show how to import the ARFF file format into Sipina. We subdivide the dataset into train and test samples. Then we learn and assess a decision tree.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Keywords&lt;/span&gt;: decision tree, c4.5, file format, data file importation, weka, arff&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tutorial&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/doc/en_sipina_weka_file_format.pdf" target="_blank"&gt;en_sipina_weka_file_format.pdf&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Dataset&lt;/span&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/dataset/ionosphere.arff" target="_blank"&gt;ionosphere.arff&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;References&lt;/span&gt;:&lt;br /&gt;M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. 
Witten, "The Weka Data Mining Software: An Update", SIGKDD Explorations, Vol. 11, Issue 1, 2009.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-7169390756642272102?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/7169390756642272102'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/7169390756642272102'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/10/importing-weka-file-arff-into-sipina.html' title='Importing Weka file (.arff) into Sipina'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-1532682281200943046</id><published>2009-10-28T06:27:00.000-07:00</published><updated>2009-11-01T15:32:21.241-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><title type='text'>Local sampling for decision tree learning</title><content type='html'>During the decision tree learning process, the algorithm selects the best variable according to a goodness-of-fit measure when it tries to split a node. The calculation can take a long time, particularly when it deals with continuous descriptors, for which it must detect the optimal cut point.&lt;br /&gt;&lt;br /&gt;For all the decision tree algorithms, &lt;b&gt;Sipina&lt;/b&gt; can use a local sampling option when it searches for the best splitting attribute on a node. 
The idea is the following: on a node, it draws a random sample of size n, and then all the computations are made on this sample. Of course, if n is greater than the number of examples available on the node, Sipina uses all the available examples. This occurs when we have a very large tree with a high number of nodes.&lt;br /&gt;&lt;br /&gt;We have described this approach in a paper (Chauchat and Rakotomalala, IFCS-2000). In this tutorial, we describe how to implement it with Sipina. We note that using a sample on each node dramatically reduces the execution time without loss of accuracy.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;We use a version of the WAVEFORM dataset with &lt;span style="color: rgb(0, 153, 0);"&gt;21 continuous descriptors&lt;/span&gt; and &lt;span style="color: rgb(0, 153, 0); font-weight: bold;"&gt;2,000,000 instances&lt;/span&gt;. We obtain the tree in &lt;span style="color: rgb(51, 51, 255); font-weight: bold;"&gt;3 seconds&lt;/span&gt;&lt;span style="font-weight: bold;"&gt; &lt;/span&gt;on our computer.&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;strong&gt;Keywords&lt;/strong&gt; : decision tree, sampling, large dataset&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt; : SAMPLING, ID3, TEST&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt; : &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Sipina_Sampling.pdf" target="_blank"&gt;en_Sipina_Sampling.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset &lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/wave2M.zip" target="_blank"&gt;wave2M.zip&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt; :&lt;br /&gt;J.H. Chauchat, R. Rakotomalala, « &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/doc/chauchat_rakotomalala_ifcs2000.pdf" target="_blank"&gt;A new sampling strategy for building decision trees from large databases &lt;/a&gt;», Proc. of IFCS-2000, pp. 
199-204, 2000.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-1532682281200943046?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1532682281200943046'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1532682281200943046'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/10/local-sampling-approach-for-decision.html' title='Local sampling for decision tree learning'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-6510599570100452862</id><published>2009-10-02T21:53:00.000-07:00</published><updated>2009-10-02T21:54:45.141-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Tanagra'/><title type='text'>Tanagra - Version 1.4.33</title><content type='html'>Several logistic regression diagnostics and evaluation tools were implemented; one of them (the reliability diagram) can be applied to any supervised method.&lt;br /&gt;&lt;br /&gt;1. The estimated covariance matrix&lt;br /&gt;2. Hosmer - Lemeshow Test&lt;br /&gt;3. Reliability diagram (also called calibration plot)&lt;br /&gt;4. 
Analysis of residuals, outliers and influential points (Pearson residuals, deviance residuals, difchisq, difdev, leverage, Cook's distance, dfbeta, dfbetas)&lt;br /&gt;&lt;br /&gt;A tutorial describing the use of these tools will be available soon.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-6510599570100452862?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6510599570100452862'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6510599570100452862'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/10/tanagra-version-1433.html' title='Tanagra - Version 1.4.33'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-5009643938825691974</id><published>2009-09-28T05:26:00.000-07:00</published><updated>2009-09-28T05:32:22.847-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Tanagra'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>Using batch mode for Tanagra</title><content type='html'>For large simulations, it is more convenient to use the BATCH mode capabilities of Tanagra rather than opening an interactive session. This is the case for instance when we compare the performance of various algorithms on the same dataset; when we try to automatically find the best parameters for a learning method; when we repeat the same treatment on different datasets, etc. In these contexts, it is more useful to save the diagrams in text mode (.TDM file format). 
It will be easier to handle it outside TANAGRA, with a text editor for instance.&lt;br /&gt;&lt;br /&gt;In this tutorial, we want to compare the performances of the naïve bayes classifier with and without the feature selection process. We know that the naïve bayes classifier is highly sensitive to irrelevant features. The goal of this tutorial is to evaluate the efficiency of the FCBF feature selection method in this context.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords:&lt;/strong&gt; batch mode, supervised learning, naive bayes, feature selection, experiments&lt;br /&gt;&lt;strong&gt;Components:&lt;/strong&gt; NAIVE BAYES, FCBF, CROSS VALIDATION&lt;br /&gt;&lt;strong&gt;Tutorial: &lt;/strong&gt;&lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/english_dr_utiliser_tanagra_en_mode_batch.pdf" target="_blank"&gt;english_dr_utiliser_tanagra_en_mode_batch.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset:&lt;/strong&gt; &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/tanagra_batch_execution.zip" target="_blank"&gt;tanagra_batch_execution.zip&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-5009643938825691974?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5009643938825691974'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5009643938825691974'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/09/using-batch-mode-for-tanagra.html' title='Using batch mode for Tanagra'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-4133764444418670009</id><published>2009-07-14T21:11:00.000-07:00</published><updated>2009-07-15T00:19:13.874-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Statistical methods'/><title type='text'>Nonparametric tests for groups comparison - Independent samples - Differences in location</title><content type='html'>The aim of a homogeneity test (or test for differences between groups) is to check if K (K &gt;= 2) samples are drawn from the same population according to a variable of interest. In other words, we check if the probability distribution is the same in each sample.&lt;br /&gt;&lt;br /&gt;The nonparametric tests make no assumptions about the distribution of the data. They are also called "distribution free" tests.&lt;br /&gt;&lt;br /&gt;In this tutorial, we show how to implement nonparametric homogeneity tests for differences in location for K = 2 populations, i.e. the distributions of the populations are the same except for a shift in location (central tendency). The Kolmogorov-Smirnov test is the most general one. It checks all kinds of differences between the cumulative distribution functions (CDF). Afterwards, we can implement other tests which characterize the difference more precisely. The Wilcoxon-Mann-Whitney test is certainly the most popular one. 
We will see in this tutorial that other tests can also be implemented.&lt;br /&gt;&lt;br /&gt;Some of the tests introduced here are usable when the number of groups is greater than 2 (K &gt; 2).&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: nonparametric test, Kolmogorov-Smirnov test, Wilcoxon-Mann-Whitney test, Van der Waerden test, Fisher-Yates-Terry-Hoeffding test, median test, location model&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: FYTH 1-WAY ANOVA, K-S 2-SAMPLE TEST, MANN-WHITNEY COMPARISON, MEDIAN TEST, VAN DER WAERDEN 1-WAY ANOVA&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_Nonparametric_Test_MW_and_related.pdf" target="_blank"&gt;en_Tanagra_Nonparametric_Test_MW_and_related.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;:&lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/machine_packs_cartons.xls" target="_blank"&gt; machine_packs_cartons.xls&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;R. Rakotomalala, « &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/cours/cours/Comp_Pop_Tests_Nonparametriques.pdf" target="_blank"&gt;Comparaison de populations. 
Tests non paramétriques&lt;/a&gt; », Université Lyon 2 (in french).&lt;br /&gt;Wikipedia, « &lt;a href="http://en.wikipedia.org/wiki/Nonparametric" target="_blank"&gt;Non-parametric statistics&lt;/a&gt; ».&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-4133764444418670009?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4133764444418670009'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4133764444418670009'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/07/nonparametric-test-for-group.html' title='Nonparametric tests for groups comparison - Independent samples - Differences in location'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-3576664280342751216</id><published>2009-07-09T04:10:00.000-07:00</published><updated>2009-07-09T04:17:04.952-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>Resampling methods for error estimation</title><content type='html'>The ability to predict correctly is one of the most important criteria to evaluate classifiers in supervised learning. The preferred indicator is the error rate (1 - accuracy rate). It states the probability of misclassification of a classifier. In most cases we do not know the true error rate because we do not have the whole population and we do not know the probability distribution of the data. 
So we need to compute an estimate from the available dataset.&lt;br /&gt;&lt;br /&gt;In the small sample context, it is preferable to use resampling approaches for error rate estimation. In this tutorial, we study the behavior of the cross validation (cv), leave one out (lvo) and bootstrap (boot). All of them are based on a repeated train-test process, but in different configurations. We keep in mind that the aim is to evaluate the error rate of the classifier created on the whole sample. Thus, the intermediate classifiers computed on each learning session are not really interesting. This is why they are rarely provided by the data mining tools.&lt;br /&gt;&lt;br /&gt;The main supervised learning method used is the linear discriminant analysis (LDA). We will see at the end of this tutorial that the behavior observed for this learning approach is not the same if we use another approach such as a decision tree learner (C4.5).&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: resampling, generalization error rate, cross validation, bootstrap, leave one out, linear discriminant analysis, C4.5&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: Supervised Learning, Cross-validation, Bootstrap, Test, Leave-one-out, Linear discriminant analysis, C4.5&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Resampling_Error_Estimation.pdf" target="_blank"&gt;en_Tanagra_Resampling_Error_Estimation.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/wave_ab_err_rate.zip" target="_blank"&gt;wave_ab_err_rate.zip&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Reference&lt;/strong&gt;:&lt;br /&gt;"&lt;a href="http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html" target="_blank"&gt;What are cross validation and bootstrapping?&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/5496815755861370799-3576664280342751216?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3576664280342751216'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3576664280342751216'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/07/resampling-methods-for-error-estimation.html' title='Resampling methods for error estimation'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-868264666218776416</id><published>2009-07-04T22:00:00.000-07:00</published><updated>2009-07-04T22:09:55.220-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>Implementing SVM on large dataset</title><content type='html'>Support vector machines (SVM) are a set of related supervised learning methods used for classification and regression. Our aim is to compare various free implementations of SVM, in terms of accuracy and computation time. Indeed, because of the heuristic nature of the algorithm, we can obtain different results on the same dataset depending on the tool used. In fact, in the publications describing the performance of SVM, we should not only specify the parameters of the algorithm but also indicate which tool was used. 
The latter can influence the results.&lt;br /&gt;&lt;br /&gt;SVM is effective in domains with a very high number of predictive variables, when the ratio between the number of variables and the number of observations is unfavorable. We are in a domain which is particularly favorable to SVM in this tutorial. We want to discriminate two families of proteins from their description with amino acids. We use sequences of 4 characters (4-grams) as descriptors. Thus, we can have &lt;span style="color:#009900;"&gt;a large number of descriptors (31,809)&lt;/span&gt; in comparison to the number of examples (&lt;span style="color:#009900;"&gt;135 instances&lt;/span&gt;).&lt;br /&gt;&lt;br /&gt;We compare &lt;span style="color:#3366ff;"&gt;&lt;strong&gt;Tanagra&lt;/strong&gt;&lt;/span&gt; 1.4.27, &lt;span style="color:#3333ff;"&gt;&lt;strong&gt;Orange&lt;/strong&gt;&lt;/span&gt; 1.0b2, &lt;strong&gt;&lt;span style="color:#3333ff;"&gt;Rapidminer&lt;/span&gt;&lt;/strong&gt; Community Edition 4.2 and &lt;strong&gt;&lt;span style="color:#3333ff;"&gt;Weka&lt;/span&gt;&lt;/strong&gt; 3.5.6.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: svm, support vector machine&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: C-SVC, SVM, SUPERVISED LEARNING, CROSS-VALIDATION&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Perfs_Comp_SVM.pdf" target="_blank"&gt;en_Tanagra_Perfs_Comp_SVM.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/wide_protein_classification.zip" target="_blank"&gt;wide_protein_classification.zip&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Reference&lt;/strong&gt;:&lt;br /&gt;Wikipedia (en), « &lt;a href="http://en.wikipedia.org/wiki/Support_vector_machine" target="_blank"&gt;Support vector machine&lt;/a&gt; »&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/5496815755861370799-868264666218776416?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/868264666218776416'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/868264666218776416'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/07/implementing-svm-on-large-dataset.html' title='Implementing SVM on large dataset'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-3728698017590737050</id><published>2009-07-01T04:36:00.000-07:00</published><updated>2009-07-01T04:42:33.489-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Clustering'/><title type='text'>Self-organizing map (SOM)</title><content type='html'>A self-organizing map (SOM) or self-organizing feature map (SOFM) is a kind of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map. Self-organizing maps are different from other artificial neural networks in the sense that they use a neighborhood function to preserve the topological properties of the input space.&lt;br /&gt;&lt;br /&gt;In this tutorial, we show how to implement Kohonen's SOM algorithm with Tanagra. We try to assess the properties of this approach by comparing the results with those of the PCA algorithm. Then, we compare the results to those of K-Means, which is a clustering algorithm. 
Finally, we implement the Two-step Clustering process by combining the SOM algorithm with the HAC process (Hierarchical Agglomerative Clustering). It is a variant of the &lt;a href="http://data-mining-tutorials.blogspot.com/2009/06/two-step-clustering-for-handling-large.html"&gt;Two-Step Clustering&lt;/a&gt; where we combine K-Means and HAC. We observe that the HAC primarily merges the adjacent cells.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: Kohonen, self organizing map, SOM, clustering, dimensionality reduction, k-means, hierarchical agglomerative clustering, hac, two-step clustering&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: UNIVARIATE CONTINUOUS STAT, UNIVARIATE OUTLIER DETECTION, KOHONEN-SOM, PRINCIPAL COMPONENT ANALYSIS, SCATTERPLOT, K-MEANS, CONTINGENCY CHI-SQUARE, HAC&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Kohonen_SOM.pdf" target="_blank"&gt;en_Tanagra_Kohonen_SOM.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/waveform_unsupervised.xls" target="_blank"&gt;waveform_unsupervised.xls&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Reference&lt;/strong&gt;:&lt;br /&gt;Wikipedia, « Self organizing map », &lt;a href="http://en.wikipedia.org/wiki/Self-organizing_map" target="_blank"&gt;http://en.wikipedia.org/wiki/Self-organizing_map&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-3728698017590737050?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3728698017590737050'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3728698017590737050'/><link rel='alternate' type='text/html' 
href='http://data-mining-tutorials.blogspot.com/2009/07/self-organizing-map-som.html' title='Self-organizing map (SOM)'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-1893398724464907844</id><published>2009-06-29T03:54:00.000-07:00</published><updated>2009-06-29T03:57:24.374-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Statistical methods'/><title type='text'>Univariate outlier detection methods</title><content type='html'>The detection and the treatment of outliers (individuals with unusual values) is an important task of data preparation. Unusual values can distort the results of subsequent data analysis.&lt;br /&gt;&lt;br /&gt;Outliers can be detected on one variable (a 158-year-old man) or on a combination of variables (a 12-year-old boy who runs 100 yards in 10 seconds). In this tutorial, we show how to use the UNIVARIATE OUTLIER DETECTION component. It is intended for the univariate detection of outliers, i.e. it considers each variable individually.&lt;br /&gt;&lt;br /&gt;The approaches implemented in the component come from the NIST website (see reference). 
We also use an additional rule based on the x-sigma deviation from the mean of the variable.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: outlier, influential point&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: MORE UNIVARIATE CONT STAT, SCATTERPLOT WITH LABEL, UNIVARIATE OUTLIER DETECTION, UNIVARIATE CONT STAT&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Outliers_Detection.pdf" target="_blank"&gt;en_Tanagra_Outliers_Detection.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/body_mass_index.xls" target="_blank"&gt;body_mass_index.xls&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;NIST/SEMATECH, « e-Handbook of Statistical Methods », Section 7.1.6, « &lt;a href="http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm" target="_blank"&gt;What are outliers in the data ? &lt;/a&gt;»&lt;br /&gt;R. 
High, "&lt;a href="http://cc.uoregon.edu/cnews/spring2000/outliers.html" target="_blank"&gt;Dealing with 'Outliers': How to Maintain Your Data's Integrity&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-1893398724464907844?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1893398724464907844'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1893398724464907844'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/06/univariate-outlier-detection-methods.html' title='Univariate outlier detection methods'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-6752361841708594803</id><published>2009-06-28T23:55:00.000-07:00</published><updated>2009-06-29T00:11:58.782-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Diagram management'/><title type='text'>Copy paste feature into the diagram</title><content type='html'>&lt;p&gt;When we define the data analysis process into Tanagra, it is possible to copy components (or entire branches of components) towards another location into the diagram. This feature is very helpful when we have to repeat sequences of treatments in different parts of the diagram. The settings are also duplicated.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;In this tutorial, we show how to copy a component or a branch. 
We will see that this feature is helpful when, for instance, we deal with the performance comparisons of supervised learning algorithms on the same dataset. In this context, the processing sequence is always the same, only the method that we want to evaluate is different.&lt;/p&gt;&lt;p&gt;We work on the same project here. We cannot copy paste components between two opened projects. But, in another tutorial, we show &lt;a href="http://data-mining-tutorials.blogspot.com/2008/11/saving-and-loading-sub-diagram.html"&gt;how to save a part of the diagram in an external file&lt;/a&gt;. Thus, the same processing sequence can be applied on multiple datasets.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: copy paste, diagram management, comparison of classifiers, supervised learning, cross validation, dimensionality reduction&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: Supervised learning, Binary logistic regression, C-PLS, C-SVC, Linear discriminant analysis, K-NN, Principal Component Analysis&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Diagram_New_Features.pdf" target="_blank"&gt;en_Tanagra_Diagram_New_Features.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/sonar.xls" target="_blank"&gt;sonar.xls&lt;/a&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-6752361841708594803?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6752361841708594803'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6752361841708594803'/><link rel='alternate' type='text/html' 
href='http://data-mining-tutorials.blogspot.com/2009/06/copy-paste-feature-into-diagram.html' title='Copy paste feature into the diagram'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-2122614456029825949</id><published>2009-06-27T01:04:00.000-07:00</published><updated>2009-06-27T01:09:21.552-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Association rules'/><title type='text'>The A PRIORI MR component</title><content type='html'>Association rule learning is a popular method for discovering interesting relations between variables in large databases. It is often used in market basket analysis, e.g. if a customer buys onions and potatoes, then he also buys beef. But, in fact, it can be applied in various areas where we want to discover associations between variables.&lt;br /&gt;&lt;br /&gt;We have already described the association rule mining tools of Tanagra in &lt;a href="http://data-mining-tutorials.blogspot.com/search/label/Association%20rules"&gt;several tutorials&lt;/a&gt;. The A PRIORI approach is certainly the most popular. But, despite its good properties, this method has a drawback: the number of rules obtained can be very high. The ability to highlight the most interesting rules, those which are relevant, becomes a major challenge.&lt;br /&gt;&lt;br /&gt;In this tutorial, we show how to implement the A PRIORI MR component. 
It differs from the others by offering additional tools for exploring and assessing the mined rules: original measures based on the “test value” principle allow us to evaluate the rules differently; the ability to copy the results into a spreadsheet allows a more detailed exploration of the rule base; by subdividing the dataset into train and test sets, we obtain more reliable values of the interestingness measures of rules.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: association rule, a priori algorithm, interestingness measure, test value principle&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: A PRIORI MR&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_APrioriMR_Component.pdf" target="_blank"&gt;en_Tanagra_APrioriMR_Component.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/credit_assoc.xls" target="_blank"&gt;credit_assoc.xls&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Reference&lt;/strong&gt;:&lt;br /&gt;Wikipedia, "&lt;a href="http://en.wikipedia.org/wiki/Association_rule_learning" target="_blank"&gt;Association rule learning&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-2122614456029825949?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/2122614456029825949'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/2122614456029825949'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/06/a-priori-mr-component.html' title='The A PRIORI MR component'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-5086959615464697845</id><published>2009-06-13T23:05:00.000-07:00</published><updated>2009-06-16T23:44:08.290-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Clustering'/><title type='text'>Two-step clustering for handling large databases</title><content type='html'>The aim of clustering is to identify homogeneous subgroups of instances in a population. In this tutorial, we implement a two-step clustering algorithm which is well-suited when we deal with a large dataset. It combines the ability of the K-Means clustering to handle a very large dataset, and the ability of the Hierarchical clustering (HCA – Hierarchical Cluster Analysis) to give a visual presentation of the results called a “dendrogram”. The dendrogram describes the clustering process, starting from unrefined clusters, until the whole dataset belongs to one cluster. It is especially helpful when we want to detect the appropriate number of clusters.&lt;br /&gt;&lt;br /&gt;The implementation of the two-step clustering (also called “&lt;a href="http://data-mining-tutorials.blogspot.com/2008/11/hac-and-hybrid-clustering.html"&gt;Hybrid Clustering&lt;/a&gt;”) under Tanagra is already described elsewhere. Following the recommendation of Lebart et al. (2000), we perform the clustering algorithm on the latent variables supplied by a PCA (Principal Component Analysis) computed from the original variables. This pre-treatment cleans the dataset by removing irrelevant information such as noise. In this tutorial, we show the efficiency of the approach on a large dataset with &lt;strong&gt;&lt;span style="color: rgb(51, 204, 0);"&gt;500,000 observations and 68 variables&lt;/span&gt;&lt;/strong&gt;. 
We use &lt;span style="color: rgb(153, 0, 0);"&gt;Tanagra 1.4.27&lt;/span&gt; and &lt;span style="color: rgb(153, 0, 0);"&gt;R 2.7.2&lt;/span&gt;, which are the only tools that allow us to implement the whole process easily.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: clustering, hierarchical cluster analysis, HCA, k-means, principal component analysis, PCA&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: PRINCIPAL COMPONENT ANALYSIS, K-MEANS, HAC, GROUP CHARACTERIZATION, EXPORT DATASET&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/en_Tanagra_CAH_Mixte_Gros_Volumes.pdf" target="_blank"&gt;en_Tanagra_CAH_Mixte_Gros_Volumes.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/%7Ericco/tanagra/fichiers/sample-census.zip" target="_blank"&gt;sample-census.zip&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;L. Lebart, A. Morineau, M. Piron, « Statistique Exploratoire Multidimensionnelle », Dunod, 2000 ; chapter 2, sections 2.3 and 2.4.&lt;br /&gt;D. 
Garson, "&lt;a href="http://faculty.chass.ncsu.edu/garson/PA765/cluster.htm" target="_blank"&gt;Cluster Analysis&lt;/a&gt;" from North Carolina State University.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-5086959615464697845?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5086959615464697845'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5086959615464697845'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/06/two-step-clustering-for-handling-large.html' title='Two-step clustering for handling large databases'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-5854170100391175797</id><published>2009-06-11T05:11:00.000-07:00</published><updated>2009-06-11T05:18:20.560-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Clustering'/><title type='text'>K-Means - Comparison of free tools</title><content type='html'>K-means is a clustering (unsupervised learning) algorithm. The aim is to create homogeneous subgroups of examples. The individuals in the same subgroup are similar; the individuals in different subgroups are as different as possible.&lt;br /&gt;&lt;br /&gt;The K-Means approach is already described in several tutorials (&lt;a href="http://data-mining-tutorials.blogspot.com/search?q=k-means"&gt;http://data-mining-tutorials.blogspot.com/search?q=k-means&lt;/a&gt;). 
The goal here is to compare its implementation in various free tools. We study the following tools: Tanagra 1.4.28; &lt;a href="http://www.r-project.org/" target="_blank"&gt;R 2.7.2&lt;/a&gt; without additional package; &lt;a href="http://www.knime.org/" target="_blank"&gt;Knime 1.3.5&lt;/a&gt;; &lt;a href="http://www.ailab.si/Orange/" target="_blank"&gt;Orange 1.0b2&lt;/a&gt; and &lt;a href="http://rapid-i.com/content/blogcategory/38/69/" target="_blank"&gt;RapidMiner&lt;/a&gt; Community Edition.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: clustering, k-means, PCA, principal component analysis, MDS, multidimensional scaling&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: PRINCIPAL COMPONENT ANALYSIS, K-MEANS, GROUP CHARACTERIZATION, EXPORT DATASET&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_et_les_autres_KMeans.pdf" target="_blank"&gt;en_Tanagra_et_les_autres_KMeans.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/cars_dataset.zip" target="_blank"&gt;cars_dataset.zip&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Reference&lt;/strong&gt;:&lt;br /&gt;D. 
Garson, "&lt;a href="http://faculty.chass.ncsu.edu/garson/PA765/cluster.htm" target="_blank"&gt;Cluster Analysis&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-5854170100391175797?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5854170100391175797'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5854170100391175797'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/06/k-means-comparison-of-free-tools.html' title='K-Means - Comparison of free tools'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-8326547432999274226</id><published>2009-05-29T21:21:00.000-07:00</published><updated>2009-05-29T21:28:36.718-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Clustering'/><category scheme='http://www.blogger.com/atom/ns#' term='Exploratory Data Analysis'/><title type='text'>Understanding the "test value" criterion</title><content type='html'>The test value (VT) is a criterion often used in various components of TANAGRA. It is mainly used for the characterization of a group of observations according to a continuous or categorical variable. The groups may be defined by categories from a discrete variable; they can also be computed by a machine learning algorithm (e.g. 
clustering, a node of a decision tree, etc.).&lt;br /&gt;&lt;br /&gt;The principle is elementary: we compare the value of a descriptive statistic computed on the whole sample with the value computed on the subsample related to the group. For a continuous variable, we compare the means; for a discrete one, we compare the proportions.&lt;br /&gt;&lt;br /&gt;Despite, or because of, its simplicity, the VT is very useful. The formulation that we present in this tutorial is taken from the book by Lebart et al. (2000). The VT is used extensively in some commercial software such as SPAD (&lt;a href="http://eng.spad.eu/"&gt;http://eng.spad.eu/&lt;/a&gt;). It allows us to characterize groups, but it can also be used to strengthen the interpretation of the factors extracted from a factorial analysis process.&lt;br /&gt;&lt;br /&gt;In this tutorial, we emphasize the formulas used for both categorical and continuous variables. We relate them to the results provided by the GROUP CHARACTERIZATION component of TANAGRA.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: test value, group characterization, clustering, factorial analysis&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: Group characterization&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Comprendre_La_Valeur_Test.pdf" target="_blank"&gt;en_Tanagra_Comprendre_La_Valeur_Test.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/heart_disease_male.xls" target="_blank"&gt;heart_disease_male.xls&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Reference&lt;/strong&gt;:&lt;br /&gt;L. Lebart, A. Morineau, M. 
Piron, « Statistique exploratoire multidimensionnelle », Dunod, 2000 ; pages 181 to 184.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-8326547432999274226?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8326547432999274226'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8326547432999274226'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/05/understanding-test-value-criterion.html' title='Understanding the &quot;test value&quot; criterion'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-7670428229899790262</id><published>2009-05-29T02:37:00.000-07:00</published><updated>2009-05-29T02:51:19.112-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Statistical methods'/><title type='text'>Descriptive statistics (continued)</title><content type='html'>The aim of descriptive statistics is to describe the main features of a collection of data in quantitative terms. Viewing the whole data table is seldom useful. It is preferable to summarize the characteristics of the data with some selected numerical indicators.&lt;br /&gt;&lt;br /&gt;In this tutorial, we distinguish two kinds of descriptive approaches: the univariate tools which summarize the characteristics of a variable individually; the bivariate tools which characterize the association between two variables. 
According to the type of the variables (categorical or continuous), we use different indicators.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: descriptive statistics&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: UNIVARIATE DISCRETE STAT, CONTINGENCY CHI-SQUARE, UNIVARIATE CONTINUOUS STAT, SCATTERPLT, LINEAR CORRELATION, GROUP CHARACTERIZATION&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Descriptive_Statistics.pdf" target="_blank"&gt;en_Tanagra_Descriptive_Statistics.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/enquete_satisfaction_femmes_1953.xls" target="_blank"&gt;enquete_satisfaction_femmes_1953.xls&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;Tanagra Tutorials, "&lt;a href="http://data-mining-tutorials.blogspot.com/2008/11/descriptive-statistics.html"&gt;Descriptive statistics&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-7670428229899790262?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/7670428229899790262'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/7670428229899790262'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/05/descriptive-statistics-continued.html' title='Descriptive statistics (continued)'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-8989223456643444641</id><published>2009-05-01T09:34:00.000-07:00</published><updated>2009-05-11T03:18:11.792-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Data file handling'/><title type='text'>ID3 on a large dataset</title><content type='html'>In the data mining domain, the increasing size of datasets has been one of the major challenges in recent years. The ability to handle large data sets is an important criterion to distinguish between research and commercial software.&lt;br /&gt;&lt;br /&gt;Commercial tools often have very efficient data management systems, limiting the amount of data loaded into memory at each step of the treatment. Research tools, by contrast, keep all data in memory. In this context, the limit is clearly the memory capacity of the machine. It is certainly a drawback for the treatment of large files. We note however that, nowadays, very powerful computers are available at low cost, so this limit keeps being pushed back. With an appropriate encoding strategy, we can fit the whole dataset in memory, even if we handle a large data file.&lt;br /&gt;&lt;br /&gt;In this tutorial, we show how to import a file with 581,012 observations and 55 variables, and then how to build a decision tree with the ID3 method. Unlike other decision tree algorithms such as C4.5 or CART, ID3 determines the right size of the tree with a pre-pruning rule. 
We will see that the computation is fast because of this characteristic.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: large dataset, decision tree algorithm, ID3&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: ID3, SPV LEARNING&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Big_Dataset.pdf" target="_blank"&gt;en_Tanagra_Big_Dataset.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/covtype.zip" target="_blank"&gt;covtype.zip&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;Tanagra tutorials, "&lt;a href="http://data-mining-tutorials.blogspot.com/2009/01/performance-comparison-under-linux.html"&gt;Performance comparison under Linux&lt;/a&gt;"&lt;br /&gt;Tanagra Tutorials, "&lt;a href="http://data-mining-tutorials.blogspot.com/2008/11/decision-tree-and-large-dataset.html"&gt;Decision tree and large dataset&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-8989223456643444641?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8989223456643444641'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8989223456643444641'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/05/id3-on-large-dataset.html' title='ID3 on a large dataset'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-5345401340098416990</id><published>2009-04-30T02:41:00.000-07:00</published><updated>2009-04-30T03:00:04.550-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Exploratory Data Analysis'/><title type='text'>Principal Component Analysis (PCA)</title><content type='html'>PCA belongs to the factor analysis approaches. It is used to discover the underlying structure of a set of variables. It reduces attribute space from a larger number of variables to a smaller number of factors (dimensions) and as such is a "non-dependent" procedure, i.e. it does not assume a dependent variable is specified.&lt;br /&gt;&lt;br /&gt;In this tutorial, we show how to implement this approach and how to interpret the results with Tanagra. We use the AUTOS_ACP.XLS dataset from Saporta’s reference book. The advantage of this dataset is that we can compare our results with those described in the book (pages 177 to 181). We simply show the sequence of operations and how to read the results tables in this tutorial. For the detailed interpretation, it is best to refer to the book.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: factor analysis, principal component analysis, correlation circle&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: Principal Component Analysis, View Dataset, Scatterplot with labels, View multiple scatterplot&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Acp.pdf" target="_blank"&gt;en_Tanagra_Acp.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/autos_acp.xls" target="_blank"&gt;autos_acp.xls&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;G. 
Saporta, " Probabilités, Analyse de données et Statistique ", Dunod, 2006 ; pages 177 to 181.&lt;br /&gt;D. Garson, "&lt;a href="http://faculty.chass.ncsu.edu/garson/PA765/factor.htm" target="_blank"&gt;Factor Analysis&lt;/a&gt;".&lt;br /&gt;Statsoft Textbook, "&lt;a href="http://www.statsoft.com/textbook/stfacan.html" target="_blank"&gt;Principal components and factor analysis&lt;/a&gt;".&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-5345401340098416990?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5345401340098416990'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5345401340098416990'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/04/principal-component-analysis-pca.html' title='Principal Component Analysis (PCA)'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-5740391324092332432</id><published>2009-04-30T01:16:00.000-07:00</published><updated>2009-04-30T01:25:13.898-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Exploratory Data Analysis'/><title type='text'>Multiple Correspondence Analysis (MCA)</title><content type='html'>The multiple correspondence analysis is a factor analysis approach. It deals with a tabular dataset where a set of examples are described by a set of categorical variables. The aim is to map the dataset in a reduced dimension space (usually two) which allows us to highlight the associations between the examples and the variables. 
It is useful for understanding the underlying structure of a tabular dataset.&lt;br /&gt;&lt;br /&gt;In this tutorial, we show how to implement this approach and how to interpret the results with Tanagra. The ability to copy and paste the results into a spreadsheet is certainly one of the most interesting functionalities of the software. Indeed, it gives us access to tools (sorting, formatting, etc.) in an environment well known to data processing experts. For example, the ability to sort the various tables according to the contributions and the COS2 proves really practical when one wishes to interpret the dimensions.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: factor analysis, multiple correspondence analysis&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: Multiple correspondance analysis, View Dataset, Scatterplot with labels, View multiple scatterplot&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Acm.pdf" target="_blank"&gt;en_Tanagra_Acm.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/races_canines_acm.xls" target="_blank"&gt;races_canines_acm.xls&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;M. Tenenhaus, " Méthodes statistiques en gestion ", Dunod, 1996 ; pages 212 to 222 (in French).&lt;br /&gt;Statsoft Inc., "&lt;a href="http://www.statsoft.com/textbook/stcoran.html#multiple" target="_blank"&gt;Multiple Correspondence Analysis&lt;/a&gt;".&lt;br /&gt;D. 
Garson, "Statnotes - &lt;a href="http://faculty.chass.ncsu.edu/garson/PA765/correspondence.htm" target="_blank"&gt;Correspondence Analysis&lt;/a&gt;".&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-5740391324092332432?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5740391324092332432'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5740391324092332432'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/04/multiple-correspondence-analysis-mca.html' title='Multiple Correspondence Analysis (MCA)'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-8226744122661027336</id><published>2009-04-26T01:02:00.000-07:00</published><updated>2009-04-26T01:13:19.663-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Regression analysis'/><title type='text'>Support Vector Regression (SVR)</title><content type='html'>Support Vector Machines (SVM) is a well-known approach in the machine learning community. It is usually implemented for a classification problem in a supervised learning framework. But SVM can also be used in a regression process, where we want to predict or explain the values taken by a continuous target attribute. We speak of Support Vector Regression in this context.&lt;br /&gt;&lt;br /&gt;The method is not widely used among statisticians. Yet it combines qualities that rank it favorably compared with existing techniques. 
It behaves well even when the ratio between the number of variables and the number of observations becomes very unfavorable, with highly correlated predictors. Another advantage is the kernel principle (the famous "kernel trick"). It is possible to construct a non-linear model without explicitly having to produce new descriptors. A deeper study of the characteristics of the method allows comparisons with penalized regression methods such as ridge regression.&lt;br /&gt;&lt;br /&gt;The first aim of this tutorial is to show how to use two new SVR components of the 1.4.31 version of Tanagra. They are based on the famous &lt;a href="http://www.csie.ntu.edu.tw/~cjlin/libsvm/" target="_blank"&gt;LIBSVM&lt;/a&gt; library. We use the same library for the classification (see &lt;a href="http://data-mining-tutorials.blogspot.com/2008/11/svm-using-libsvm-library.html"&gt;C-SVC component&lt;/a&gt;). We compare our results to those of the &lt;a href="http://cran.r-project.org/" target="_blank"&gt;R software&lt;/a&gt; (version 2.8.0). We use the &lt;a href="http://cran.r-project.org/web/packages/e1071/index.html" target="_blank"&gt;e1071&lt;/a&gt; package for R. It is also based on the LIBSVM library.&lt;br /&gt;&lt;br /&gt;The second aim is to present a new assessment component for regression. It is usual in the supervised learning framework to split the dataset into two parts, the first for the learning process, the second for its evaluation, in order to obtain an unbiased estimate of the performance. We can implement the same approach for regression. The procedure is even essential when we try to compare models with various complexities (or various degrees of freedom). We will see in this tutorial that the usual indicators calculated on the learning data are highly misleading in certain situations. 
We must use an independent test set when we want to assess a model.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: support vector regression, support vector machine, regression, linear regression, regression assessment, R software, package e1071&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: MULTIPLE LINEAR REGRESSION, EPSILON SVR, NU SVR, REGRESSION ASSESSMENT&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Support_Vector_Regression.pdf" target="_blank"&gt;en_Tanagra_Support_Vector_Regression.pdf &lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/qsar.zip" target="_blank"&gt;qsar.zip &lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References &lt;/strong&gt;:&lt;br /&gt;C.C. Chang, C.J. Lin, "&lt;a href="http://www.csie.ntu.edu.tw/~cjlin/libsvm/" target="_blank"&gt;LIBSVM - A Library for Support Vector Machines&lt;/a&gt;".&lt;br /&gt;S. Gunn, « &lt;a href="http://users.ecs.soton.ac.uk/srg/publications/pdf/SVM.pdf" target="_blank"&gt;Support Vector Machine for Classification and Regression &lt;/a&gt;», Technical Report of the University of Southampton, 1998.&lt;br /&gt;A. Smola, B. 
Scholkopf, « &lt;a href="http://eprints.pascal-network.org/archive/00002057/01/SmoSch03b.pdf" target="_blank"&gt;A tutorial on Support Vector Regression&lt;/a&gt; », 2003.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-8226744122661027336?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8226744122661027336'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8226744122661027336'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/04/support-vector-regression-svr.html' title='Support Vector Regression (SVR)'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-3985888956957546758</id><published>2009-04-23T00:34:00.000-07:00</published><updated>2009-04-23T00:42:31.913-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data file handling'/><title type='text'>Launching Tanagra from OOo Calc under Linux</title><content type='html'>The integration of Tanagra into a spreadsheet, such as Excel or Open Office Calc (OOo Calc or OOCalc), is undoubtedly an advantage. Without special knowledge about the database format, the user can handle the dataset in a familiar environment, the spreadsheet, and send it to specialized Data Mining tools when they want to carry out more sophisticated analyses.&lt;br /&gt;&lt;br /&gt;The add-on for OOCalc was initially created for Windows. Recently, I described the installation and use of Tanagra under Linux. 
The next step is of course the integration of Tanagra into OOCalc under Linux. Mr. Thierry Leiber did this work for the 1.4.31 version of Tanagra. He has extended the existing add-on. We can now launch Tanagra from OOCalc under either Windows or Linux. The add-on was tested under the following configurations: Windows XP + OOCalc 3.0.0; Windows Vista + OOCalc 3.0.1; Ubuntu 8.10 + OOCalc 2.4; Ubuntu 8.1 + OOCalc 3.0.1.&lt;br /&gt;&lt;br /&gt;This document extends a previous tutorial, but we now work under the Linux environment (Ubuntu 8.10). All the screenshots are in French because my OS is in French, but I think the process is the same for Linux with other language configurations.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: open office calc, add-on, principal component analysis, PCA, correlation circle, illustrative variable, linux, ubuntu 8.10 intrepid ibex&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: PRINCIPAL COMPONENT ANALYSIS, CORRELATION SCATTERPLOT&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_OOCalc_under_Linux.pdf" target="_blank"&gt;en_Tanagra_OOCalc_under_Linux.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/cereals.xls" target="_blank"&gt;cereals.xls&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;Tanagra, « &lt;a href="http://data-mining-tutorials.blogspot.com/2008/10/ooocalc-file-handling-using-add-in.html"&gt;Connection with Open Office Calc&lt;/a&gt; »&lt;br /&gt;Tanagra, « &lt;a href="http://data-mining-tutorials.blogspot.com/2009/01/tanagra-under-linux.html"&gt;Tanagra under Linux&lt;/a&gt; »&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-3985888956957546758?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' 
type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3985888956957546758'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/3985888956957546758'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/04/launching-tanagra-from-oocalc-under.html' title='Launching Tanagra from OOo Calc under Linux'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-4955018630663550474</id><published>2009-04-15T01:25:00.000-07:00</published><updated>2009-04-17T20:58:05.146-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Tanagra'/><title type='text'>Tanagra - Version 1.4.31</title><content type='html'>Thierry Leiber has improved the add-on making the connection between Tanagra and Open Office. It is now possible, under Linux, to install the add-on for Open Office and launch Tanagra directly after selecting the data (see the tutorials on installing &lt;a href="http://data-mining-tutorials.blogspot.com/2009/01/tanagra-under-linux.html"&gt;Tanagra under Linux&lt;/a&gt; and the integration of &lt;a href="http://data-mining-tutorials.blogspot.com/2008/10/ooocalc-file-handling-using-add-in.html"&gt;add-on in Open Office Calc&lt;/a&gt;). Thierry, thank you very much for this contribution which helps the users of Tanagra.&lt;br /&gt;&lt;br /&gt;Following a suggestion of Mr. Laurent Bougrain, the confusion matrix is added to the automatic saving of results in experiments. 
Thank you to Laurent, and to all the others who, by their constructive comments, help me improve Tanagra in the right direction.&lt;br /&gt;&lt;br /&gt;In addition, two new components for regression using the support vector machine principle (support vector regression) were added: Epsilon-SVR and Nu-SVR. A tutorial that presents these methods and compares our results with those of the R software will be available soon. Tanagra, like the R package "e1071", is based on the famous &lt;a href="http://www.csie.ntu.edu.tw/~cjlin/libsvm/" target="_blank"&gt;LIBSVM&lt;/a&gt; library.&lt;br /&gt;&lt;br /&gt;Tutorials about these releases are coming soon.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-4955018630663550474?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4955018630663550474'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4955018630663550474'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/04/tanagra-version-1431.html' title='Tanagra - Version 1.4.31'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-1019262174839808385</id><published>2009-03-18T23:17:00.000-07:00</published><updated>2009-03-18T23:22:42.212-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>Cost-sensitive learning - Comparison of tools</title><content type='html'>Everyone agrees that taking into consideration the 
misclassification costs is an important aspect of the practice of Data Mining. For instance, diagnosing a disease in a healthy person does not have the same consequences as predicting health for an ill person. Yet despite its importance, the topic is seldom addressed, both from a theoretical point of view i.e. how to integrate costs during the evaluation of models (easy) and their construction (a little less easy); and from the practical point of view i.e. how to implement the approach in software.&lt;br /&gt;&lt;br /&gt;Using the misclassification cost during the classifier evaluation is easy. We multiply the misclassification cost matrix and the confusion matrix element by element and sum the results. We obtain an "expected misclassification cost" (or an expected gain if we multiply the result by -1). Its interpretation is not very easy. It is mainly used for the comparison of models.&lt;br /&gt;&lt;br /&gt;Handling costs during the learning process is less usual. Several approaches are possible. In this tutorial, we show how to use some components of Tanagra intended for cost-sensitive supervised learning on a real (realistic) dataset. We also programmed the same procedures in the R software (http://www.r-project.org/) to make what is implemented more visible. We compare our results with those of Weka. The algorithm underlying our analysis is a decision tree. 
Depending on the software, we use C4.5, CART or J48.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: supervised learning, cost sensitive learning, misclassification cost matrix, decision tree algorithm, Weka 3.5.8, R 2.8.0, rpart package&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Cost_Sensitive_Learning.pdf" target="_blank"&gt;en_Tanagra_Cost_Sensitive_Learning.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/dataset-dm-cup-2007.zip" target="_blank"&gt;dataset-dm-cup-2007.zip&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;J.H. Chauchat, R. Rakotomalala, M. Carloz, C. Pelletier, "&lt;a href="http://www.informatik.uni-freiburg.de/~ml/ecmlpkdd/WS-Proceedings/w10/chauchat_workshop.pdf" target="_blank"&gt;Targeting Customer Groups using Gain and Cost Matrix: a Marketing Application&lt;/a&gt;", PKDD-2001.&lt;br /&gt;"&lt;a href="http://data-mining-tutorials.blogspot.com/2008/11/cost-sensitive-decision-trees.html"&gt;Cost-sensitive Decision Tree&lt;/a&gt;", Tutorials for Sipina.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-1019262174839808385?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1019262174839808385'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1019262174839808385'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/03/cost-sensitive-learning-comparison-of.html' title='Cost-sensitive learning - Comparison of tools'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-7877212175111381971</id><published>2009-02-25T22:17:00.000-08:00</published><updated>2009-02-26T19:51:15.724-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Association rules'/><title type='text'>Predictive association rules</title><content type='html'>The algorithms for association rule extraction were originally developed to find logical relations between variables with the same status. Predictive association rules instead seek to generate associations of items that characterize a dependent attribute. We are in a supervised learning framework.&lt;br /&gt;&lt;br /&gt;Basically, the algorithm is not really modified. Exploration is just limited to itemsets that include the dependent variable. The computation time is then reduced. Two components of Tanagra are dedicated to this task; these are SPV ASSOC RULE and SPV ASSOC TREE. They are available in the Association tab. Compared to conventional approaches, the components of Tanagra introduce an additional feature: we can specify the class value ("dependent variable = value") that we wish to predict. The benefit is that we can fine-tune the parameters of the algorithm according to the characteristics of the data. This is crucial for example when the prior probabilities of the dependent variable values are very different.&lt;br /&gt;&lt;br /&gt;We had already presented the component &lt;a href="http://data-mining-tutorials.blogspot.com/2008/11/supervised-association-rules.html"&gt;SPV ASSOC TREE&lt;/a&gt; elsewhere. But it was in the context of multivariate characterization of groups of individuals (from a clustering algorithm for instance). We compared it to the GROUP CHARACTERIZATION component. In this tutorial, we will compare the behavior of SPV ASSOC TREE and SPV ASSOC RULE during a prediction task. 
We will put forward their shared properties, the problems they can handle, and their differences. SPV ASSOC RULE, which supplies original rule interestingness measures ("&lt;a href="http://data-mining-tutorials.blogspot.com/2009/02/interestingness-measures-for.html"&gt;test value" indicator&lt;/a&gt;), has the ability to simplify the rule base.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: predictive association rules, interestingness measure, rule base ranking, rule base simplification&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: SPV ASSOC TREE, SPV ASSOC RULE&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Predictive_AssocRules.pdf" target="_blank"&gt;en_Tanagra_Predictive_AssocRules.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/credit_assoc.xls" target="_blank"&gt;credit_assoc.xls&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-7877212175111381971?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/7877212175111381971'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/7877212175111381971'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/02/predictive-association-rules.html' title='Predictive association rules'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-9078262773107578774</id><published>2009-02-22T01:25:00.000-08:00</published><updated>2009-02-25T22:17:01.637-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Association rules'/><title type='text'>Interestingness measures for association rules</title><content type='html'>This document outlines the measures for assessing association rules proposed by the A PRIORI MR and SPV ASSOC RULE components. They come from studies reported in several publications of A. Morineau and R. Rakotomalala.&lt;br /&gt;&lt;br /&gt;A measure characterizes the relevance of a rule. It can be used to rank the rules. It should also help to distinguish those that are "significantly interesting" from those that are irrelevant. This last point is still an open question. There is no really satisfactory solution at this time.&lt;br /&gt;&lt;br /&gt;The A PRIORI MR and the SPV ASSOC RULE components are experimental tools for the evaluation of the rules extracted by the association rule induction algorithm. They make it possible to evaluate the rules using measures based on the test value principle.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: association rules, interestingness measure, test value&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: A PRIORI MR, SPV ASSOC RULE&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_APrioriMR_Measures.pdf" target="_blank"&gt;en_Tanagra_APrioriMR_Measures.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;R. Rakotomalala, A. Morineau, 2008. 
“The TVpercent principle for the counterexamples statistic”, in &lt;em&gt;Statistical Implicative Analysis&lt;/em&gt;, Studies in Computational Intelligence Series, 127, 449-462, Springer, 2008 -- &lt;a href="http://www.springerlink.com/content/g245317206950529/" target="_blank"&gt;http://www.springerlink.com/content/g245317206950529/&lt;/a&gt;&lt;br /&gt;Wikipedia, "&lt;a href="http://en.wikipedia.org/wiki/Association_rule_learning" target="_blank"&gt;Association rule learning&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-9078262773107578774?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/9078262773107578774'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/9078262773107578774'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/02/interestingness-measures-for.html' title='Interestingness measures for association rules'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-1238927418646001310</id><published>2009-01-27T10:46:00.000-08:00</published><updated>2009-01-28T20:07:18.197-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>Performance comparison under Linux</title><content type='html'>The gain chart is an alternative to the confusion matrix for the evaluation of a classifier. Its name sometimes differs depending on the tool (e.g. 
lift curve, lift chart, cumulative gain chart, etc.).&lt;br /&gt;&lt;br /&gt;The main idea is to build a graph where the X coordinate is the percentage of the population and the Y coordinate is the percentage of the positive values of the class attribute. The gain chart is used mainly in the marketing domain where we want to detect potential customers, but it can be used in other situations.&lt;br /&gt;&lt;br /&gt;The construction of the gain chart is already outlined in a previous tutorial (see &lt;a href="http://data-mining-tutorials.blogspot.com/2008/11/lift-curve-coil-challenge-2000.html"&gt;http://data-mining-tutorials.blogspot.com/2008/11/lift-curve-coil-challenge-2000.html&lt;/a&gt;). In this tutorial, we extend the description to other data mining tools (Knime, RapidMiner, Weka and Orange). The second novelty of this tutorial is that we conduct the experiment under Linux (French version of Ubuntu 8.10 – see &lt;a href="http://data-mining-tutorials.blogspot.com/2009/01/tanagra-under-linux.html"&gt;http://data-mining-tutorials.blogspot.com/2009/01/tanagra-under-linux.html&lt;/a&gt; for the installation and the utilization of Tanagra under Linux). The third novelty is that we handle a large dataset with &lt;span style="color:#990000;"&gt;2,000,000 examples and 41 variables&lt;/span&gt;. It will be very interesting to study the behavior of these tools in this configuration, especially because our computer is not really powerful. 
We note that some tools failed the analysis on the complete dataset.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: scoring, linear discriminant analysis, naive bayes classifier, lift curve, gain chart, cumulative gain chart, knime, rapidminer, weka, orange&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: SAMPLING, LINEAR DISCRIMINANT ANALYSIS, SCORING, LIFT CURVE&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Gain_Chart.pdf" target="_blank"&gt;en_Tanagra_Gain_Chart.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/dataset_gain_chart.zip" target="_blank"&gt;dataset_gain_chart.zip&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-1238927418646001310?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1238927418646001310'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1238927418646001310'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/01/performance-comparison-under-linux.html' title='Performance comparison under Linux'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-2659174898009779145</id><published>2009-01-24T10:17:00.000-08:00</published><updated>2009-01-24T10:26:41.589-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><title 
type='text'>Sipina under Linux</title><content type='html'>In a recent &lt;a href="http://data-mining-tutorials.blogspot.com/2009/01/tanagra-under-linux.html"&gt;tutorial&lt;/a&gt;, we showed that it is possible to work with Tanagra under Linux using Wine. In this document, we use &lt;a href="http://eric.univ-lyon2.fr/~ricco/sipina" target="_blank"&gt;Sipina&lt;/a&gt; (data mining software intended for decision tree induction) within the same framework i.e. we install and use Sipina in a Linux environment. We use the Ubuntu distribution (French version 8.10). All the functionalities of Sipina are available, especially the interactive tools which allow us to explore in depth the subpopulation at a node of the tree.&lt;br /&gt;&lt;br /&gt;In this tutorial, we implement the following steps: (1) Installing Sipina under Linux; (2) Launching the software; (3) Loading a dataset (text file with tab separator); (4) Choosing the class attribute and the predictive variables; (5) Partitioning the dataset into a train set and a test set; (6) Computing the tree on the train set; (7) Evaluating the tree on the test set e.g. computing the confusion matrix, the error rate, etc.; (8) Exploring a subpopulation related to a node of the tree; (9) Launching a new analysis on a subpopulation related to a node of the tree.&lt;br /&gt;&lt;br /&gt;We describe the various features of the software only briefly in this tutorial. They are already presented in several documents available online (&lt;a href="http://eric.univ-lyon2.fr/~ricco/sipina.html" target="_blank"&gt;http://eric.univ-lyon2.fr/~ricco/sipina.html&lt;/a&gt;, see the DOWNLOAD section). 
Our main goal here is to show the capabilities of Sipina under Linux.&lt;br /&gt;&lt;br /&gt;We use the French Ubuntu 8.10 distribution; we have also installed Wine, a program which allows Windows programs to run under Linux.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: linux, ubuntu, wine, sipina, decision tree&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/doc/en_Sipina_under_Linux.pdf" target="_blank"&gt;en_Sipina_under_Linux.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;Ubuntu, &lt;a href="http://www.ubuntu.com/" target="_blank"&gt;http://www.ubuntu.com/&lt;/a&gt;&lt;br /&gt;Wine, &lt;a href="https://help.ubuntu.com/community/Wine" target="_blank"&gt;https://help.ubuntu.com/community/Wine&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-2659174898009779145?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/2659174898009779145'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/2659174898009779145'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/01/sipina-under-linux.html' title='Sipina under Linux'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-4799220337499273941</id><published>2009-01-12T22:33:00.000-08:00</published><updated>2009-01-12T22:53:05.751-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><title type='text'>Tanagra under Linux</title><content 
type='html'>Users sometimes ask "Can I use Tanagra under Linux?" The answer is YES and NO.&lt;br /&gt;&lt;br /&gt;NO, we cannot run Tanagra natively under Linux. It is a 32-bit program for Windows.&lt;br /&gt;&lt;br /&gt;But &lt;strong&gt;&lt;span style="color:#cc0000;"&gt;YES, we can run Tanagra under Linux using WINE&lt;/span&gt;&lt;/strong&gt;, a well-known Linux application which allows us to run Windows programs on Linux. We can then take full advantage of Tanagra without worrying about compatibility.&lt;br /&gt;&lt;br /&gt;In this tutorial, we show how to install and run Tanagra under Ubuntu (a free Linux distribution) using WINE. We can fully use Tanagra in the Linux environment.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords:&lt;/strong&gt; linux, ubuntu, wine&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_under_Linux.pdf" target="_blank"&gt;en_Tanagra_under_Linux.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;Ubuntu, &lt;a href="http://www.ubuntu.com/" target="_blank"&gt;http://www.ubuntu.com/&lt;/a&gt;&lt;br /&gt;Wine, &lt;a href="https://help.ubuntu.com/community/Wine" target="_blank"&gt;https://help.ubuntu.com/community/Wine&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-4799220337499273941?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4799220337499273941'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4799220337499273941'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2009/01/tanagra-under-linux.html' title='Tanagra under 
Linux'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-4364396627072449055</id><published>2008-12-26T00:34:00.000-08:00</published><updated>2008-12-26T00:39:20.311-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>Logistic regression - Software comparison</title><content type='html'>Logistic regression is a popular supervised learning method.&lt;br /&gt;&lt;br /&gt;There are several reasons for this. The theoretical foundation of the method is attractive. It falls within the framework of generalized regression. Thus logistic regression is a well-identified variant which one can implement according to the kind of dependent variable (class attribute). Its predictive performance is comparable to that of other approaches. Furthermore, it provides some indicators for the interpretation of the results. Among them, the well-known odds ratio makes it possible to identify precisely the contribution of each predictor.&lt;br /&gt;&lt;br /&gt;Logistic regression is available in many free tools. In this tutorial, we compare the implementation of this technique with Tanagra 1.4.27, R 2.7.2 (GLM command), Orange 1.0b2, Weka 3.5.6, and the package RWeka 0.3-13 for R. 
Beyond the comparison, this tutorial is also an opportunity to show how to carry out the following sequence of operations with these tools: importing an ARFF file (Weka file format); splitting the data into learning and test sets; computing the predictive model on the learning set; testing the model on the test set; selecting the relevant variables using a criterion consistent with logistic regression; evaluating again the performance of the simplified model.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: logistic regression, supervised learning, software comparison&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: BINARY LOGISTIC REGRESSION, SUPERVISED LEARNING, TEST, DISCRETE SELECT EXAMPLES&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Perfs_Reg_Logistique.pdf" target="_blank"&gt;en_Tanagra_Perfs_Reg_Logistique.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/wave_2_classes_with_irrelevant_attributes.zip" target="_blank"&gt;wave_2_classes_with_irrelevant_attributes.zip&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;D. 
Garson, "&lt;a href="http://faculty.chass.ncsu.edu/garson/PA765/logistic.htm" target="_blank"&gt;Logistic Regression&lt;/a&gt;"&lt;br /&gt;Wikipedia, "&lt;a href="http://en.wikipedia.org/wiki/Logistic_regression" target="_blank"&gt;Logistic Regression&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-4364396627072449055?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4364396627072449055'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4364396627072449055'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2008/12/logistic-regression-software-comparison.html' title='Logistic regression - Software comparison'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-4277159662482597321</id><published>2008-12-23T01:15:00.000-08:00</published><updated>2008-12-23T01:19:16.617-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Association rules'/><title type='text'>Association rule mining - Software comparison</title><content type='html'>This document extends a previous tutorial dedicated to the comparison of various implementations of association rules mining (&lt;a href="http://data-mining-tutorials.blogspot.com/2008/10/association-rule-learning.html"&gt;http://data-mining-tutorials.blogspot.com/2008/10/association-rule-learning.html&lt;/a&gt;). We had analyzed Tanagra, Orange and Weka. 
We extend the comparison here to R, RapidMiner and Knime.&lt;br /&gt;&lt;br /&gt;We handle an attribute-value dataset. This is not the usual data format for association rule mining, where the "native" format is instead the transactional database. We see in this tutorial that some of the tools can automatically recode the data. Others require an explicit transformation. Thus, we must find the right components and the correct sequence of treatments to produce the transactional data format. Depending on the software, the process is not always easy.&lt;br /&gt;&lt;br /&gt;The tools studied in this tutorial are: Tanagra 1.4.28, R 2.7.2 (arules package 0.6-6), Orange 1.0b2, RapidMiner Community Edition, Knime 1.3.5 and Weka 3.5.6. These programs load the data and perform the calculations in memory. When the size of the database increases, the real bottleneck is the memory available on our personal computer.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: association rule, frequent itemset&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: A PRIORI, A PRIORI PT&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Assoc_Rules_Comparison.pdf" target="_blank"&gt;en_Tanagra_Assoc_Rules_Comparison.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/credit-german.zip" target="_blank"&gt;credit-german.zip&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;R. 
Rakotomalala, « &lt;a href="http://eric.univ-lyon2.fr/~ricco/cours/supports_data_mining.html#association" target="_blank"&gt;Règles d’association&lt;/a&gt; »&lt;br /&gt;Wikipedia, "&lt;a href="http://en.wikipedia.org/wiki/Association_rule_learning" target="_blank"&gt;Association rule learning&lt;/a&gt;"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-4277159662482597321?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4277159662482597321'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/4277159662482597321'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2008/12/association-rule-mining-software.html' title='Association rule mining - Software comparison'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-2254319807531827093</id><published>2008-12-20T02:05:00.000-08:00</published><updated>2008-12-20T02:11:24.653-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Clustering'/><title type='text'>K-Means - Classification of a new instance</title><content type='html'>The deployment is an important step of the Data Mining framework. In the case of a clustering, after the construction of clusters with a learning algorithm, we want to determine to which particular cluster (group) a new unlabelled instance belongs.&lt;br /&gt;&lt;br /&gt;In this tutorial, we use the K-Means algorithm. 
We assign each new instance to the closest group, using the distance to the group centers. The method is appropriate because the technique used to assign a group in the deployment phase is consistent with the learning algorithm. This is not true if we use another learning algorithm, e.g. HAC (hierarchical agglomerative clustering) with the single linkage aggregation rule. The distance to the group centers is inadequate in this context. Thus, the classification strategy must be consistent with the learning strategy.&lt;br /&gt;&lt;br /&gt;All the descriptors are discrete in our dataset. The K-Means algorithm does not directly handle this kind of data. We must transform them. We use a multiple correspondence analysis algorithm.&lt;br /&gt;&lt;br /&gt;In this tutorial, we compare the results of Tanagra 1.4.28 and R 2.7.2.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: data clustering, k-means, multiple correspondence analysis, factorial analysis, clusters interpretation, data exportation&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: MULTIPLE CORRESPONDENCE ANALYSIS, K-MEANS, GROUP CHARACTERIZATION, CONTINGENCY CHI-SQUARE, EXPORT DATASET&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_KMeans_Deploiement.pdf" target="_blank"&gt;en_Tanagra_KMeans_Deploiement.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/banque_classif_deploiement.zip" target="_blank"&gt;banque_classif_deploiement.zip&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;br /&gt;Wikipedia (en), « &lt;a href="http://en.wikipedia.org/wiki/K-means_algorithm" target="_blank"&gt;K-Means algorithm&lt;/a&gt; ».&lt;br /&gt;F. Husson, S. Lê, J. Josse, J. 
Mazet, « &lt;a href="http://factominer.free.fr/" target="_blank"&gt;FactoMineR&lt;/a&gt; – A package dedicated to Factor Analysis and Data Mining with R ».&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-2254319807531827093?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/2254319807531827093'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/2254319807531827093'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2008/12/k-means-classification-of-new-instance.html' title='K-Means - Classification of a new instance'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-8390368243502377409</id><published>2008-11-13T01:16:00.000-08:00</published><updated>2008-11-13T01:21:58.703-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>Decision tree and large dataset</title><content type='html'>Dealing with large datasets is one of the most important challenges of data mining. In this context, it is interesting to analyze and compare the performance of various free implementations of the learning methods, especially the computation time and the memory occupation. Most of the programs load the whole dataset into memory. 
The main bottleneck is the available memory.&lt;br /&gt;&lt;br /&gt;In this tutorial, we compare the performance of several implementations of the C4.5 algorithm (Quinlan, 1993) when processing a file containing 500,000 observations and 22 variables. The programs used are: Knime 1.3.5; Orange 1.0b2; R (rpart package) 2.6.0; RapidMiner Community Edition; Sipina Research; Tanagra 1.4.27; Weka 3.5.6.&lt;br /&gt;&lt;br /&gt;Our data file is a well-known artificial dataset described in the CART book (Breiman et al., 1984). We have generated a dataset with 500,000 observations. The class attribute has 3 values, and there are 21 continuous predictors.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: c4.5, decision tree, classification tree, large dataset, knime, orange, r, rapidminer, sipina, tanagra, weka&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: SUPERVISED LEARNING, C4.5&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Perfs_Comp_Decision_Tree.pdf" target="_blank"&gt;en_Tanagra_Perfs_Comp_Decision_Tree.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/wave500k.zip" target="_blank"&gt;wave500k.zip&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Reference&lt;/strong&gt;: R. 
Quinlan, « C4.5 : Programs for Machine Learning », Morgan Kaufmann, 1993.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-8390368243502377409?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8390368243502377409'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/8390368243502377409'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2008/11/decision-tree-and-large-dataset.html' title='Decision tree and large dataset'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-6387845628072692266</id><published>2008-11-10T21:39:00.000-08:00</published><updated>2008-11-10T21:45:32.766-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Software Comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><title type='text'>Decision tree and cross validation (continued)</title><content type='html'>In a &lt;a href="http://data-mining-tutorials.blogspot.com/2008/10/learning-classification-tree-software.html"&gt;previous tutorial&lt;/a&gt;, we compared the implementation of decision tree induction and cross-validation performance evaluation in three programs: TANAGRA, ORANGE and WEKA.&lt;br /&gt;&lt;br /&gt;In this paper, we extend the same framework to the comparison of three new programs: R 2.7.2, KNIME 1.3.51 and RAPIDMINER Community Edition.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: 
supervised learning, decision tree, classification tree, classifier assessment&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: Supervised learning, C-RT, Cross validation&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Validation_Croisee_Suite.pdf" target="_blank"&gt;en_Tanagra_Validation_Croisee_Suite.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/heart.zip" target="_blank"&gt;heart.zip&lt;/a&gt;&lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/heart.txt" target="_blank"&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-6387845628072692266?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6387845628072692266'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/6387845628072692266'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2008/11/decision-tree-and-cross-validation.html' title='Decision tree and cross validation (continued)'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-708086023930514582</id><published>2008-11-09T23:48:00.000-08:00</published><updated>2008-11-09T23:51:32.784-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><title 
type='text'>Interactive induction of decision tree</title><content type='html'>Interactive induction of decision trees with SIPINA.&lt;br /&gt;&lt;br /&gt;Various functionalities of SIPINA are not documented. In this tutorial, we show how to explore the nodes of a decision tree, in order to obtain a better understanding of the characteristics of the subpopulation on a node. This is an important task, for instance when we want to validate the rules with a domain expert.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: decision tree, classification tree, interactive analysis&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/doc/en_sipina_interactive.pdf" target="_blank"&gt;en_sipina_interactive.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/dataset/blood_pressure_levels.xls" target="_blank"&gt;blood_pressure_levels.xls&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-708086023930514582?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/708086023930514582'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/708086023930514582'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2008/11/interactive-induction-of-decision-tree.html' title='Interactive induction of decision tree'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-1300820465897562045</id><published>2008-11-09T23:43:00.000-08:00</published><updated>2008-11-09T23:46:43.784-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><title type='text'>Decision tree and contextual descriptive statistics</title><content type='html'>SIPINA provides some descriptive statistics functionalities. In itself, this is not really exceptional; a large number of free tools do that.&lt;br /&gt;&lt;br /&gt;It becomes more interesting when we combine these tools with the decision tree. The exploratory phase is improved. Indeed, every node of the tree corresponds to a subpopulation. The variables which do not appear in the tree are not necessarily irrelevant. Perhaps some of them were hidden during the tree learning, which selects the “best” variables. 
By computing contextual descriptive statistics, in connection with each node, we better understand the prediction rules highlighted during the induction process.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: descriptive statistics, decision tree, interactive exploration&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/doc/en_sipina_descriptive_statistics.pdf" target="_blank"&gt;en_sipina_descriptive_statistics.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/dataset/heart_disease_male.xls" target="_blank"&gt;heart_disease_male.xls&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-1300820465897562045?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1300820465897562045'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/1300820465897562045'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2008/11/decision-tree-and-contextual.html' title='Decision tree and contextual descriptive statistics'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-5234908511404244119</id><published>2008-11-09T23:35:00.000-08:00</published><updated>2008-11-09T23:42:34.738-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Decision tree'/><category scheme='http://www.blogger.com/atom/ns#' term='Supervised Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Sipina'/><title 
type='text'>Cost-sensitive Decision Tree</title><content type='html'>Error rate evaluation is a key point of the induction process. A usual approach is to partition the dataset into a learning set, which is used for the induction of the classification model, and a test set, which is used for the performance evaluation.&lt;br /&gt;&lt;br /&gt;The first subject of this tutorial is to show how to partition the dataset with SIPINA. Then, we build the tree on the first part of the dataset. Later, we classify the examples of the second part of the dataset. We compare the predicted value and the true value. We obtain an honest error rate estimate.&lt;br /&gt;&lt;br /&gt;The second main subject of this document is to show how to take into account the misclassification costs during the learning process and the evaluation process. We use a slightly modified version of C4.5 (Quinlan, 1993).&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: decision trees, C4.5, classifier evaluation, cost-sensitive learning, F-Measure, spam detection&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/doc/en_sipina_cost_sensitive.pdf" target="_blank"&gt;en_sipina_cost_sensitive.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/dataset/spam.xls" target="_blank"&gt;spam.xls&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-5234908511404244119?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5234908511404244119'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/5234908511404244119'/><link rel='alternate' type='text/html' 
href='http://data-mining-tutorials.blogspot.com/2008/11/cost-sensitive-decision-trees.html' title='Cost-sensitive Decision Tree'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-9088007003676927455</id><published>2008-11-09T23:16:00.000-08:00</published><updated>2008-11-09T23:30:26.635-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Statistical methods'/><title type='text'>Semi-partial correlation</title><content type='html'>The semi-partial correlation measures the additional information that an independent variable (X) provides, beyond one or several control variables (Z1,...,Zp), for the explanation of a dependent variable (Y).&lt;br /&gt;&lt;br /&gt;We can compute the semi-partial correlation in various ways. The square of the semi-partial correlation can be obtained as the difference between the square of the multiple correlation coefficient of the regression Y / X,Z1,...,Zp (including X) and the same quantity for the regression Y / Z1,...,Zp (without X).&lt;br /&gt;&lt;br /&gt;We can also obtain the semi-partial correlation by computing the residuals of the regression X / Z1,...,Zp; then, we compute the correlation between Y and these residuals. In other words, we seek to quantify the relationship between X and Y, by removing the effect of Z on the latter. 
The semi-partial correlation is an asymmetrical measure.&lt;br /&gt;&lt;br /&gt;In this tutorial, we show the different ways for computing the semi-partial correlation.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: correlation, Pearson's correlation, semi-partial correlation, multiple linear regression&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: LINEAR CORRELATION, MULTIPLE LINEAR REGRESSION, SEMI-PARTIAL CORRELATION&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Semi_Partial_Correlation.pdf" target="_blank"&gt;en_Tanagra_Semi_Partial_Correlation.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/cars_semi_partial_correlation.xls" target="_blank"&gt;cars_semi_partial_correlation.xls&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Reference&lt;/strong&gt;: M. Brannick, « &lt;a href="http://luna.cas.usf.edu/~mbrannic/files/regression/Partial.html" target="_blank"&gt;Partial and Semipartial Correlation&lt;/a&gt; », University of South Florida.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-9088007003676927455?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/9088007003676927455'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/9088007003676927455'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2008/11/semi-partial-correlation.html' title='Semi-partial correlation'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5496815755861370799.post-265754778257497397</id><published>2008-11-09T23:09:00.000-08:00</published><updated>2008-11-09T23:15:20.287-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Statistical methods'/><title type='text'>Partial correlation</title><content type='html'>Partial correlation measures the degree of association between two random variables, with the effect of a set of controlling variables removed.&lt;br /&gt;&lt;br /&gt;In this tutorial, we show how to use the PARTIAL CORRELATION component of Tanagra. We reproduce the example described online (see Reference). Thus, in addition to the presentation of the theoretical method, we can trace the detail of all the calculations that we carry out.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: correlation, Pearson's correlation, rank correlation, Spearman's rho, partial correlation&lt;br /&gt;&lt;strong&gt;Components&lt;/strong&gt;: LINEAR CORRELATION, SPEARMAN’S RHO, PARTIAL CORRELATION&lt;br /&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Partial_Correlation.pdf" target="_blank"&gt;en_Tanagra_Partial_Correlation.pdf&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/wechsler_adult_intelligence_scale.xls" target="_blank"&gt;wechsler_adult_intelligence_scale.xls&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Reference&lt;/strong&gt;: S. Rathbun, A. 
Wiesner, « STAT 505 – Applied Multivariate Statistical Analysis », The Pennsylvania State University, &lt;a href="http://www.stat.psu.edu/online/development/stat505/07_partcor/01_partcor_intro.html" target="_blank"&gt;Lesson 7 : Partial Correlations&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5496815755861370799-265754778257497397?l=data-mining-tutorials.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/265754778257497397'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5496815755861370799/posts/default/265754778257497397'/><link rel='alternate' type='text/html' href='http://data-mining-tutorials.blogspot.com/2008/11/partial-correlation.html' title='Partial correlation'/><author><name>Tanagra</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry></feed>
