Sunday, April 22, 2012

Pentaho Data Integration - Kettle

The Pentaho BI Suite is an open source Business Intelligence suite with integrated reporting, dashboard, data mining, workflow and ETL capabilities (http://en.wikipedia.org/wiki/Pentaho).

In this tutorial, we discuss the Pentaho BI Suite Community Edition (CE), which is freely downloadable. More precisely, we present Pentaho Data Integration (PDI-CE), also called Kettle. We briefly show how to load a dataset and perform a simple data analysis. The main goal of this tutorial is to prepare for a subsequent one, which focuses on deploying models designed with Knime, Sipina or Weka by using PDI-CE.

This document is based on the 4.0.1 stable version of PDI-CE.

Keywords: ETL, pentaho data integration, community edition, kettle, BI, business intelligence, data importation, data transformation, data cleansing
Tutorial: PDI-CE
Dataset: titanic32x.csv.zip
References:
Pentaho, Pentaho Community

Monday, April 9, 2012

Mining frequent itemsets

Searching for regularities in a dataset is the main goal of data mining. These regularities may have various representations. In market basket analysis, we search for co-occurrences of goods (items), i.e. goods which are often purchased together. They are called “frequent itemsets”. For instance, one result may be "milk and bread are purchased together in 10% of shopping carts".
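To make the notion of a frequent itemset concrete, here is a minimal Python sketch. The baskets, the function names and the 40% threshold are all invented for illustration; they are not the tutorial's dataset, and the brute-force enumeration below is not the Apriori algorithm used by Tanagra. It only shows what "support" means and how a minimum support threshold selects itemsets.

```python
from itertools import combinations

# Hypothetical toy baskets, invented for illustration only.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "jam"},
    {"milk", "bread", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of all itemsets reaching `min_support`.

    Acceptable for a handful of items only: the number of candidates
    grows exponentially with the number of distinct items.
    """
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[candidate] = s
    return result


freq = frequent_itemsets(baskets, min_support=0.4)
for itemset, s in freq.items():
    print(itemset, s)
```

With these invented baskets, milk and bread appear together in 3 of the 5 baskets, so the itemset {bread, milk} has a support of 60% and is kept.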

Frequent itemset mining is often presented as the preliminary step of association rule learning. At the end of that process, we highlight the direction of the relation and obtain rules. For instance, a rule may be "90% of the customers who buy milk and bread also purchase butter". This kind of rule can be used in various ways. For instance, we can promote the sales of milk and bread in order to increase the sales of butter.
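The percentage attached to such a rule is usually called its confidence: among the baskets containing the antecedent, the share that also contain the consequent. The following Python sketch computes it on invented baskets (the data and the function names are assumptions made for illustration) and shows that the value depends on the direction of the rule.

```python
# Hypothetical toy baskets, invented for illustration only.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"butter", "jam"},
    {"butter"},
    {"bread", "jam"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent, transactions):
    """conf(A -> C) = count(A and C together) / count(A)."""
    base = count(antecedent, transactions)
    if base == 0:
        raise ValueError("the antecedent never occurs in the transactions")
    both = count(set(antecedent) | set(consequent), transactions)
    return both / base


# Same items, two directions, two different confidences.
conf_forward = confidence({"milk", "bread"}, {"butter"}, baskets)
conf_backward = confidence({"butter"}, {"milk", "bread"}, baskets)
print(conf_forward, conf_backward)
```

Here the rule "milk and bread imply butter" holds in 2 of the 3 baskets containing milk and bread, whereas the reverse rule holds in only 2 of the 4 baskets containing butter.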

In fact, frequent itemsets also provide valuable information on their own. Detecting the goods which are purchased together helps us to understand the relation between them. It is a kind of variant of cluster analysis: we search for the items which come together. For instance, we can use this kind of information to reorganize the shelves of the store.

In this tutorial, we describe the use of the FREQUENT ITEMSETS component in Tanagra. It is based on Borgelt's “apriori.exe” program. We use a very small dataset, so that everyone can reproduce the calculations by hand. But first, we give some definitions related to the frequent itemset mining process.

Keywords: frequent itemsets, closed itemsets, maximal itemsets, generator itemsets, association rules, R software, arules package
Components: FREQUENT ITEMSETS
Tutorial: en_Tanagra_Itemset_Mining.pdf
Dataset: itemset_mining.zip
References:
C. Borgelt, "A priori - Association Rule Induction / Frequent Item Set Mining"
R. Lovin, "Mining Frequent Patterns"

Sunday, April 1, 2012

Sipina add-on for OOCalc

Combining a spreadsheet with data mining tools is essential for the popularity of the latter. Indeed, when we deal with a moderate-sized dataset (thousands of rows and tens of variables), the spreadsheet is a practical tool for data preparation. It is also a valuable tool for preparing reports. It is thus not surprising that Excel, and spreadsheets in general, are among the tools most used by data miners.

Both Tanagra and Sipina provide an add-on for Excel. The add-on inserts a data mining menu into the spreadsheet. The user can select the dataset and send it to Tanagra (or Sipina), which is launched automatically. However, until now only Tanagra provided an add-on for Open Office Calc and Libre Office Calc; none was available for Sipina.

This omission has been corrected in the new version of Sipina (Sipina 3.9). In this tutorial, we show how to install and use the “SipinaLibrary.oxt” add-on for Open Office Calc 3.3.0 (OOCalc). The process is the same for Libre Office 3.5.1.

Keywords: calc, open office, libre office, oocalc, add-on, add-in, sipina
Tutorial: en_sipina_calc_addon.pdf
Dataset: heart.xls
References:
Tutoriel Tanagra - Sipina add-in for Excel
Tutoriel Tanagra - Tanagra add-on for Open Office Calc 3.3
Open Office - http://www.openoffice.org
Libre Office - http://www.libreoffice.org/