Monday, August 30, 2010

Connecting Sipina and Excel using OLE

The connection between a data mining tool and Excel (and more generally spreadsheet) is a very important issue. We had addressed many times this topic in our tutorials. With hindsight, I think the solution based on add-ins for Excel is the best one, both for SIPINA and for TANAGRA. It is simple, reliable and highly efficient. It does not require developing specific versions. The connection with Excel is a simple additional functionality of the standard distribution.

Prior to reaching this solution, we had explored different trails. In this tutorial, we present the XL-SIPINA software based on Microsoft's OLE technology. At the opposite of the add-in solution, this version of SIPINA chooses to embed Excel into the Data Mining tool. The system works rather well. Nevertheless, it has finally been dropped for two reasons: (1) we were forced to compile special versions that work only if Excel is installed on the user's machine; (2) the transferring time between Excel and Sipina using OLE is prohibitive when the database size grows.

Thus, XL-SIPINA is essentially an attempt short-lived. There is always a bit of nostalgia when I am back on solutions I have explored, and I have finally abandoned. Can be also I have not completely explored this solution.

Last, the application was initially developed for Office 97. I note that it still up to date today, it works fine with Office 2010.

Keywords: excel, tableur, sipina, xls, xlsx, xl-sipina, decision tree induction
Tutorial: en_xls_sipina.pdf
Dataset: autos

Friday, August 27, 2010

Sipina add-in for Excel

The data importation is a bottleneck for Data Mining Tools. The majority of users are working with a spreadsheet tool such as Excel, mainly in the coupling with specialized software for data mining (see KDnuggets polls). Therefore, a recurring issue for users is "how to send my data from Excel to SIPINA?"

It is possible to import different types of formats into SIPINA. About Excel workbooks, one particular device has been implemented.

An add-in is automatically copied to the computer during the installation process. It must be integrated into Excel. The add-in incorporates a new menu into Excel. After selecting the data range, the user only has to activate it, this leads to the following: (1) SIPINA starts automatically, (2) the data are transferred via the clipboard and (3) SIPINA considers the first row of the range of cells corresponds to the names of variables, (4) columns with numerical values of the variables are quantitative (5) columns with alphanumeric values are categorical variables.

Unlike the other tutorials, the sequence of manipulations is described in a video. The description is right only for the versions up to Excel 2003. Another tutorial about the using of the add-in under Office 2007 and Office 2010 is described below.

Keywords: excel file format, add-in, decision tree
Installing the add-in : sipina_xla_installation.htm
Using the add-in: sipina_xla_processing.htm

Tanagra add-in for Office 2007 and Office 2010

The "tanagra.xla" add-in for Excel contributes to the wide diffusion of Tanagra. The principle is simple. It is to embed a Tanagra menu in Excel. Thus the user can run statistical calculations without having to leave the spreadsheet. It seems simplistic. But this feature facilitates immensely the work of data miner. Indeed, the spreadsheet is one of the most used tools for preparing dataset (see KDNuggets Polls: Tools / Languages for Data Cleaning - 2008). By embedding the data mining tool in the spreadsheet environment, it avoids to the practitioner the tedious and repetitive manipulations: importing the dataset, exporting the dataset, checking the compatibilities between data file formats, etc.

The installation and the use of the "tanagra.xla" add-in under the previous versions of Office are described elsewhere (Office 1997 to Office 2003). This description is obsolete for the latest version of Office because the organization of the menus is modified for these versions i.e. Office 2007 and Office 2010. And yet, the add-in is still operational. In this tutorial, we show how to install and to use the Tanagra add-in under Office 2007 and 2010.

This transition to recent versions of Excel is absolutely not without consequences. Indeed, compared to the previous Excel versions, Excel 2007 (and 2010) and can handle more important rows and columns. We can process a dataset up to 1,048,575 observations (the first line corresponds to the variable names) and 16,384 variables. In this tutorial, we will treat a database with 100,000 observations and 22 variables (wave100k.xlsx). This is a version of the famous waveform database. Note that this file, because of the number of rows, cannot be manipulated by earlier versions of Excel.

The process described in this document is also valid for the SIPINA add-in (sipina.xla).

Keywords: data importation, excel, add-in
Components: VIEW DATASET
Tutorial: en_Tanagra_Add_In_Excel_2007_2010.pdf
Dataset: wave100k.xlsx
Tanagra, "Tanagra and Sipina add-ins for Excel 2016", June 2016.
Tanagra, "Excel file handling using an add-in".
Tanagra, "OOo Calc file handling using an add-in".
Tanagra, "Launching Tanagra from OOo Calc under Linux".
Tanagra, "Sipina add-in for Excel"