Saturday, June 30, 2012

SAS Add-In 4.3 for Excel

Connecting a data mining tool to a spreadsheet application such as Excel is a valuable feature. We benefit from the power of the former, and from the popularity and ease of use of the latter. Many people use a spreadsheet during the data preparation phase. I recently presented an add-in that connects R and Excel. In this document, I describe a similar tool for SAS.

SAS is a popular tool, well known among statisticians. But it is not easy to use for non-specialists: we must know the command syntax before performing a statistical analysis. With the SAS add-in for Excel, some of these drawbacks are alleviated: we do not need to load and organize the dataset into a SAS library; we do not need to know the command syntax to perform an analysis and set its parameters (we use menus and dialog boxes instead); and the results are automatically inserted into a new sheet of an Excel workbook, which makes post-processing easy.

In this tutorial, I describe the behavior of the add-in for various kinds of analyses (nonparametric statistics, logistic regression). We compare the results with those of Tanagra.

Keywords: excel, sas, add-on, add-in, logistic regression, nonparametric test
Components: MANN-WHITNEY COMPARISON, KRUSKAL-WALLIS 1-WAY ANOVA, MEDIAN TEST, VAN DER WAERDEN 1-WAY ANOVA, ANSARI-BRADLEY SCALE TEST, KLOTZ SCALE TEST, MOOD SCALE TEST
Tutorial: en_Tanagra_SAS_AddIn_4_3_for_Excel.pdf
Dataset: scoring_dataset.xls
References:
SAS - http://www.sas.com/
SAS - "SAS Add-in for Microsoft Office"
Tanagra Tutorial - "Tanagra Add-In for Office 2007 and Office 2010"

Tuesday, June 12, 2012

Tanagra - Version 1.4.45

New features for the principal component analysis (PCA).

PRINCIPAL COMPONENT ANALYSIS. Additional outputs for the component: a scree plot and a cumulative explained-variance curve; PCA Correlation Matrix - outputs for detecting the significant factors (Kaiser-Guttman, Karlis-Saporta-Spinaki, Legendre-Legendre broken-stick test); PCA Correlation Matrix - Bartlett's sphericity test and Kaiser's measure of sampling adequacy (MSA); PCA Correlation Matrix - the correlation matrix and the partial correlations between each pair of variables, controlling for all other variables (the negative anti-image correlations).
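
To make two of these detection rules concrete, here is a minimal Python sketch of the Kaiser-Guttman rule and of the broken-stick rule, applied to a list of eigenvalues sorted in decreasing order. It is an illustration only, not Tanagra's code: the function names are hypothetical, and the stopping rule used for the broken-stick test (keep factors until the first one that falls below its threshold) is an assumption that may differ from what the component reports.

```python
def kaiser_guttman(eigenvalues):
    """Flag the factors whose eigenvalue exceeds the mean eigenvalue.

    For a PCA on a correlation matrix, the mean eigenvalue is 1.
    """
    mean = sum(eigenvalues) / len(eigenvalues)
    return [ev > mean for ev in eigenvalues]


def broken_stick_thresholds(p):
    """Broken-stick thresholds for p factors: b_k = sum_{i=k..p} 1/i.

    The thresholds sum to p, like the eigenvalues of a correlation matrix.
    """
    return [sum(1.0 / i for i in range(k, p + 1)) for k in range(1, p + 1)]


def broken_stick_count(eigenvalues):
    """Count the leading factors whose eigenvalue exceeds its threshold.

    Assumption: we stop at the first factor that fails the comparison.
    Eigenvalues are rescaled so that they sum to p before the comparison.
    """
    p = len(eigenvalues)
    total = sum(eigenvalues)
    count = 0
    for ev, threshold in zip(eigenvalues, broken_stick_thresholds(p)):
        if ev * p / total <= threshold:
            break
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical eigenvalues of a 4-variable correlation matrix (sum = 4).
    eigenvalues = [2.4, 0.9, 0.5, 0.2]
    print(kaiser_guttman(eigenvalues))
    print(broken_stick_thresholds(len(eigenvalues)))
    print(broken_stick_count(eigenvalues))
```

The broken-stick thresholds depend only on the number of variables, so they can be tabulated once and compared with the eigenvalue table.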

PARALLEL ANALYSIS. The component calculates the distribution of eigenvalues for a set of randomly generated data. It proceeds by randomization. It applies to principal component analysis and to multiple correspondence analysis. A factor is considered significant if its observed eigenvalue is greater than the 95th percentile of this distribution (this setting can be modified).
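
The principle can be sketched in a few lines of Python with NumPy. This is a simplified illustration for the PCA case, not the component's actual implementation: I assume that the randomization shuffles the values of each variable independently, and the function names, the number of replications and the seed are arbitrary choices.

```python
import numpy as np


def correlation_eigenvalues(data):
    """Eigenvalues of the correlation matrix, sorted in decreasing order."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]


def parallel_analysis(data, n_replications=200, percentile=95.0, seed=0):
    """Compare observed eigenvalues with those obtained on randomized data.

    Assumption: each column is permuted independently, which destroys the
    links between variables while keeping their marginal distributions.
    """
    rng = np.random.default_rng(seed)
    n_vars = data.shape[1]
    observed = correlation_eigenvalues(data)
    simulated = np.empty((n_replications, n_vars))
    for r in range(n_replications):
        shuffled = np.column_stack(
            [rng.permutation(data[:, j]) for j in range(n_vars)]
        )
        simulated[r] = correlation_eigenvalues(shuffled)
    # One threshold per factor rank: the requested percentile of the
    # eigenvalues observed on the randomized datasets.
    thresholds = np.percentile(simulated, percentile, axis=0)
    return observed, thresholds, observed > thresholds


if __name__ == "__main__":
    # Synthetic data: 5 variables driven by a single common factor.
    rng = np.random.default_rng(1)
    factor = rng.normal(size=(300, 1))
    data = factor + 0.5 * rng.normal(size=(300, 5))
    observed, thresholds, significant = parallel_analysis(data)
    for k in range(len(observed)):
        print(k + 1, round(observed[k], 3), round(thresholds[k], 3), significant[k])
```

The `percentile` argument plays the role of the 95th percentile setting mentioned above.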

BOOTSTRAP EIGENVALUES. The component calculates confidence intervals for the eigenvalues using a bootstrap approach. A factor is considered significant if its eigenvalue is greater than a threshold that depends on the underlying factor method (PCA or MCA), or if the lower bound of its confidence interval is greater than the upper bound of the interval of the following factor. The default confidence level (0.90) can be modified. This component can be applied to principal component analysis or to multiple correspondence analysis.
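
Here is a rough Python sketch of the idea for the PCA case, using NumPy. It is not the component's code. Several points are assumptions: I use percentile confidence intervals, I take 1.0 as the threshold for a PCA on the correlation matrix, and I compare the lower bound of the interval (rather than the eigenvalue itself) with this threshold. The function names are hypothetical.

```python
import numpy as np


def correlation_eigenvalues(data):
    """Eigenvalues of the correlation matrix, sorted in decreasing order."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]


def bootstrap_eigenvalues(data, n_replications=200, confidence=0.90, seed=0):
    """Percentile bootstrap confidence intervals for each eigenvalue."""
    rng = np.random.default_rng(seed)
    n_rows, n_vars = data.shape
    boot = np.empty((n_replications, n_vars))
    for r in range(n_replications):
        # Draw n_rows observations with replacement.
        rows = rng.integers(0, n_rows, size=n_rows)
        boot[r] = correlation_eigenvalues(data[rows])
    alpha = (1.0 - confidence) / 2.0
    lower = np.percentile(boot, 100.0 * alpha, axis=0)
    upper = np.percentile(boot, 100.0 * (1.0 - alpha), axis=0)
    return lower, upper


def flag_factors(lower, upper, threshold=1.0):
    """Apply the two rules to the confidence intervals.

    threshold=1.0 is an assumption suited to PCA on a correlation matrix.
    """
    above_threshold = lower > threshold
    # Factor k is separated from factor k+1 if the intervals do not overlap.
    # The last factor has no successor, so it is never flagged by this rule.
    separated = np.append(lower[:-1] > upper[1:], False)
    return above_threshold, separated


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    factor = rng.normal(size=(300, 1))
    data = factor + 0.5 * rng.normal(size=(300, 5))
    lower, upper = bootstrap_eigenvalues(data)
    above_threshold, separated = flag_factors(lower, upper)
    for k in range(len(lower)):
        print(k + 1, round(lower[k], 3), round(upper[k], 3),
              above_threshold[k], separated[k])
```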

JITTERING. A jittering feature has been added to the scatter plot components (SCATTERPLOT, CORRELATION SCATTERPLOT, SCATTERPLOT WITH LABEL, VIEW MULTIPLE SCATTERPLOT).
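
Jittering adds a small random perturbation to the coordinates so that overlapping points become distinguishable on the plot. A minimal Python sketch of the idea follows. The uniform noise, the `amount` parameter (a fraction of the range of the variable) and the function name are assumptions for illustration; the release note does not say how Tanagra computes the perturbation.

```python
import random


def jitter(values, amount=0.02, seed=None):
    """Return a copy of values with small uniform noise added to each one.

    The noise is drawn in [-width, +width], where width is a fraction
    (amount) of the range of the values. The original data is untouched.
    """
    rng = random.Random(seed)
    span = max(values) - min(values)
    # Fall back to the raw amount if all values are identical.
    width = amount * span if span > 0 else amount
    return [v + rng.uniform(-width, width) for v in values]


if __name__ == "__main__":
    # Discrete values: several points share the same coordinate.
    x = [1, 1, 1, 2, 2, 3]
    print(jitter(x, amount=0.02, seed=1))
```

The perturbation only affects the display coordinates; the underlying dataset should be left unchanged.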

RANDOM FOREST. Unused memory is now released after the decision tree learning process. This is especially useful with ensemble learning approaches that store a large number of trees in memory (BAGGING, BOOSTING, RANDOM FOREST). Memory usage is reduced and the computing capacity is improved.

Download page: setup