Wednesday, December 9, 2009

Outliers and influential points in regression

The analysis of outliers and influential points is an important step of the regression diagnostics. The goal is to detect (1) the points which are very different to the others (outliers) i.e. they seem do not belong to the analyzed population; or (2) the points that if they are removed (influential points), leads us to a different model. The distinction between these kinds of points is not always obvious.

In this tutorial, we implement several indicators for the analysis of outliers and influential points. To avoid confusion about the definitions of indicators (some indicators are calculated differently from one tool to another), we compare our results with state-of-the-art tool such as SAS and R. In a first step, we give the results described into the SAS documentation. In a second step, we describe the process and the results under Tanagra and R. In conclusion, we note that these tools give the same results.

Keywords: linear regression, outliers, influential points, standardized residuals, studentized residuals, leverage, dffits, cook's distance, covratio, dfbetas, R software
Components: Multiple linear regression, Outlier detection, DfBetas
Tutorial: en_Tanagra_Outlier_Influential_Points_for_Regression.pdf
Dataset: USPopulation.xls
References:
SAS STAT User’s Guide, « The REG Procedure – Predicted and Residual Values »