1.1
Robust Regression Diagnostics
A Graduate Course Presented at the Faculty of Economics and Political
Sciences, Cairo University
Professor Ali S. Hadi
The American University in Cairo
and Cornell University
www.ilr.cornell.edu/~hadi
Copyright © 2017 by Ali S. Hadi
1.2
Regression Analysis
Input → Computer → Output

Input: Data, Model, Fitting Method, Assumptions
Output: Estimated Parameters, Test Statistics, Graphs, Tables
We would like to know how sensitive the output
is to small perturbations in the input.
1.3
Motivating Example 1
New York Rivers Data:
In a 1976 study on land use and water quality in New York rivers, the total nitrogen content was used as a measure of water quality in the 20 New York State river basins.
1.4
New York Rivers
1. Olean          2. Cassadaga
3. Oatka          4. Neversink
5. Hackensack     6. Wappinger
7. Fishkill       8. Honeoye
9. Susquehanna    10. Chenango
11. Tioughnioga   12. West Canada
13. East Canada   14. Saranac
15. Ausable       16. Black
17. Schoharie     18. Raquette
19. Oswegatchie   20. Cohocton
See map of NY State
1.5
Variables Used
Active Agriculture (X1): percentage of land area currently in agricultural use
Forest (X2): percentage of land area in forest
Residential (X3): percentage of residential land area
Commercial/Industrial (X4): percentage of land area in either commercial or manufacturing use
Total Nitrogen (Y): mean concentration (mg/liter)
based on samples taken at regular intervals during the spring, summer, and fall months
1.6
River  X1  X2  X3    X4    Y
1      26  63  1.2   0.29  1.10
2      29  57  0.7   0.09  1.01
3      54  26  1.8   0.58  1.90
4       2  84  1.9   1.98  1.00
5       3  27  29.4  3.11  1.99
6      19  61  3.4   0.56  1.42
7      16  60  5.6   1.11  2.04
8      40  43  1.3   0.24  1.65
9      28  62  1.1   0.15  1.01
10     26  60  0.9   0.23  1.21
11     26  53  0.9   0.18  1.33
12     15  75  0.7   0.16  0.75
13      6  84  0.5   0.12  0.73
14      3  81  0.8   0.35  0.80
15      2  89  0.7   0.35  0.76
16      6  82  0.5   0.15  0.87
17     22  70  0.9   0.22  0.80
18      4  75  0.4   0.18  0.87
19     21  56  0.5   0.13  0.66
20     40  49  1.1   0.13  1.25
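For the sketches later in this section, here is a transcription of the table above into numpy arrays. The array names X and y are our own, not from the slides.

```python
import numpy as np

# New York Rivers data transcribed from the table above.
# Columns of X: X1 (agriculture %), X2 (forest %), X3 (residential %),
# X4 (commercial/industrial %); y is Total Nitrogen (mg/liter).
X = np.array([
    [26, 63, 1.2, 0.29], [29, 57, 0.7, 0.09], [54, 26, 1.8, 0.58],
    [2, 84, 1.9, 1.98],  [3, 27, 29.4, 3.11], [19, 61, 3.4, 0.56],
    [16, 60, 5.6, 1.11], [40, 43, 1.3, 0.24], [28, 62, 1.1, 0.15],
    [26, 60, 0.9, 0.23], [26, 53, 0.9, 0.18], [15, 75, 0.7, 0.16],
    [6, 84, 0.5, 0.12],  [3, 81, 0.8, 0.35],  [2, 89, 0.7, 0.35],
    [6, 82, 0.5, 0.15],  [22, 70, 0.9, 0.22], [4, 75, 0.4, 0.18],
    [21, 56, 0.5, 0.13], [40, 49, 1.1, 0.13],
])
y = np.array([1.10, 1.01, 1.90, 1.00, 1.99, 1.42, 2.04, 1.65, 1.01, 1.21,
              1.33, 0.75, 0.73, 0.80, 0.76, 0.87, 0.80, 0.87, 0.66, 1.25])
```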
1.7
Regression Summary
                 Observation Deleted
T-value    None      5        4        7
t0          1.40    2.08     1.21     1.77
t1          0.39    0.25     0.92     0.68
t2         -0.93   -1.45    -0.74    -1.13
t3         -0.21    4.08    -3.15     0.08
t4          1.86    0.66     4.45     1.83

where $t_j = \hat{\beta}_j / \mathrm{s.e.}(\hat{\beta}_j)$, $j = 0, 1, \ldots, 4$.
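A table like this can be reproduced by refitting the model with one observation left out at a time. The following numpy-only sketch assumes the X and y arrays transcribed after the table on slide 1.6; the function name and printing format are illustrative, not part of the original slides.

```python
import numpy as np

def t_values(X, y):
    """OLS t-values (intercept first) for the regression of y on X."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])      # prepend intercept column
    p = Xd.shape[1]
    XtX_inv = np.linalg.inv(Xd.T @ Xd)
    beta = XtX_inv @ Xd.T @ y                  # least squares estimates
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (n - p)           # residual mean square
    se = np.sqrt(sigma2 * np.diag(XtX_inv))    # s.e. of each estimate
    return beta / se                           # t_j = beta_j / s.e.(beta_j)

print("None:", np.round(t_values(X, y), 2))    # full-data column
for river in (5, 4, 7):                        # observations deleted above
    keep = np.arange(X.shape[0]) != river - 1  # drop one row (0-based)
    print(river, np.round(t_values(X[keep], y[keep]), 2))
```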
1.8
Motivating Example 2
Homicides Data:
This data set comes from a study investigating
the role of firearms in accounting for the
rising homicide rate in Detroit. The data
cover the years 1961–1973.
1.9
Variables Used
FTP: # of full-time police per 100,000 population
UEMP: % of the population unemployed
MAN: # of manufacturing workers (in thousands)
LIC: # of handgun licenses issued per 100,000 population
CLEAR: Percent of homicides cleared by arrest
WM: # of white males in the population
GOV: # of government workers (in thousands)
HOM: # of homicides per 100,000 population
1.10
Estimated Coefficient (T-value)
Coef.    Model 1         Model 2         Model 3
Const.   199.31 (2.4)    252.59 (11.1)   -20.05 (-1.5)
MAN      -0.13 (-4.5)    -0.14 (-5.4)    -0.09 (-2.8)
WM       -0.00 (-2.7)    -0.00 (-15.9)
GOV      0.10 (0.7)                      0.51 (12.0)
1.11
Model Selection Criteria
Minimum Residual Mean Square (RMS):

$$\hat{\sigma}^2 = \frac{\text{SSE}}{n - p},$$

where $\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the residual
sum of squares, $n$ is the number of
observations, and $p$ is the number of
regression coefficients.
1.12
Model Selection Criteria
Maximum R-Square:

$$R^2 = 1 - \frac{\text{SSE}}{\text{SST}},$$

where $\text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of
squares.

Note: Not good for comparing models with
different numbers of predictors.
1.13
Model Selection Criteria
Maximum Adjusted R-Square:

$$R_a^2 = 1 - \frac{\text{SSE}/(n - p)}{\text{SST}/(n - 1)},$$

where $\text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of
squares.

Note: The sums of squares are adjusted for
their degrees of freedom, which imposes a
penalty for including insignificant variables.
1.14
Model Selection Criteria
Mallows C-p: For a model with $p$ predictors,

$$C_p = \frac{\mathbf{Y}^T(\mathbf{I} - \mathbf{P})\mathbf{Y}}{\hat{\sigma}^2} - (n - 2p),$$

where $\hat{\sigma}^2$ is a good estimate of $\sigma^2$ (usually
obtained from the full model).

Note: The above are standard, well-known
criteria, used to judge the adequacy of fit and
to guide variable selection procedures.
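All four criteria on slides 1.11–1.14 can be computed from a single OLS fit. A minimal numpy sketch, assuming X is an n×k predictor matrix, y the response, and sigma2_full the residual mean square of the full model; all names here are illustrative.

```python
import numpy as np

def selection_criteria(X, y, sigma2_full):
    """RMS, R^2, adjusted R^2, and Mallows Cp for one candidate model."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    p = Xd.shape[1]                                 # regression coefficients
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sse = resid @ resid                             # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)               # total sum of squares
    rms = sse / (n - p)                             # slide 1.11
    r2 = 1 - sse / sst                              # slide 1.12
    r2_adj = 1 - (sse / (n - p)) / (sst / (n - 1))  # slide 1.13
    cp = sse / sigma2_full - (n - 2 * p)            # slide 1.14
    return {"RMS": rms, "R2": r2, "AdjR2": r2_adj, "Cp": cp}
```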
1.15
Variable Selection Methods
Backward Elimination: Start with the full model, then delete the least significant variable (the one with the smallest T-value or largest p-value).
Repeat until all regression coefficients in the model are significant.
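A hedged sketch of this procedure, using statsmodels p-values as the significance measure. It assumes a pandas DataFrame df holding the response (here labeled "HOM" after the Homicides Data) and the candidate predictors; the threshold alpha and all names are assumptions for illustration.

```python
import statsmodels.api as sm

def backward_eliminate(df, response="HOM", alpha=0.05):
    """Drop the least significant predictor until all are significant."""
    predictors = [c for c in df.columns if c != response]
    while predictors:
        fit = sm.OLS(df[response], sm.add_constant(df[predictors])).fit()
        pvals = fit.pvalues.drop("const")   # p-values of the slopes only
        worst = pvals.idxmax()              # least significant variable
        if pvals[worst] <= alpha:           # everything is significant
            break
        predictors.remove(worst)            # delete it and refit
    return predictors
```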
1.16
Variable Selection Methods
Forward Selection: Start with the empty model, then add the most significant variable (the one with the largest T-value or smallest p-value).
Repeat until all candidate variables to enter the model have insignificant regression coefficients.
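A companion sketch of forward selection under the same assumptions (DataFrame df, response column "HOM", illustrative names):

```python
import statsmodels.api as sm

def forward_select(df, response="HOM", alpha=0.05):
    """Add the most significant candidate until none would be significant."""
    selected = []
    candidates = [c for c in df.columns if c != response]
    while candidates:
        pvals = {}                           # p-value each candidate would get
        for c in candidates:
            cols = selected + [c]
            fit = sm.OLS(df[response], sm.add_constant(df[cols])).fit()
            pvals[c] = fit.pvalues[c]
        best = min(pvals, key=pvals.get)     # most significant candidate
        if pvals[best] > alpha:              # nothing significant remains
            break
        selected.append(best)
        candidates.remove(best)
    return selected
```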
1.17
Variable Selection Methods
Stepwise Method: A combination of the Backward and Forward methods.
Other Methods: See any textbook on regression analysis.
Let us apply some of these methods to the Homicides Data.
1.18
Backward Elimination Method
Variable Removed   RMS ($\hat{\sigma}^2$)   Adjusted $R_a^2$
None               9                        0.97
GOV                9                        0.97
MAN                30                       0.89
WM                 268                      0.69

Accordingly, GOV is the least important variable.
1.19
Forward Selection Method
Variable Added   RMS ($\hat{\sigma}^2$)   Adjusted $R_a^2$
GOV              24                       0.91
MAN              15                       0.94
WM               9                        0.97

Accordingly, GOV is the most important variable.
1.20
Reasons for Inconsistency
[Plots: GOV versus MAN, and GOV versus WM]
1.21
Summary
Conclusions drawn from fitted models
that are highly sensitive to a particular
data point, a particular variable, or a
particular assumption should be treated
cautiously.
1.22
Course Outline
1. Motivating Examples
2. Selected References
3. Review of Least Squares (LS)
Regression Analysis
4. The Iterative Nature of Regression
Analysis
5. The Projection Matrix and its Properties
1.23
Course Outline
6. Sensitivity of the LS fit with Respect to:
• Variables (column sensitivity)
• Observations (row sensitivity)
• Errors of Measurements
• Probability Law of Errors
1.24
Course Outline
7. Robust Regression and Outlier Detection:
• The Brute Force Method
• The LMS Method
• The LAV Method
• The BACON Approach
• The RIRLS Method
1.25
Selected References: Selected Books
• Birkes, D. and Dodge, Y. (1993), Alternative
Methods of Regression, New York: Wiley.
• Chatterjee, S. and Hadi, A.S. (1988), Sensitivity
Analysis in Linear Regression, New York: Wiley.
• Chatterjee, S. and Hadi, A. S. (2006), Regression
Analysis by Example, Fifth Edition, New York:
Wiley.
• Rousseeuw, P. J. and Leroy, A. (1987), Robust
Regression and Outlier Detection, New York:
Wiley.
1.26
Selected References: Selected Articles
• Gould, W. and Hadi, A. S. (1993), “Identifying
Multivariate Outliers,” Stata Technical Bulletin,
11, 2–5.
• Hadi, A. S. (1992), “Identifying Multiple Outliers in
Multivariate Data,” Journal of the Royal Statistical
Society, Series (B), 54, No. 3, 761–771.
• Hadi, A. S. (1992), “A New Measure of Overall
Potential Influence in Linear Regression,”
Computational Statistics and Data Analysis, 14, 1–
27.
1.27
Selected References: Selected Articles
• Hadi, A. S. (1994), “A Modification of a Method for
the Detection of Outliers in Multivariate
Samples,” Journal of the Royal Statistical Society,
Series (B), 56, 393–396.
• Hadi, A. S. and Simonoff, J. S. (1993), “Procedures
for the Identification of Multiple Outliers in
Linear Models,” Journal of the American
Statistical Association, 88, 1264–1272.
1.28
Selected References: Selected Articles
• Hadi, A. S. and Simonoff, J. S. (1994), “Improving
the Estimation and Outlier Identification
Properties of the Least Median of Squares and
Minimum Volume Ellipsoid Estimators,”
Parisankhyan Sammikkha, 1, 61–70.
• Hadi, A. S. and Simonoff, J. S. (1997), “A More
Robust Outlier Identifier for Regression Data,”
Bulletin of the International Statistical Institute,
281–282.
• Munier, S. (1999), “Multiple Outlier Detection in
Logistic Regression,” Student, 3, 117–126.
1.29
Course Outline
1. Motivating Examples
2. Selected References
3. Review of Least Squares (LS)
Regression Analysis
4. The Iterative Nature of Regression
Analysis
5. The Projection Matrix and its Properties