

1.1

Robust Regression Diagnostics

A Graduate Course Presented at the Faculty of Economics and Political Sciences, Cairo University

Professor Ali S. Hadi

The American University in Cairo

and Cornell University

www.ilr.cornell.edu/~hadi

[email protected]

[email protected]

Copyright © 2017 by Ali S. Hadi

1.2

Regression Analysis

Input → Computer → Output

Input: Data; Model; Fitting Method; Assumptions

Output: Estimated Parameters; Test Statistics; Graphs; Tables

We would like to know how sensitive the output is to small perturbations in the input.


1.3

Motivating Example 1

New York Rivers Data:

In a 1976 study on land use and water quality in New York rivers, the total nitrogen content was used as a measure of water quality in the 20 New York State river basins.

1.4

New York Rivers

1. Olean
2. Cassadaga
3. Oatka
4. Neversink
5. Hackensack
6. Wappinger
7. Fishkill
8. Honeoye
9. Susquehanna
10. Chenango
11. Tioughnioga
12. West Canada
13. East Canada
14. Saranac
15. Ausable
16. Black
17. Schoharie
18. Raquette
19. Oswegatchie
20. Cohocton

See map of NY State


1.5

Variables Used

Active Agriculture (X1): percentage of land area currently in agricultural use

Forest (X2): percentage of land area in forest

Residential (X3): percentage of residential land area

Commercial/Industrial (X4): percentage of land area in either commercial or industrial use

Total Nitrogen (Y): mean concentration (mg/liter), based on samples taken at regular intervals during the spring, summer, and fall months

1.6

River   X1   X2    X3     X4     Y
  1     26   63    1.2   0.29   1.10
  2     29   57    0.7   0.09   1.01
  3     54   26    1.8   0.58   1.90
  4      2   84    1.9   1.98   1.00
  5      3   27   29.4   3.11   1.99
  6     19   61    3.4   0.56   1.42
  7     16   60    5.6   1.11   2.04
  8     40   43    1.3   0.24   1.65
  9     28   62    1.1   0.15   1.01
 10     26   60    0.9   0.23   1.21
 11     26   53    0.9   0.18   1.33
 12     15   75    0.7   0.16   0.75
 13      6   84    0.5   0.12   0.73
 14      3   81    0.8   0.35   0.80
 15      2   89    0.7   0.35   0.76
 16      6   82    0.5   0.15   0.87
 17     22   70    0.9   0.22   0.80
 18      4   75    0.4   0.18   0.87
 19     21   56    0.5   0.13   0.66
 20     40   49    1.1   0.13   1.25


1.7

Regression Summary

              Observation Deleted
T-value    None      5       4       7
t0         1.40    2.08    1.21    1.77
t1         0.39    0.25    0.92    0.68
t2        -0.93   -1.45   -0.74   -1.13
t3        -0.21    4.08   -3.15    0.08
t4         1.86    0.66    4.45    1.83

$$t_j = \frac{\hat{\beta}_j}{\mathrm{s.e.}(\hat{\beta}_j)}, \quad j = 0, 1, 2, 3, 4.$$
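The deletion experiment summarized in the Regression Summary can be sketched in Python. This is only an illustration, assuming NumPy and an ordinary least squares fit with an intercept; the names `DATA` and `t_values` are hypothetical and do not come from the course. The numbers are transcribed from the data table on slide 1.6.

```python
import numpy as np

# New York Rivers data (slide 1.6): columns are X1, X2, X3, X4, Y.
DATA = np.array([
    [26, 63, 1.2, 0.29, 1.10], [29, 57, 0.7, 0.09, 1.01],
    [54, 26, 1.8, 0.58, 1.90], [2, 84, 1.9, 1.98, 1.00],
    [3, 27, 29.4, 3.11, 1.99], [19, 61, 3.4, 0.56, 1.42],
    [16, 60, 5.6, 1.11, 2.04], [40, 43, 1.3, 0.24, 1.65],
    [28, 62, 1.1, 0.15, 1.01], [26, 60, 0.9, 0.23, 1.21],
    [26, 53, 0.9, 0.18, 1.33], [15, 75, 0.7, 0.16, 0.75],
    [6, 84, 0.5, 0.12, 0.73], [3, 81, 0.8, 0.35, 0.80],
    [2, 89, 0.7, 0.35, 0.76], [6, 82, 0.5, 0.15, 0.87],
    [22, 70, 0.9, 0.22, 0.80], [4, 75, 0.4, 0.18, 0.87],
    [21, 56, 0.5, 0.13, 0.66], [40, 49, 1.1, 0.13, 1.25],
])


def t_values(data, deleted=None):
    """Return t_j = beta_hat_j / s.e.(beta_hat_j) for j = 0..4.

    `deleted` is the 1-based river number to drop before fitting
    (None keeps all observations).
    """
    if deleted is not None:
        data = np.delete(data, deleted - 1, axis=0)
    y = data[:, -1]
    X = np.column_stack([np.ones(len(y)), data[:, :-1]])  # add intercept
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)                       # residual mean square
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta / se


for deleted in (None, 5, 4, 7):
    print(deleted, np.round(t_values(DATA, deleted), 2))
```

Each printed row corresponds to one column of the Regression Summary table; if the transcription of the data is faithful, the values should be comparable to those on the slide.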

1.8

Motivating Example 2

Homicides Data:

This data set is a result of a study investigating the role of firearms in accounting for the rising homicide rate in Detroit. The data is for the years 1961-1973.


1.9

Variables Used

FTP: # of full-time police per 100,000 population

UEMP: % of the population unemployed

MAN: # of manufacturing workers (in thousands)

LIC: # of handgun licenses issued per 100,000 population

CLEAR: Percent of homicides cleared by arrest

WM: # of white males in the population

GOV: # of government workers (in thousands)

HOM: # of homicides per 100,000 population

1.10

Estimated Coefficient (T-value)

Coef.     Model 1          Model 2           Model 3
Const.    199.31 (2.4)     252.59 (11.1)     -20.05 (-1.5)
MAN       -0.13 (-4.5)     -0.14 (-5.4)      -0.09 (-2.8)
WM        -0.00 (-2.7)     -0.00 (-15.9)
GOV        0.10 (0.7)                         0.51 (12.0)


1.11

Model Selection Criteria

Minimum Residual Mean Square (RMS):

$$\hat{\sigma}^2 = \frac{SSE}{n - p},$$

where $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the residual sum of squares, n is the number of observations, and p is the number of regression coefficients.

1.12

Model Selection Criteria

Maximum R-Square:

$$R^2 = 1 - \frac{SSE}{SST},$$

where $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares.

Note: Not good for comparing models with different numbers of predictors.


1.13

Model Selection Criteria

Maximum Adjusted R-Square:

$$R_a^2 = 1 - \frac{SSE / (n - p)}{SST / (n - 1)},$$

where $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares.

Note: The sums of squares are adjusted for their degrees of freedom. This imposes a penalty for including insignificant variables.
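The penalty mentioned in the note can be made explicit by rewriting the adjusted R-square in terms of the ordinary R-square. The following short derivation uses only the definitions on the two preceding slides (SSE, SST, n, p):

```latex
R_a^2 \;=\; 1 - \frac{SSE/(n-p)}{SST/(n-1)}
      \;=\; 1 - \frac{n-1}{n-p}\cdot\frac{SSE}{SST}
      \;=\; 1 - \frac{n-1}{n-p}\,\bigl(1 - R^2\bigr).
```

Since $(n-1)/(n-p) \ge 1$ whenever $p \ge 1$, the adjusted value never exceeds $R^2$, and adding a coefficient raises $R_a^2$ only if the resulting drop in SSE outweighs the loss of one degree of freedom.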

1.14

Model Selection Criteria

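Mallows $C_p$: For a model with p predictors,

$$C_p = \frac{Y^T (I - P) Y}{\hat{\sigma}^2} + (2p - n),$$

where $\hat{\sigma}^2$ is a good estimate of $\sigma^2$ (usually obtained from the full model) and P is the projection matrix of the model under consideration, so that $Y^T (I - P) Y$ is its SSE.

Note: The above are standard, well-known criteria used to judge the adequacy of fit and to guide variable selection procedures.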
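The four criteria above can be computed together from a single least squares fit. The sketch below is an illustration only, assuming NumPy; the function name `selection_criteria` and the simulated data are hypothetical. It also assumes that p counts all regression coefficients, including the intercept, as on the RMS slide.

```python
import numpy as np


def selection_criteria(X, y, sigma2_full):
    """Return RMS, R-square, adjusted R-square and Mallows Cp.

    X must already contain the intercept column, so p = X.shape[1].
    sigma2_full is the estimate of sigma^2, e.g. the RMS of the full model.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(np.sum((y - X @ beta) ** 2))      # residual sum of squares
    sst = float(np.sum((y - y.mean()) ** 2))      # total sum of squares
    rms = sse / (n - p)
    return {
        "RMS": rms,
        "R2": 1.0 - sse / sst,
        "R2_adj": 1.0 - rms / (sst / (n - 1)),
        "Cp": sse / sigma2_full + (2 * p - n),
    }


# Simulated data, for illustration only.
rng = np.random.default_rng(0)
n = 30
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X_full @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(size=n)

# Estimate sigma^2 from the full model, then evaluate full and reduced models.
sigma2 = selection_criteria(X_full, y, 1.0)["RMS"]
full = selection_criteria(X_full, y, sigma2)
reduced = selection_criteria(X_full[:, [0, 1, 3]], y, sigma2)
print(full)
print(reduced)
```

When $\hat{\sigma}^2$ is taken from the full model, the full model's own $C_p$ equals p by construction, so the criterion is informative only for the reduced models.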


1.15

Variable Selection Methods

Backward Elimination: Start with the full model, then delete the least significant variable (the one with the smallest absolute T-value or, equivalently, the largest p-value).

Repeat until all regression coefficients in the model are significant.
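The backward elimination procedure described above can be sketched as follows. This is a minimal illustration, assuming NumPy; the function names, the simulated data, and the cutoff `t_crit=2.0` are assumptions, since the slides do not state a significance level. The intercept is always kept.

```python
import numpy as np


def t_values(X, y):
    """t-values of an OLS fit; X must include the intercept column."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta / se


def backward_elimination(X, y, names, t_crit=2.0):
    """Drop the predictor with the smallest |t| until all |t| >= t_crit.

    X holds the predictors only (no intercept column).
    """
    keep = list(range(X.shape[1]))
    while keep:
        Z = np.column_stack([np.ones(len(y)), X[:, keep]])
        t = np.abs(t_values(Z, y)[1:])   # skip the intercept
        worst = int(np.argmin(t))
        if t[worst] >= t_crit:           # every remaining variable is significant
            break
        del keep[worst]
    return [names[j] for j in keep]


# Simulated data, for illustration only: one real predictor, two noise columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)
names = ["signal", "noise1", "noise2"]
selected = backward_elimination(X, y, names)
print(selected)
```

A fixed |t| cutoff stands in here for a formal p-value threshold; with small samples the appropriate critical value depends on the residual degrees of freedom.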

1.16

Variable Selection Methods

Forward Selection: Start with the empty model, then add the most significant variable (the one with the largest absolute T-value or, equivalently, the smallest p-value).

Repeat until all candidate variables to enter the model have insignificant regression coefficients.
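Forward selection, as described above, can be sketched in the same way. Again this is only an illustration, assuming NumPy; the function names, the simulated data, and the cutoff `t_crit=2.0` are assumptions rather than anything specified in the course.

```python
import numpy as np


def t_values(X, y):
    """t-values of an OLS fit; X must include the intercept column."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta / se


def forward_selection(X, y, names, t_crit=2.0):
    """Add the candidate with the largest |t| until none reaches t_crit.

    X holds the predictors only (no intercept column).
    """
    chosen, remaining = [], list(range(X.shape[1]))
    while remaining:
        best_j, best_t = None, 0.0
        for j in remaining:
            Z = np.column_stack([np.ones(len(y)), X[:, chosen + [j]]])
            t = abs(t_values(Z, y)[-1])  # t-value of the candidate just added
            if t > best_t:
                best_j, best_t = j, t
        if best_j is None or best_t < t_crit:  # no significant candidate left
            break
        chosen.append(best_j)
        remaining.remove(best_j)
    return [names[j] for j in chosen]


# Simulated data, for illustration only: one real predictor, two noise columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)
names = ["signal", "noise1", "noise2"]
selected = forward_selection(X, y, names)
print(selected)
```

Because each candidate is judged in the presence of the variables already chosen, the order of entry need not mirror the order of removal under backward elimination when predictors are correlated.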


1.17

Variable Selection Methods

Stepwise Method: A combination of the Backward and Forward methods.

Other Methods: See any textbook on regression analysis.

Let us apply some of these methods to the Homicides Data.

1.18

Backward Elimination Method

Variable Removed    RMS ($\hat{\sigma}^2$)    Adjusted $R_a^2$
None                  9                       0.97
GOV                   9                       0.97
MAN                  30                       0.89
WM                  268                       0.69

Accordingly, GOV is the least important variable.


1.19

Forward Selection Method

Variable Added      RMS ($\hat{\sigma}^2$)    Adjusted $R_a^2$
GOV                  24                       0.91
MAN                  15                       0.94
WM                    9                       0.97

Accordingly, GOV is the most important variable.

1.20

Reasons for Inconsistency

[Figure: two plots, one labeled GOV and MAN, the other GOV and WM]


1.21

Summary

Conclusions drawn from fitted models that are highly sensitive to a particular data point, a particular variable, or a particular assumption should be treated cautiously.

1.22

Course Outline

1. Motivating Examples

2. Selected References

3. Review of Least Squares (LS) Regression Analysis

4. The Iterative Nature of Regression Analysis

5. The Projection Matrix and its Properties


1.23

Course Outline

6. Sensitivity of the LS fit with Respect to:

• Variables (column sensitivity)

• Observations (row sensitivity)

• Errors of Measurements

• Probability Law of Errors

1.24

Course Outline

7. Robust Regression and Outlier Detection:

• The Brute Force Method

• The LMS Method

• The LAV Method

• The BACON Approach

• The RIRLS Method


1.25

Selected References: Selected Books

• Birkes, D. and Dodge, Y. (1993), Alternative Methods of Regression, New York: Wiley.

• Chatterjee, S. and Hadi, A. S. (1988), Sensitivity Analysis in Linear Regression, New York: Wiley.

• Chatterjee, S. and Hadi, A. S. (2006), Regression Analysis by Example, Fifth Edition, New York: Wiley.

• Rousseeuw, P. J. and Leroy, A. (1987), Robust Regression and Outlier Detection, New York: Wiley.

1.26

Selected References: Selected Articles

Gould, W. and Hadi, A. S. (1993), “Identifying Multivariate Outliers,” Stata Technical Bulletin, 11, 2–5.

Hadi, A. S. (1992), “Identifying Multiple Outliers in Multivariate Data,” Journal of the Royal Statistical Society, Series (B), 54, No. 3, 761–771.

Hadi, A. S. (1992), “A New Measure of Overall Potential Influence in Linear Regression,” Computational Statistics and Data Analysis, 14, 1–27.


1.27

Selected References: Selected Articles

Hadi, A. S. (1994), “A Modification of a Method for the Detection of Outliers in Multivariate Samples,” Journal of the Royal Statistical Society, Series (B), 56, 393–396.

Hadi, A. S. and Simonoff, J. S. (1993), “Procedures for the Identification of Multiple Outliers in Linear Models,” Journal of the American Statistical Association, 88, 1264–1272.

1.28

Selected References: Articles

Hadi, A. S. and Simonoff, J. S. (1994), “Improving the Estimation and Outlier Identification Properties of the Least Median of Squares and Minimum Volume Ellipsoid Estimators,” Parisankhyan Sammikkha, 1, 61–70.

Hadi, A. S. and Simonoff, J. S. (1997), “A More Robust Outlier Identifier for Regression Data,” Bulletin of the International Statistical Institute, 281–282.

Munier, S. (1999), “Multiple Outlier Detection in Logistic Regression,” Student, 3, 117–126.


1.29

Course Outline

1. Motivating Examples

2. Selected References

3. Review of Least Squares (LS) Regression Analysis

4. The Iterative Nature of Regression Analysis

5. The Projection Matrix and its Properties