1.1
Robust Regression Diagnostics
A Graduate Course Presented at the Faculty of Economics and Political
Sciences, Cairo University
Professor Ali S. Hadi
The American University in Cairo
and Cornell University
www.ilr.cornell.edu/~hadi
Copyright © 2017 by Ali S. Hadi
1.2
Regression Analysis
Input → Computer → Output

Input: Data, Model, Fitting Method, Assumptions
Output: Estimated Parameters, Test Statistics, Graphs, Tables
We would like to know how sensitive the output
is to small perturbations in the input.
1.3
Motivating Example 1
New York Rivers Data:
In a 1976 study on land use and water quality in New York rivers, the total nitrogen content was used as a measure of water quality in the 20 New York State river basins.
1.4
New York Rivers
1. Olean          2. Cassadaga
3. Oatka          4. Neversink
5. Hackensack     6. Wappinger
7. Fishkill       8. Honeoye
9. Susquehanna    10. Chenango
11. Tioughnioga   12. West Canada
13. East Canada   14. Saranac
15. Ausable       16. Black
17. Schoharie     18. Raquette
19. Oswegatchie   20. Cohocton
See map of NY State
1.5
Variables Used
Active Agriculture (X1): percentage of land area currently in agricultural use
Forest (X2): percentage of land area in forest
Residential (X3): percentage of residential land area
Commercial/Industrial (X4): percentage of land area in either commercial or manufacturing use
Total Nitrogen (Y): mean concentration (mg/liter)
based on samples taken at regular intervals during the spring, summer, and fall months
1.6
River  X1  X2  X3    X4    Y
1      26  63  1.2   0.29  1.10
2      29  57  0.7   0.09  1.01
3      54  26  1.8   0.58  1.90
4       2  84  1.9   1.98  1.00
5       3  27  29.4  3.11  1.99
6      19  61  3.4   0.56  1.42
7      16  60  5.6   1.11  2.04
8      40  43  1.3   0.24  1.65
9      28  62  1.1   0.15  1.01
10     26  60  0.9   0.23  1.21
11     26  53  0.9   0.18  1.33
12     15  75  0.7   0.16  0.75
13      6  84  0.5   0.12  0.73
14      3  81  0.8   0.35  0.80
15      2  89  0.7   0.35  0.76
16      6  82  0.5   0.15  0.87
17     22  70  0.9   0.22  0.80
18      4  75  0.4   0.18  0.87
19     21  56  0.5   0.13  0.66
20     40  49  1.1   0.13  1.25
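For the sketches later in this section, here is a transcription of the table above into numpy arrays. The array names X and y are our own, not from the slides.

```python
import numpy as np

# New York Rivers data transcribed from the table above.
# Columns of X: X1 (agriculture %), X2 (forest %), X3 (residential %),
# X4 (commercial/industrial %); y is Total Nitrogen (mg/liter).
X = np.array([
    [26, 63, 1.2, 0.29], [29, 57, 0.7, 0.09], [54, 26, 1.8, 0.58],
    [2, 84, 1.9, 1.98],  [3, 27, 29.4, 3.11], [19, 61, 3.4, 0.56],
    [16, 60, 5.6, 1.11], [40, 43, 1.3, 0.24], [28, 62, 1.1, 0.15],
    [26, 60, 0.9, 0.23], [26, 53, 0.9, 0.18], [15, 75, 0.7, 0.16],
    [6, 84, 0.5, 0.12],  [3, 81, 0.8, 0.35],  [2, 89, 0.7, 0.35],
    [6, 82, 0.5, 0.15],  [22, 70, 0.9, 0.22], [4, 75, 0.4, 0.18],
    [21, 56, 0.5, 0.13], [40, 49, 1.1, 0.13],
])
y = np.array([1.10, 1.01, 1.90, 1.00, 1.99, 1.42, 2.04, 1.65, 1.01, 1.21,
              1.33, 0.75, 0.73, 0.80, 0.76, 0.87, 0.80, 0.87, 0.66, 1.25])
```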
1.7
Regression Summary
                 Observation Deleted
T-value    None      5        4        7
t0          1.40    2.08     1.21     1.77
t1          0.39    0.25     0.92     0.68
t2         -0.93   -1.45    -0.74    -1.13
t3         -0.21    4.08    -3.15     0.08
t4          1.86    0.66     4.45     1.83

where $t_j = \hat{\beta}_j / \mathrm{s.e.}(\hat{\beta}_j)$, $j = 0, 1, \ldots, 4$.
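A table like this can be reproduced by refitting the model with one observation left out at a time. The following numpy-only sketch assumes the X and y arrays transcribed after the table on slide 1.6; the function name and printing format are illustrative, not part of the original slides.

```python
import numpy as np

def t_values(X, y):
    """OLS t-values (intercept first) for the regression of y on X."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])      # prepend intercept column
    p = Xd.shape[1]
    XtX_inv = np.linalg.inv(Xd.T @ Xd)
    beta = XtX_inv @ Xd.T @ y                  # least squares estimates
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (n - p)           # residual mean square
    se = np.sqrt(sigma2 * np.diag(XtX_inv))    # s.e. of each estimate
    return beta / se                           # t_j = beta_j / s.e.(beta_j)

print("None:", np.round(t_values(X, y), 2))    # full-data column
for river in (5, 4, 7):                        # observations deleted above
    keep = np.arange(X.shape[0]) != river - 1  # drop one row (0-based)
    print(river, np.round(t_values(X[keep], y[keep]), 2))
```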
1.8
Motivating Example 2
Homicides Data:
This data set comes from a study investigating
the role of firearms in accounting for the
rising homicide rate in Detroit. The data
cover the years 1961–1973.
1.9
Variables Used
FTP: # of full-time police per 100,000 population
UEMP: % of the population unemployed
MAN: # of manufacturing workers (in thousands)
LIC: # of handgun licenses issued per 100,000 population
CLEAR: Percent of homicides cleared by arrest
WM: # of white males in the population
GOV: # of government workers (in thousands)
HOM: # of homicides per 100,000 population
1.10
Estimated Coefficient (T-value)
Coef.    Model 1         Model 2         Model 3
Const.   199.31 (2.4)    252.59 (11.1)   -20.05 (-1.5)
MAN      -0.13 (-4.5)    -0.14 (-5.4)    -0.09 (-2.8)
WM       -0.00 (-2.7)    -0.00 (-15.9)
GOV      0.10 (0.7)                      0.51 (12.0)
1.11
Model Selection Criteria
Minimum Residual Mean Square (RMS):

$$\hat{\sigma}^2 = \frac{\text{SSE}}{n - p},$$

where $\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the residual
sum of squares, $n$ is the number of
observations, and $p$ is the number of
regression coefficients.
1.12
Model Selection Criteria
Maximum R-Square:

$$R^2 = 1 - \frac{\text{SSE}}{\text{SST}},$$

where $\text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of
squares.

Note: Not good for comparing models with
different numbers of predictors.
1.13
Model Selection Criteria
Maximum Adjusted R-Square:

$$R_a^2 = 1 - \frac{\text{SSE}/(n - p)}{\text{SST}/(n - 1)},$$

where $\text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of
squares.

Note: The sums of squares are adjusted for
their degrees of freedom, which imposes a
penalty for including insignificant variables.
1.14
Model Selection Criteria
Mallows C-p: For a model with $p$ predictors,

$$C_p = \frac{\mathbf{Y}^T(\mathbf{I} - \mathbf{P})\mathbf{Y}}{\hat{\sigma}^2} - (n - 2p),$$

where $\hat{\sigma}^2$ is a good estimate of $\sigma^2$ (usually
obtained from the full model).

Note: The above are standard, well-known
criteria, used to judge the adequacy of fit and
to guide variable selection procedures.
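All four criteria on slides 1.11–1.14 can be computed from a single OLS fit. A minimal numpy sketch, assuming X is an n×k predictor matrix, y the response, and sigma2_full the residual mean square of the full model; all names here are illustrative.

```python
import numpy as np

def selection_criteria(X, y, sigma2_full):
    """RMS, R^2, adjusted R^2, and Mallows Cp for one candidate model."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    p = Xd.shape[1]                                 # regression coefficients
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sse = resid @ resid                             # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)               # total sum of squares
    rms = sse / (n - p)                             # slide 1.11
    r2 = 1 - sse / sst                              # slide 1.12
    r2_adj = 1 - (sse / (n - p)) / (sst / (n - 1))  # slide 1.13
    cp = sse / sigma2_full - (n - 2 * p)            # slide 1.14
    return {"RMS": rms, "R2": r2, "AdjR2": r2_adj, "Cp": cp}
```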
1.15
Variable Selection Methods
Backward Elimination: Start with the full model, then delete the least significant variable (the one with the smallest T-value or largest p-value).
Repeat until all regression coefficients in the model are significant.
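A hedged sketch of this procedure, using statsmodels p-values as the significance measure. It assumes a pandas DataFrame df holding the response (here labeled "HOM" after the Homicides Data) and the candidate predictors; the threshold alpha and all names are assumptions for illustration.

```python
import statsmodels.api as sm

def backward_eliminate(df, response="HOM", alpha=0.05):
    """Drop the least significant predictor until all are significant."""
    predictors = [c for c in df.columns if c != response]
    while predictors:
        fit = sm.OLS(df[response], sm.add_constant(df[predictors])).fit()
        pvals = fit.pvalues.drop("const")   # p-values of the slopes only
        worst = pvals.idxmax()              # least significant variable
        if pvals[worst] <= alpha:           # everything is significant
            break
        predictors.remove(worst)            # delete it and refit
    return predictors
```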
1.16
Variable Selection Methods
Forward Selection: Start with the empty model, then add the most significant variable (the one with the largest T-value or smallest p-value).
Repeat until all candidate variables to enter the model have insignificant regression coefficients.
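A companion sketch of forward selection under the same assumptions (DataFrame df, response column "HOM", illustrative names):

```python
import statsmodels.api as sm

def forward_select(df, response="HOM", alpha=0.05):
    """Add the most significant candidate until none would be significant."""
    selected = []
    candidates = [c for c in df.columns if c != response]
    while candidates:
        pvals = {}                           # p-value each candidate would get
        for c in candidates:
            cols = selected + [c]
            fit = sm.OLS(df[response], sm.add_constant(df[cols])).fit()
            pvals[c] = fit.pvalues[c]
        best = min(pvals, key=pvals.get)     # most significant candidate
        if pvals[best] > alpha:              # nothing significant remains
            break
        selected.append(best)
        candidates.remove(best)
    return selected
```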
1.17
Variable Selection Methods
Stepwise Method: A combination of the Backward and Forward methods.
Other Methods: See any textbook on regression analysis.
Let us apply some of these methods to the Homicides Data.
1.18
Backward Elimination Method
Variable Removed   RMS ($\hat{\sigma}^2$)   Adjusted $R_a^2$
None               9                        0.97
GOV                9                        0.97
MAN                30                       0.89
WM                 268                      0.69

Accordingly, GOV is the least important variable.
1.19
Forward Selection Method
Variable Added   RMS ($\hat{\sigma}^2$)   Adjusted $R_a^2$
GOV              24                       0.91
MAN              15                       0.94
WM               9                        0.97

Accordingly, GOV is the most important variable.
1.20
Reasons for Inconsistency
[Plots: GOV versus MAN, and GOV versus WM]
1.21
Summary
Conclusions drawn from fitted models
that are highly sensitive to a particular
data point, a particular variable, or a
particular assumption should be treated
cautiously.
1.22
Course Outline
1. Motivating Examples
2. Selected References
3. Review of Least Squares (LS)
Regression Analysis
4. The Iterative Nature of Regression
Analysis
5. The Projection Matrix and its Properties
1.23
Course Outline
6. Sensitivity of the LS fit with Respect to:
• Variables (column sensitivity)
• Observations (row sensitivity)
• Errors of Measurements
• Probability Law of Errors
1.24
Course Outline
7. Robust Regression and Outlier Detection:
• The Brute Force Method
• The LMS Method
• The LAV Method
• The BACON Approach
• The RIRLS Method
1.25
Selected References: Selected Books
• Birkes, D. and Dodge, Y. (1993), Alternative
Methods of Regression, New York: Wiley.
• Chatterjee, S. and Hadi, A.S. (1988), Sensitivity
Analysis in Linear Regression, New York: Wiley.
• Chatterjee, S. and Hadi, A. S. (2006), Regression
Analysis by Example, Fifth Edition, New York:
Wiley.
• Rousseeuw, P. J. and Leroy, A. (1987), Robust
Regression and Outlier Detection, New York:
Wiley.
1.26
Selected References: Selected Articles
• Gould, W. and Hadi, A. S. (1993), “Identifying
Multivariate Outliers,” Stata Technical Bulletin,
11, 2–5.
• Hadi, A. S. (1992), “Identifying Multiple Outliers in
Multivariate Data,” Journal of the Royal Statistical
Society, Series (B), 54, No. 3, 761–771.
• Hadi, A. S. (1992), “A New Measure of Overall
Potential Influence in Linear Regression,”
Computational Statistics and Data Analysis, 14, 1–
27.
1.27
Selected References: Selected Articles
• Hadi, A. S. (1994), “A Modification of a Method for
the Detection of Outliers in Multivariate
Samples,” Journal of the Royal Statistical Society,
Series (B), 56, 393–396.
• Hadi, A. S. and Simonoff, J. S. (1993), “Procedures
for the Identification of Multiple Outliers in
Linear Models,” Journal of the American
Statistical Association, 88, 1264–1272.
1.28
Selected References: Selected Articles
• Hadi, A. S. and Simonoff, J. S. (1994), “Improving
the Estimation and Outlier Identification
Properties of the Least Median of Squares and
Minimum Volume Ellipsoid Estimators,”
Parisankhyan Sammikkha, 1, 61–70.
• Hadi, A. S. and Simonoff, J. S. (1997), “A More
Robust Outlier Identifier for Regression Data,”
Bulletin of the International Statistical Institute,
281–282.
• Munier, S. (1999), “Multiple Outlier Detection in
Logistic Regression,” Student, 3, 117–126.
1.29
Course Outline
1. Motivating Examples
2. Selected References
3. Review of Least Squares (LS)
Regression Analysis
4. The Iterative Nature of Regression
Analysis
5. The Projection Matrix and its Properties