
14-1

Lecture 14

Diagnostics & Remedial Measures

STAT 512

Spring 2011

Background Reading

KNNL: 6.8, 7.6, 10.5

14-2

Topic Overview

• Usual plots/tests to examine error

assumptions

• Multicollinearity

• CDI Case Study

14-3

Diagnostic (Residual) Plots

• Residuals vs. Normal Quantiles (Check

Normality)

• Residuals vs. Predicted Values (Check

Constant Variance)

• Residuals vs. Predictor Variables (Check

Linearity, Constant Variance)

• Residuals vs. Order of Observations (Check

Independence)
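
A minimal SAS sketch of how such plots might be produced (the dataset and variable names here are placeholders, not from the lecture):

proc reg data=mydata;
   model y = x1 x2;
   output out=diag r=resid p=pred;   * save residuals and fitted values;
run;

proc sgplot data=diag;               * residuals vs. predicted values;
   scatter x=pred y=resid;
run;

proc univariate data=diag;           * residuals vs. normal quantiles;
   qqplot resid / normal(mu=est sigma=est);
run;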

14-4

Diagnostic Tests

• Breusch-Pagan or Brown-Forsythe to test

for constancy of variance.

• Kolmogorov-Smirnov, etc. to test for

normality.

• Lack-of-fit test (we’ll hold off talking about

this one until we’ve discussed ANOVA)
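
As a sketch, the normality tests can be requested in SAS with the NORMAL option of PROC UNIVARIATE on the saved residuals (the dataset diag and variable resid are the placeholders from the sketch above):

proc univariate data=diag normal;   * NORMAL prints Kolmogorov-Smirnov and related tests;
   var resid;
run;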

14-5

Scatter Plot Matrix

• Plots Y, X1, X2, etc. against each of the

other variables.

• Compare Y to X’s to find relationships.

• Compare X’s to each other to identify

potential multicollinearity.
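
One way to draw such a matrix in SAS (9.2 or later) is PROC SGSCATTER; a sketch with placeholder names:

proc sgscatter data=mydata;
   matrix y x1 x2 x3;   * pairwise scatter plots of all listed variables;
run;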

14-6

Remedial Measures

• Transform X if relationship non-linear

• Transform Y if violations of constant

variance and/or normality assumptions

• Use Box-Cox to come up with “best”

transformation on Y.
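
A minimal sketch of a Box-Cox search in SAS uses PROC TRANSREG (dataset and variable names are placeholders):

proc transreg data=mydata;
   model boxcox(y) = identity(x1 x2);   * searches a grid of lambda values for the best power transformation of Y;
run;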

14-7

Multicollinearity (1)

• Definition: Intercorrelation exists

whenever the predictor variables are

correlated. The term multicollinearity is

generally reserved for instances where the

correlation is very high (greater than 0.9).

• Multicollinearity can make it difficult to...

- Judge the relative importance of the predictor variables.

- Ascertain the magnitude of the effect of a predictor variable on the response.

14-8

Ideal Situation

• For a balanced design, absolutely no intercorrelation exists. (For example, if there are two predictor variables, r_12^2 = 0.)

• Uncorrelated predictors do not overlap in the variation in the response that they explain.

• Type I and Type III SS will be identical.

• Slope estimates will also be the same.

14-9

Example

(p. 279) Eight observations on productivity, based on crew size and bonus pay.

Productivity (Y) Crew Size (X1) Bonus Pay (X2)

42, 39 4 2

48, 51 4 3

49, 53 6 2

61, 60 6 3
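
The output on the next slides could be produced by a PROC GLM call along these lines (the dataset name and the response name "prod" are assumptions; "size" and "bonuspay" match the output shown):

proc glm data=productivity;
   model prod = size bonuspay / solution;   * prints Type I and Type III SS by default;
run;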

14-10

Example (2)

Output from PROC GLM:

Source DF SS MS F P-value

Model 2 402.250 201.125 57 0.0004

Error 5 17.625 3.525

Total 7 419.875

Source DF Type I SS Mean Square

size 1 231.1250000 231.1250000

bonuspay 1 171.1250000 171.1250000

Source DF Type III SS Mean Square

size 1 231.1250000 231.1250000

bonuspay 1 171.1250000 171.1250000

14-11

Example (3)

Parameter EST SE T P-value

Intercept 0.375 4.74 0.08 0.9400

size 5.375 0.66 8.10 0.0005

bonuspay 9.250 1.33 6.97 0.0009

If we consider only size:

Source DF SS MS F P-val

Model 1 231.125 231.125 7.4 0.035

Error 6 188.750 31.458

Total 7 419.875

Parameter Est SE T P-value

Intercept 23.5 10.1 2.32 0.0591

size 5.375 1.98 2.71 0.0351

14-12

Perfect Correlation Example

(p. 281) Four observations in three-space, but the predictor values lie exactly on the line X2 = 5 + 0.5 * X1.

X1 X2 Y

2 6 25

8 9 81

6 8 60

10 10 113
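
A sketch of a fit that would trigger the notes on the next slides (the dataset name is an assumption):

data perfect;
   input x1 x2 y;
   datalines;
2 6 25
8 9 81
6 8 60
10 10 113
;
run;

proc glm data=perfect;
   model y = x1 x2 / solution;   * X'X is singular here, so the model is not full rank;
run;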

14-13

Perfect Correlation Example (2)

• Since the points lie exactly on a line in two-space, there are infinitely many regression planes that fit equally well. There is no unique best regression plane.

• If you try to fit this in SAS, you will get output that does not fit the full model (it cannot, since X'X is not invertible).

• SAS is "smart" enough to figure out that something is wrong and tries to do something about it.

14-14

Output

Source DF SS MS F P-val

Model 1 4007.15 4007.15 91.49 0.0108

Error 2 87.60 43.80

Total 3 4094.75

NOTE: Model is not full rank. Least-squares

solutions for the parameters are not unique.

Some statistics will be misleading. A

reported DF of 0 or B means that the

estimate is biased.

14-15

Output (2)

NOTE: The following parameters have been set

to 0, since the variables are a linear

combination of other variables as shown.

x2 = 5 * Intercept + 0.5 * x1

Variable DF Estimate Std.Error T P-value

Intercept B 0.20 7.99 0.03 0.9800

x1 B 10.70 1.12 9.56 0.0108

x2 0 0 . . .

14-16

Effects of Multicollinearity

• Variables are almost never 100% correlated.

• When there is a lot of intercorrelation....

- Can generally still obtain a "good fit".

- Prediction made within the scope of the model is generally unaffected.

- The X'X matrix has a near-zero determinant, which can be a source of serious round-off errors.

14-17

Effects on Regression Coefficients

• Regression coefficients are highly correlated and have large standard errors.

• Cannot use the common interpretation of the regression coefficients (since, for one thing, it probably isn't feasible to hold the other variables constant).

14-18

Simultaneous T-tests

• A common abuse of multiple linear regression models is to do simultaneous t-tests of β_k = 0, k = 1, 2, ..., p - 1.

• The big problem with this is that these are all MARGINAL (or variable-added-last) tests.

• If there were no intercorrelation, all of the variables would act "independently" and this would be no problem. But when there is intercorrelation, one would often end up incorrectly dropping ALL of the variables on this basis.

14-19

Extra Sums of Squares

• When predictor variables are correlated, Type I

and Type III SS tend to be quite different

• Added first, a variable may do a lot in terms of explaining variation; added last, it may not do much.

14-20

Indicators of Multicollinearity

• Large simple correlations between pairs of

predictors.

• F-test says the model is significant, but the marginal t-tests do not show any significance.

• Watch for Type I and Type III SS having

large differences.

• Large changes in estimated regression

coefficients when variables are

added/deleted.

14-21

Variance Inflation Factors

• Formal method for detecting multicollinearity

• VIF is related to the variance of the estimated

regression coefficients (think: variances get

“inflated” by having intercorrelation among

the predictors)

VIF_k = 1 / (1 - R_k^2)

• R_k^2 is the coefficient of determination obtained in the regression of X_k on all the other predictors.

14-22

Variance Inflation Factors (2)

• R_k^2 > 0.9 means that X_k is well predicted by the other variables. This corresponds to a VIF of 10 or higher and indicates excessive multicollinearity.

• Tolerance is defined as

TOL_k = 1 - R_k^2 = 1 / VIF_k

• Tolerance values below 0.01, 0.001, or 0.0001 typically raise concern.
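
For intuition, R_k^2 (and hence VIF_k) can be computed "by hand" by regressing one predictor on the others; a sketch with placeholder names:

proc reg data=mydata;
   model x1 = x2 x3;   * the R-square from this fit is R_1^2, and VIF_1 = 1/(1 - R_1^2);
run;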

14-23

Physicians Case Study (7.37)

• Goal: Predict # of active physicians in a

county (Y) from

1. X1 = Total Population

2. X2 = Total Personal Income

3. X3 = Land Area

4. X4 = Percent of Pop. Age 65 or older

5. X5 = # of hospital beds

6. X6 = Total Serious Crimes

• SAS code available in file CDI.sas.

14-24

Initial Model

• All six predictors included

• VIF and TOL can be used as options after the '/' in the MODEL statement of PROC REG, as sketched below.
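
For this case study the call might look like the following (the dataset name "cdi" is an assumption; the variable names match the output below):

proc reg data=cdi;
   model physicians = tot_pop tot_income land_area pop_elderly beds crimes / vif tol;
run;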

Variable DF Tolerance VIF _

tot_pop 1 0.01192 83.87229

tot_income 1 0.01883 53.10731

land_area 1 0.79952 1.25074

pop_elderly 1 0.94209 1.06147

beds 1 0.12251 8.16293

crimes 1 0.16763 5.96556

14-25

Residual Plots

[Plot: Residuals vs. Predicted Value of physicians]

14-26

Residual Plots

Resid vs. Total Population – See SAS for other variables

[Plot: Residuals vs. tot_pop]

14-27

Normal Probability Plot

[Plot: Residuals vs. Normal Quantiles]

14-28

Assumption Violations

• Errors not normal.

• Variance does not appear to be constant.

• BOXCOX suggests a log transformation,

which clears up some of the issues.
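
The log-transformed response (lphysicians) used in the remaining slides could be created in a DATA step; a sketch, assuming the dataset is named cdi:

data cdi2;
   set cdi;
   lphysicians = log(physicians);   * natural log of the response;
run;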

14-29

Normal Probability Plot

[Plot: Residuals vs. Normal Quantiles, log-transformed model]

14-30

Two Outliers

• Can see these in the QQ-plot.

• Further investigation shows that they are for

Los Angeles County and Cook County

- Twice as many physicians as other counties.

- Also outliers in total population and total income.

- There is reason to drop these two for the time being: it makes sense that such huge counties should not be treated as part of the same population as the rest.

14-31

QQ Plot w/o Outliers

[Plot: Residuals vs. Normal Quantiles, outliers removed]

14-32

Residual Plot

[Plot: Residuals vs. Predicted Value of lphysicians]

14-33

Residual Plot

Resid vs. Total Population – See SAS for other variables

[Plot: Residuals vs. tot_pop]

14-34

Still Problems?

• Normality is ok

• No other unreasonable outliers

• Residual Plot suggests some nonlinearity

• Look at Residual vs. Predictor Variable

Plots to learn more

• Possibly add some quadratic or other terms

• We’ve thus far ignored multicollinearity –

time to consider it.

14-35

Multicollinearity

• The VIFs for tot_pop and tot_income have already informed us that there are problems.

              pop    inc   l_ar    eld   beds
tot_pop      1.00   0.99   0.17  -0.03   0.92
tot_income   0.99   1.00   0.13  -0.02   0.90
land_area    0.17   0.13   1.00   0.01   0.07
pop_elderly -0.03  -0.02   0.01   1.00   0.05
beds         0.92   0.90   0.07   0.05   1.00
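
A correlation matrix like this one can be produced with PROC CORR (dataset name assumed as before):

proc corr data=cdi;
   var tot_pop tot_income land_area pop_elderly beds;
run;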

• We will continue the analysis with model selection.

14-36

Big Picture

• For checking basic assumptions: PLOTS

are generally easier to construct than

TESTS – and generally if there is

something to see, it will show up in the

appropriate plot.

• MULTICOLLINEARITY is a big issue when trying to interpret estimates; however, it's not really a problem for prediction.

14-37

Upcoming in Lecture 15...

• Model Building: Selection Criteria (Ch 9)

• Continuing the Physicians Dataset Analysis
