Lecture 25
• Multiple Regression Diagnostics (Sections 19.4-19.5)
• Polynomial Models (Section 20.2)
19.4 Regression Diagnostics - II
• The conditions required for the model assessment to apply must be checked.
  – Is the error variable normally distributed? Draw a histogram of the residuals.
  – Is the regression function correctly specified as a linear function of x1, …, xk (E(εi) = 0)? Plot the residuals versus the x's and versus ŷ.
  – Is the error variance constant? Plot the residuals versus ŷ.
  – Are the errors independent? Plot the residuals versus the time periods.
  – Can we identify outliers and influential observations?
  – Is multicollinearity a problem?
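A minimal sketch of these residual checks outside JMP, in Python with statsmodels and matplotlib; the data here are synthetic stand-ins, since the course data files are not reproduced in this transcript:

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Synthetic data standing in for a real data set
rng = np.random.default_rng(0)
X = sm.add_constant(rng.uniform(0, 10, (100, 2)))
y = X @ np.array([5.0, 2.0, -1.0]) + rng.normal(0, 1, 100)
fit = sm.OLS(y, X).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
ax1.hist(fit.resid, bins=15)              # is the error variable normal?
ax1.set_title("Histogram of residuals")
ax2.scatter(fit.fittedvalues, fit.resid)  # constant variance? E(eps) = 0?
ax2.axhline(0, linestyle="--", color="gray")
ax2.set_xlabel("y-hat")
ax2.set_title("Residuals versus y-hat")
plt.tight_layout()
plt.show()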
Effects of Violated Assumptions
• Curvature (E(εi) ≠ 0): the slopes βj are no longer meaningful.
  (Potential remedy: transformations of responses and predictors)
• Violations of other assumptions: tests, p-values, and CIs are no longer accurate. That is, inference is invalidated. (Remedies may be difficult.)
Influential Observation
• Influential observation: An observation is influential if removing it would markedly change the results of the analysis.
• In order to be influential, a point must either (i) be an outlier in terms of the relationship between its y and x’s, or (ii) have unusually distant x’s (high leverage) and not fall exactly into the relationship between y and x’s that the rest of the data follows.
Simple Linear Regression Example
• Data in salary.jmp. Y = Weekly Salary, X = Years of Experience.
[Figure: Bivariate Fit of Weekly Salary (0–700) by Years of Experience (0–45)]
Identification of Influential Observations
• Cook’s distance is a measure of the influence of a point – the effect that omitting the observation has on the estimated regression coefficients.
• Use Save Columns, Cook’s D Influence to obtain Cook’s Distance.
• Plot Cook’s Distances: Graph, Overlay Plot, put Cook’s D Influence in Y and leave X blank (plots Cook’s D against row number).
Cook’s Distance
• Rule of thumb: an observation with Cook’s Distance Di > 1 has high influence. You should also be concerned about any observation that has Di < 1 but a much bigger Di than any other observation. Ex. 19.2:
[Figure: Overlay Plot of Cook’s D Influence Price versus Rows; all Di below about 0.25]
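For reference, a hedged sketch of the same computation in Python (statsmodels), on synthetic salary-like data rather than salary.jmp itself; get_influence().cooks_distance yields the values JMP saves as Cook’s D Influence:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 45, 50)                # years of experience (synthetic)
y = 150 + 10 * x + rng.normal(0, 40, 50)  # weekly salary (synthetic)
fit = sm.OLS(y, sm.add_constant(x)).fit()

cooks_d, _ = fit.get_influence().cooks_distance
print("largest Cook's D:", cooks_d.max())
print("rows with D_i > 1:", np.where(cooks_d > 1)[0])  # rule-of-thumb flag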
Strategy for dealing with influential observations/outliers
• Do the conclusions change when the observation is deleted?
  – If No: proceed with the observation included. Study the observation to see if anything can be learned.
  – If Yes: is there reason to believe the case belongs to a population other than the one under investigation?
    • If Yes: omit the case and proceed.
    • If No: does the case have unusually “distant” independent variables?
      – If Yes: omit the case and proceed. Report conclusions for the reduced range of explanatory variables.
      – If No: not much can be said. More data are needed to resolve the questions.
Multicollinearity
• Multicollinearity: Condition in which independent variables are highly correlated.
• Exact collinearity: Y = Weight, X1 = Height in inches, X2 = Height in feet. Since x1 = 12x2, the fitted equations
  ŷ = 1.5 + 2.5x1 + 0x2 and ŷ = 1.5 + 0.5x1 + 24x2
  provide the same predictions.
• Multicollinearity causes two kinds of difficulties:
  – The t statistics appear to be too small.
  – The coefficients cannot be interpreted as “slopes”.
Multicollinearity Diagnostics
• Diagnostics:
  – High correlation between independent variables
  – Counterintuitive signs on regression coefficients
  – Low values for t-statistics despite a significant overall fit, as measured by the F statistic.
Diagnostics: Multicollinearity
• Example 19.2: Predicting house price (Xm19-02) – A real estate agent believes that a house selling price can be
predicted using the house size, number of bedrooms, and lot size.
– A random sample of 100 houses was drawn and data recorded.
– Analyze the relationship among the four variables
  Price    Bedrooms   H Size   Lot Size
  124100   3          1290     3900
  218300   4          2080     6600
  117800   3          1250     3750
  ...      ...        ...      ...
• The proposed model is
  PRICE = β0 + β1·BEDROOMS + β2·H-SIZE + β3·LOTSIZE + ε
  The model is valid, but no variable is significantly related to the selling price?!
Diagnostics: Multicollinearity
Summary of Fit
  RSquare                     0.559998
  RSquare Adj                 0.546248
  Root Mean Square Error      25022.71
  Mean of Response            154066
  Observations (or Sum Wgts)  100

Analysis of Variance
  Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
  Model      3  7.65017e10      2.5501e10    40.7269  <.0001
  Error     96  6.0109e+10      626135896
  C. Total  99  1.36611e11

Parameter Estimates
  Term        Estimate   Std Error  t Ratio  Prob>|t|
  Intercept   37717.595  14176.74    2.66    0.0091
  Bedrooms    2306.0808  6994.192    0.33    0.7423
  House Size  74.296806  52.97858    1.40    0.1640
  Lot Size    -4.363783  17.024     -0.26    0.7982

• Multicollinearity is found to be a problem. The correlation matrix:
            Price   Bedrooms  H Size  Lot Size
  Price     1
  Bedrooms  0.6454  1
  H Size    0.7478  0.8465    1
  Lot Size  0.7409  0.8374    0.9936  1
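The same correlation check can be done outside JMP. A sketch in Python with pandas, where the file name and column labels are assumptions about how the Xm19-02 data might be exported:

import pandas as pd

df = pd.read_csv("Xm19-02.csv")  # assumed columns: Price, Bedrooms, H Size, Lot Size
print(df[["Bedrooms", "H Size", "Lot Size"]].corr().round(4))
# H Size and Lot Size correlate at about 0.99: severe multicollinearity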
Remedying Violations of the Required Conditions
• Nonnormality or heteroscedasticity can be remedied using transformations on the y variable.
• The transformations can improve the linear relationship between the dependent variable and the independent variables.
• Many computer software systems allow us to make the transformations easily.
• A brief list of transformations:
  » y' = log y (for y > 0)
    • Use when σε increases with y, or
    • Use when the error distribution is positively skewed.
  » y' = y²
    • Use when σ²ε is proportional to E(y), or
    • Use when the error distribution is negatively skewed.
  » y' = y^(1/2) (for y > 0)
    • Use when σ²ε is proportional to E(y).
  » y' = 1/y
    • Use when σ²ε increases significantly when y increases beyond some critical value.
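A small sketch of the first transformation in the list, on synthetic data where σε grows with y; after y' = log y the residual spread stabilizes:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 200)
y = np.exp(0.5 + 0.3 * x + rng.normal(0, 0.2, 200))  # spread grows with y

X = sm.add_constant(x)
raw_fit = sm.OLS(y, X).fit()          # heteroscedastic residuals
log_fit = sm.OLS(np.log(y), X).fit()  # y' = log y stabilizes the variance
print(raw_fit.rsquared, log_fit.rsquared)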
Reducing Nonnormality by Transformations: Example
[Figure: before-and-after transformation plots, not reproduced in the transcript]
Durbin-Watson Test: Are the Errors Autocorrelated?
• This test detects first order autocorrelation between consecutive residuals in a time series.
• If autocorrelation exists, the error variables are not independent.
Positive First Order Autocorrelation
[Figure: residuals versus time, with consecutive residuals staying on the same side of 0]
Positive first order autocorrelation occurs when consecutive residuals tend to be similar. Then the value of d is small (less than 2).
Negative First Order Autocorrelation
[Figure: residuals versus time, with consecutive residuals alternating around 0]
Negative first order autocorrelation occurs when consecutive residuals tend to differ markedly. Then the value of d is large (greater than 2).
Durbin-Watson Test in JMP
• H0: No first-order autocorrelation.
  H1: First-order autocorrelation.
• Use Row Diagnostics, Durbin-Watson test in JMP after fitting the model.
• Autocorrelation is an estimate of the correlation between errors.

  Durbin-Watson  Number of Obs.  AutoCorrelation  Prob<DW
  0.5931403      20              0.5914           0.0002
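Outside JMP, the same statistic is one call in Python's statsmodels; the file and column names below are assumptions about the layout of the ski-resort example that follows (Xm19-03):

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv("Xm19-03.csv")  # assumed columns: Tickets, Snowfall, Temperature
fit = smf.ols("Tickets ~ Snowfall + Temperature", data=df).fit()
print("d =", durbin_watson(fit.resid))  # d far below 2 -> positive autocorrelation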
• Example 19.3 (Xm19-03): How does the weather affect the sales of lift tickets in a ski resort?
  – Data on the past 20 years of ticket sales, along with the total snowfall and the average temperature during Christmas week in each year, were collected.
  – The model hypothesized was
    TICKETS = β0 + β1·SNOWFALL + β2·TEMPERATURE + ε
  – Regression analysis yielded the Durbin-Watson results shown above.
Testing the Existence of Autocorrelation, Example
20.1 Introduction
• Regression analysis is one of the most commonly used techniques in statistics.
• It is considered powerful for several reasons:
  – It can cover a variety of mathematical models:
    • linear relationships
    • non-linear relationships
    • nominal independent variables
  – It provides efficient methods for model building.
Curvature: Midterm Problem 10
[Figure: Bivariate Fit of MPG City by Weight (lb); the residual plot versus Weight (lb) shows a curved pattern]
Remedy I: Transformations
• Use Tukey’s Bulging Rule to choose a transformation.
[Figure: Bivariate Fit of 1/MPG City by Weight (lb); the residual plot versus Weight (lb) no longer shows curvature]
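A sketch of this remedy in Python; the cars data file and its column names are placeholders for the midterm data set:

import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("cars.csv")       # assumed columns: MPGCity, WeightLb
cars["GPM"] = 1.0 / cars["MPGCity"]  # y' = 1/y, per Tukey's bulging rule
fit = smf.ols("GPM ~ WeightLb", data=cars).fit()
print(fit.params)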
Remedy II: Polynomial Models
• Multiple regression model: y = β0 + β1x1 + β2x2 + … + βpxp + ε
• Polynomial model in one predictor: y = β0 + β1x + β2x² + … + βpx^p + ε
Quadratic Regression
[Figure: Bivariate Fit of MPG City by Weight (lb) with fitted quadratic curve; the residuals versus Weight (lb) show no remaining pattern]

Parameter Estimates
  Term                    Estimate    Std Error   t Ratio   Prob>|t|
  Intercept               40.166608   0.90223     44.52     <.0001
  Weight(lb)              -0.006894   0.00032     -21.52    <.0001
  (Weight(lb)-2809.5)^2   0.000003    4.634e-7    6.38      <.0001
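The same quadratic fit can be reproduced in Python. Note that JMP centers the squared term at the mean weight (2809.5), which the sketch below imitates; the file and column names are placeholders:

import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("cars.csv")  # assumed columns: MPGCity, WeightLb
fit = smf.ols("MPGCity ~ WeightLb + I((WeightLb - 2809.5)**2)", data=cars).fit()
print(fit.summary())  # the squared term's t ratio tests for curvature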
Polynomial Models with One Predictor Variable
• First order model (p = 1):
  y = β0 + β1x + ε
• Second order model (p = 2):
  y = β0 + β1x + β2x² + ε
  (the parabola opens downward if β2 < 0 and upward if β2 > 0)
• Third order model (p = 3):
  y = β0 + β1x + β2x² + β3x³ + ε
  (β3 < 0 and β3 > 0 produce opposite S-shaped curves)
Interaction
• Two independent variables x1 and x2 interact if the effect of x1 on y is influenced by the value of x2.
• Interaction can be brought into the multiple linear regression model by including the independent variable x1·x2.
• Example: Predicted Income = 1000 + 2000·Educ + 100·IQ + 10·IQ·Educ
Interaction Cont.
• Model: y = β0 + β1x1 + β2x2 + β3x1x2 + ε
• “Slope” for x1 = E(y | x1+1, x2) − E(y | x1, x2) = β1 + β3x2
• Is the expected income increase from an extra year of education higher for people with IQ 100 or with IQ 130 (or is it the same)?
  Predicted Income = 1000 + 2000·Educ + 100·IQ + 10·IQ·Educ
  Here the “slope” for Educ is 2000 + 10·IQ, so the increase is larger for IQ 130.
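A sketch of fitting this interaction model in Python; "income.csv" and its columns are hypothetical stand-ins for the example's data:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("income.csv")  # assumed columns: Income, Educ, IQ
fit = smf.ols("Income ~ Educ + IQ + Educ:IQ", data=df).fit()
b = fit.params
# "Slope" for Educ is b_Educ + b_EducIQ * IQ, so it depends on IQ:
print("at IQ 100:", b["Educ"] + b["Educ:IQ"] * 100)
print("at IQ 130:", b["Educ"] + b["Educ:IQ"] * 130)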
• First order model with two predictors and interaction:
  y = β0 + β1x1 + β2x2 + β3x1x2 + ε
  For fixed values of x2, the line in x1 changes in both intercept and slope:
  X2 = 1: y = [β0 + β2(1)] + [β1 + β3(1)]x1 + ε
  X2 = 2: y = [β0 + β2(2)] + [β1 + β3(2)]x1 + ε
  X2 = 3: y = [β0 + β2(3)] + [β1 + β3(3)]x1 + ε
  [Figure: three non-parallel lines against x1]
  The two variables interact to affect the value of y.
• First order model:
  y = β0 + β1x1 + β2x2 + ε
  The effect of one predictor variable on y is independent of the effect of the other predictor variable on y:
  X2 = 1: y = [β0 + β2(1)] + β1x1 + ε
  X2 = 2: y = [β0 + β2(2)] + β1x1 + ε
  X2 = 3: y = [β0 + β2(3)] + β1x1 + ε
  [Figure: three parallel lines against x1]
Polynomial Models with Two Predictor Variables
• Second order model:
  y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + ε
  For each fixed value of x2, y is a parabola in x1 with a shifted intercept:
  X2 = 1: y = [β0 + β2(1) + β4(1²)] + β1x1 + β3x1² + ε
  X2 = 2: y = [β0 + β2(2) + β4(2²)] + β1x1 + β3x1² + ε
  X2 = 3: y = [β0 + β2(3) + β4(3²)] + β1x1 + β3x1² + ε
  [Figure: three parallel parabolas against x1]
• Second order model with interaction:
  y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ε
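Finally, a sketch of fitting the second order model with interaction via a formula; the data file and column names are placeholders:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("data.csv")  # assumed columns: y, x1, x2
fit = smf.ols("y ~ x1 + x2 + I(x1**2) + I(x2**2) + x1:x2", data=df).fit()
print(fit.params)  # estimates of beta0 ... beta5 in the model above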