Lecture 9: Multiple Linear Regression V

Reading: Chapter 13

STAT 8020 Statistical Methods II, September 17, 2020

Whitney Huang, Clemson University


Agenda

1 Model Diagnostics: Influential Points

2 Non-Constant Variance & Transformation

3 Regression with Both Quantitative and Qualitative Predictors

4 Polynomial Regression


Leverage

Recall that in MLR $\hat{Y} = X(X^T X)^{-1} X^T Y = HY$, where $H$ is the hat matrix

The leverage value for the $i$th observation is defined as:

$h_i = H_{ii}$

Can show that $\mathrm{Var}(e_i) = \sigma^2 (1 - h_i)$, where $e_i = Y_i - \hat{Y}_i$ is the residual for the $i$th observation

$\frac{1}{n} \le h_i \le 1$ for $1 \le i \le n$, and $\bar{h} = \frac{1}{n} \sum_{i=1}^{n} h_i = \frac{p}{n}$ ⇒ a "rule of thumb" is that leverages of more than $\frac{2p}{n}$ should be looked at more closely
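The leverage definition and the $2p/n$ rule of thumb can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the helper `leverages`, the toy predictors `x1` and `x2`, and the choice of plain Python with Gauss-Jordan inversion are all assumptions made here for a self-contained example.

```python
def leverages(X):
    """Diagonal of H = X (X^T X)^{-1} X^T, for a design matrix given as a list of rows."""
    n, p = len(X), len(X[0])
    # Augment X^T X with the identity and invert it by Gauss-Jordan elimination.
    A = [[float(sum(X[k][i] * X[k][j] for k in range(n))) for j in range(p)]
         + [1.0 if i == j else 0.0 for j in range(p)]
         for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[piv] = A[piv], A[c]
        d = A[c][c]
        A[c] = [v / d for v in A[c]]
        for r in range(p):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    inv = [row[p:] for row in A]
    # h_i = x_i^T (X^T X)^{-1} x_i
    return [sum(x[i] * inv[i][j] * x[j] for i in range(p) for j in range(p)) for x in X]

# Hypothetical toy data: intercept plus two predictors, so p = 3 and n = 10.
x1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
x2 = [2, 1, 4, 3, 6, 5, 8, 7, 9, 10]
X = [[1, a, b] for a, b in zip(x1, x2)]

h = leverages(X)
n, p = len(X), len(X[0])
flagged = [i for i, hi in enumerate(h) if hi > 2 * p / n]  # rule of thumb: h_i > 2p/n
print([round(hi, 3) for hi in h])
print(flagged)
```

The leverages sum to $p$, so their mean is $p/n$; the last observation, whose `x1` value sits far from the rest, is among those flagged.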


Leverage Values of Species ∼ Elev + Adj

[Figure: scatter plot with Elevation on the horizontal axis and Adjacent on the vertical axis.]


Studentized Residuals

As we have seen, $\mathrm{Var}(e_i) = \sigma^2 (1 - h_i)$; this suggests the use of $r_i = \dfrac{e_i}{\hat{\sigma} \sqrt{1 - h_i}}$

The $r_i$'s are called studentized residuals. They are sometimes preferred in residual plots because they have been standardized to have equal variance.

If the model assumptions are correct, then $\mathrm{Var}(r_i) = 1$ and $\mathrm{Corr}(e_i, e_j)$ tends to be small
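As a rough sketch of the computation, the snippet below evaluates $r_i$ for a toy data set. It assumes a single predictor (so $p = 2$), which lets the fit and the leverages be written in closed form; the data values are hypothetical and the last response is placed far off the trend on purpose.

```python
import math

# Hypothetical toy data with one predictor, so p = 2 (intercept and slope).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 25.0]
n, p = len(x), 2

# Least-squares fit in closed form.
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]           # residuals e_i
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]             # leverages h_i (p = 2 closed form)
sigma_hat = math.sqrt(sum(ei ** 2 for ei in e) / (n - p))    # sqrt(MSE)
r = [ei / (sigma_hat * math.sqrt(1 - hi)) for ei, hi in zip(e, h)]

print([round(ri, 2) for ri in r])
```

The last observation has both the largest raw residual and the largest studentized residual here; dividing by $\sqrt{1 - h_i}$ inflates residuals at high-leverage points, where the raw residual variance is smallest.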


Studentized Residuals of Species ∼ Elev + Adj

[Figure: Studentized Residuals; vertical axis $r_i$.]


Studentized Deleted Residuals

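For a given model, exclude the observation $i$ and recompute $\hat{\beta}_{(i)}$, $\hat{\sigma}_{(i)}$ to obtain $\hat{Y}_{i(i)}$

The observation $i$ is an outlier if $\hat{Y}_{i(i)} - Y_i$ is "large"

Can show that $\mathrm{Var}(\hat{Y}_{i(i)} - Y_i)$ is estimated by

$\hat{\sigma}_{(i)}^2 \left( 1 + x_i^T (X_{(i)}^T X_{(i)})^{-1} x_i \right) = \dfrac{\hat{\sigma}_{(i)}^2}{1 - h_i}$

Define the Studentized Deleted Residuals as

$t_i = \dfrac{\hat{Y}_{i(i)} - Y_i}{\sqrt{\hat{\sigma}_{(i)}^2 / (1 - h_i)}} = \dfrac{\hat{Y}_{i(i)} - Y_i}{\sqrt{MSE_{(i)} (1 - h_i)^{-1}}}$

which are distributed as $t_{n-p-1}$ if the model is correct and $\varepsilon \sim N(0, \sigma^2 I)$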
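The deletion does not actually require $n$ separate refits. The sketch below compares the literal delete-and-refit route with the standard shortcut identities $Y_i - \hat{Y}_{i(i)} = e_i / (1 - h_i)$ and $(n - p - 1)\,MSE_{(i)} = SSE - e_i^2 / (1 - h_i)$. It is an illustration under assumptions: one predictor ($p = 2$), hypothetical data, and a helper `fit_line` introduced here.

```python
import math

def fit_line(x, y):
    """Least-squares fit with one predictor; returns (b0, b1, xbar, sxx)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1, xbar, sxx

# Hypothetical toy data; the last response is deliberately far off the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 25.0]
n, p = len(x), 2  # p counts the intercept and the slope

# Route 1: literally delete case i, refit, and predict Y_i.
t_refit = []
for i in range(n):
    xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    b0, b1, xbar_i, sxx_i = fit_line(xs, ys)
    mse_i = sum((yk - b0 - b1 * xk) ** 2 for xk, yk in zip(xs, ys)) / (n - 1 - p)
    # 1 + x_i^T (X_(i)^T X_(i))^{-1} x_i, in closed form for one predictor
    infl = 1 + 1 / (n - 1) + (x[i] - xbar_i) ** 2 / sxx_i
    t_refit.append((b0 + b1 * x[i] - y[i]) / math.sqrt(mse_i * infl))

# Route 2: one full fit, then the shortcut identities.
b0, b1, xbar, sxx = fit_line(x, y)
sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
t_short = []
for xi, yi in zip(x, y):
    e = yi - b0 - b1 * xi
    h = 1 / n + (xi - xbar) ** 2 / sxx
    mse_i = (sse - e ** 2 / (1 - h)) / (n - p - 1)   # MSE_(i) without refitting
    t_short.append(-e / math.sqrt(mse_i * (1 - h)))  # minus sign: the text subtracts Y_i from Yhat_{i(i)}

print(max(abs(a - b) for a, b in zip(t_refit, t_short)))  # should be near zero
```

With the subtraction order used in the definition above, $t_i$ has the opposite sign of $e_i$; many software packages report $e_i / \sqrt{MSE_{(i)} (1 - h_i)}$ instead, which differs only in sign.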


Jackknife Residuals of Species ∼ Elev + Adj

[Figure: Jackknife Residuals.]


Influential Observations

DFFITS

Difference between the fitted values $\hat{Y}_i$ and the predicted values $\hat{Y}_{i(i)}$

$DFFITS_i = \dfrac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MSE_{(i)} h_i}}$

Concern if the absolute value is greater than 1 for small data sets, or greater than $2\sqrt{p/n}$ for large data sets
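A small sketch of the cutoff arithmetic and of the DFFITS formula follows. The function names and the single-observation inputs are hypothetical. The sizes $p = 3$ (intercept, Elev, Adj) and $n = 30$ are assumptions read off the Species example; they are not stated in the text, but they do reproduce the 0.63 threshold shown in the DFFITS plot.

```python
import math

def dffits_threshold(p, n):
    """Large-data-set cutoff 2 * sqrt(p / n) from the rule of thumb above."""
    return 2 * math.sqrt(p / n)

def dffits(e_i, h_i, mse_del):
    """DFFITS_i from the residual e_i, leverage h_i, and MSE_(i).

    Uses Yhat_i - Yhat_{i(i)} = h_i * e_i / (1 - h_i), so no refit is needed.
    """
    return (h_i * e_i / (1 - h_i)) / math.sqrt(mse_del * h_i)

# Assumed sizes: p = 3 coefficients and n = 30 observations.
cutoff = dffits_threshold(3, 30)
print(round(cutoff, 2))  # 0.63

# Hypothetical values for a single observation.
value = dffits(e_i=4.0, h_i=0.36, mse_del=4.0)
print(abs(value) > cutoff)  # True: this case would be flagged
```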


DFFITS of Species ∼ Elev + Adj

[Figure: Influence Diagnostics for Species. DFFITS plotted against Observation, with threshold 0.63; observations 16 and 25 are labeled.]


Residual Plot of Species ∼ Elev + Adj

[Figure: Residuals; $e$ on the vertical axis against $\hat{Y}$ on the horizontal axis.]


Residual Plot After Square Root Transformation

[Figure: Residuals; $e$ on the vertical axis against $\hat{Y}$ on the horizontal axis.]


Regression with Both Quantitative and Qualitative Predictors

Multiple Linear Regression

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{p-1} X_{p-1} + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$

$X_1, X_2, \cdots, X_{p-1}$ are the predictors.

Question: What if some of the predictors are qualitative (categorical) variables?

⇒ We will need to create dummy (indicator) variables for those categorical variables

Example: We can encode Gender as 1 (Female) and 0 (Male)


Salaries for Professors Data Set

The 2008-09 nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data were collected as part of the on-going effort of the college's administration to monitor salary differences between male and female faculty members.


Predictors

We have three categorical variables, namely, rank, discipline, and sex.


Dummy Variable

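For binary categorical variables:

$X_{\text{sex}} = \begin{cases} 0 & \text{if sex = male}, \\ 1 & \text{if sex = female}. \end{cases}$

$X_{\text{discip}} = \begin{cases} 0 & \text{if discip = A}, \\ 1 & \text{if discip = B}. \end{cases}$

For a categorical variable with more than two categories (here rank, with Assistant Prof as the baseline level):

$X_{\text{rank1}} = \begin{cases} 1 & \text{if rank = Associate Prof}, \\ 0 & \text{otherwise}. \end{cases}$

$X_{\text{rank2}} = \begin{cases} 1 & \text{if rank = Full Prof}, \\ 0 & \text{otherwise}. \end{cases}$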


Design Matrix

With the design matrix $X$, we can now use the method of least squares to fit the model $Y = X\beta + \varepsilon$
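To make the construction concrete, here is a minimal sketch of how rows of such a design matrix could be assembled. It assumes the usual baseline coding, in which Assistant Prof is the reference rank and each rank dummy is 1 for its own level and 0 otherwise. The function `design_row`, the level labels (`"AsstProf"`, `"AssocProf"`, `"Prof"`, `"A"`, `"B"`, `"Male"`, `"Female"`), the key names, and the three records are all hypothetical.

```python
def design_row(record):
    """Map one faculty record to [1, X_rank1, X_rank2, X_discip, X_sex, yrs_since_phd]."""
    return [
        1,                                           # intercept column
        1 if record["rank"] == "AssocProf" else 0,   # X_rank1 (baseline: AsstProf)
        1 if record["rank"] == "Prof" else 0,        # X_rank2 (baseline: AsstProf)
        1 if record["discipline"] == "B" else 0,     # X_discip
        1 if record["sex"] == "Female" else 0,       # X_sex
        record["yrs_since_phd"],                     # quantitative predictor, used as is
    ]

# Hypothetical records, one per rank.
records = [
    {"rank": "AsstProf", "discipline": "A", "sex": "Male", "yrs_since_phd": 3},
    {"rank": "AssocProf", "discipline": "B", "sex": "Female", "yrs_since_phd": 12},
    {"rank": "Prof", "discipline": "B", "sex": "Male", "yrs_since_phd": 25},
]

X = [design_row(r) for r in records]
for row in X:
    print(row)
```

A variable with three levels contributes only two dummy columns; a third one would be the intercept column minus the other two, which would make $X^T X$ singular.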


Model Fit

Question: How do we interpret the coefficients of these dummy variables (e.g. $\beta_{\text{rankAssocProf}}$)?
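One way to read such a coefficient, sketched under the assumption of an additive model in which Assistant Prof is the baseline rank; the symbol $Z$, introduced here for illustration, stands for the remaining predictors held at fixed values:

$$
E[Y \mid \text{rank = AssocProf}, Z] - E[Y \mid \text{rank = AsstProf}, Z] = \beta_{\text{rankAssocProf}}
$$

Under that assumption, $\beta_{\text{rankAssocProf}}$ would be the expected salary difference between an Associate Professor and an Assistant Professor who share the same values of all other predictors.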


lm(salary ~ sex * yrs.since.phd)

[Figure: plot against yrs.since.phd, with a legend for Male and Female.]
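Written out, a formula of the form `sex * yrs.since.phd` expands to both main effects plus their interaction. A sketch, assuming the coding $X_{\text{sex}} = 1$ for female defined earlier (software may choose the other level as baseline) and writing $T$ for yrs.since.phd, a symbol introduced here for brevity:

$$
\begin{aligned}
Y &= \beta_0 + \beta_1 X_{\text{sex}} + \beta_2 T + \beta_3 X_{\text{sex}} T + \varepsilon \\
\text{Male } (X_{\text{sex}} = 0): \quad E[Y] &= \beta_0 + \beta_2 T \\
\text{Female } (X_{\text{sex}} = 1): \quad E[Y] &= (\beta_0 + \beta_1) + (\beta_2 + \beta_3) T
\end{aligned}
$$

Under this coding, $\beta_1$ shifts the intercept and $\beta_3$ shifts the slope, so the two groups get separate lines.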


Polynomial Regression

Suppose we would like to model the relationship between response $Y$ and a predictor $X$ as a $p$th degree polynomial in $X$:

$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \cdots + \beta_p X_i^p + \varepsilon_i$

We can treat polynomial regression as a special case of multiple linear regression. Specifically, the design matrix takes the following form:

$X = \begin{pmatrix} 1 & X_1 & X_1^2 & \cdots & X_1^p \\ 1 & X_2 & X_2^2 & \cdots & X_2^p \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_n & X_n^2 & \cdots & X_n^p \end{pmatrix}$
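The design matrix above can be built and used directly. The sketch below is illustrative only: `poly_design` and `least_squares` are hypothetical helpers, the data are a noise-free quadratic made up for the check, and the normal equations are solved by plain Gauss-Jordan elimination to keep the example self-contained.

```python
def poly_design(x, degree):
    """Rows (1, x_i, x_i^2, ..., x_i^degree) of the polynomial design matrix."""
    return [[xi ** j for j in range(degree + 1)] for xi in x]

def least_squares(X, y):
    """Solve the normal equations (X^T X) b = X^T y by Gauss-Jordan elimination."""
    n, q = len(X), len(X[0])
    # Augmented system [X^T X | X^T y].
    A = [[float(sum(X[k][i] * X[k][j] for k in range(n))) for j in range(q)]
         + [float(sum(X[k][i] * y[k] for k in range(n)))]
         for i in range(q)]
    for c in range(q):
        piv = max(range(c, q), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[piv] = A[piv], A[c]
        d = A[c][c]
        A[c] = [v / d for v in A[c]]
        for r in range(q):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [row[q] for row in A]

# Hypothetical noise-free data from Y = 1 + 2X + 3X^2.
x = [0, 1, 2, 3, 4]
y = [1 + 2 * xi + 3 * xi ** 2 for xi in x]

X = poly_design(x, 2)
beta = least_squares(X, y)
print([round(b, 6) for b in beta])  # approximately [1.0, 2.0, 3.0]
```

For higher degrees the columns of $X$ become strongly correlated, so in practice an orthogonal polynomial basis or a QR-based solver is usually a safer choice than the raw normal equations used in this sketch.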


Housing Values in Suburbs of Boston Data Set

[Figure: Median value of owner-occupied homes plotted against Lower status of the population (percent).]


Polynomial Regression Fits

[Figure: Median value of owner-occupied homes plotted against Lower status of the population (percent).]