
Page 1:

Applied Linear Regression

CSTAT Workshop, March 16, 2007

Vince Melfi

Page 2:

References

• “Applied Linear Regression,” Third Edition by Sanford Weisberg.

• “Linear Models with R,” by Julian Faraway.

• Countless other books on Linear Regression, statistical software, etc.

Page 3:

Statistical Packages

• Minitab (we’ll use this today)

• SPSS

• SAS

• R

• S-PLUS

• JMP

• ETC!!

Page 4:

Outline

I. Simple linear regression review

II. Multiple Regression: Adding predictors

III. Inference in Regression

IV. Regression Diagnostics

V. Model Selection

Page 5: I. Simple Linear Regression Review

Savings Rate Data

Data on Savings Rate and other variables for 50 countries. Want to explore the effect of variables on savings rate.

• SaveRate: Aggregate Personal Savings divided by disposable personal income. (Response variable.)

• Pop>75: Percent of the population over 75 years old. (One of the predictors.)

Page 6: I. Simple Linear Regression Review

[Figure: Scatterplot of SaveRate vs. pop>75]

Page 7: I. Simple Linear Regression Review

Regression Output

The regression equation is
SaveRate = 7.152 + 1.099 pop>75

S = 4.29409 R-Sq = 10.0% R-Sq(adj) = 8.1%

Analysis of Variance

Source      DF       SS       MS     F      P
Regression   1   98.545  98.5454  5.34  0.025
Error       48  885.083  18.4392
Total       49  983.628

(Slide callouts: the regression equation is the fitted model; R-Sq is R2, the coefficient of determination; the ANOVA F test is testing the model.)
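The workshop uses Minitab, but the same fit is easy to reproduce elsewhere. A minimal R sketch, assuming a hypothetical data frame savings with columns SaveRate and pop75 (R variable names can't contain ">"):

    fit <- lm(SaveRate ~ pop75, data = savings)  # hypothetical data frame
    summary(fit)  # coefficients, S (residual standard error), R-Sq
    anova(fit)    # the analysis of variance table shown above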

Page 8:

Importance of Plots

• Four data sets

• All have:
  – Regression line Y = 3 + 0.5 x
  – R2 = 66.7%
  – S = 1.24
  – Same t statistics, etc.

• Without looking at plots, the four data sets would seem similar.
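These four data sets appear to be Anscombe's quartet, which ships with R as the built-in anscombe data frame, so the near-identical fits are easy to verify:

    data(anscombe)
    for (i in 1:4) {
      f <- lm(as.formula(paste0("y", i, " ~ x", i)), data = anscombe)
      print(coef(f))  # each fit is approximately intercept 3, slope 0.5
    }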

Page 9: I. Simple Linear Regression Review

Importance of Plots (1)

[Figure: Fitted line plot, y1 = 3.000 + 0.5001 x1; S = 1.23660, R-Sq = 66.7%, R-Sq(adj) = 62.9%]

Page 10: I. Simple Linear Regression Review

Importance of Plots (2)

[Figure: Fitted line plot, y2 = 3.001 + 0.5000 x1; S = 1.23721, R-Sq = 66.6%, R-Sq(adj) = 62.9%]

Page 11: I. Simple Linear Regression Review

Importance of Plots (3)

[Figure: Fitted line plot, y3 = 3.002 + 0.4997 x1; S = 1.23631, R-Sq = 66.6%, R-Sq(adj) = 62.9%]

Page 12: I. Simple Linear Regression Review

Importance of Plots (4)

[Figure: Fitted line plot, y4 = 3.002 + 0.4999 x2; S = 1.23570, R-Sq = 66.7%, R-Sq(adj) = 63.0%]

Page 13: I. Simple Linear Regression Review

The model

• Yi = β0 + β1xi + ei, for i = 1, 2, …, n

• “Errors” e1, e2, …, en are assumed to be independent.

• Usually e1, e2, …, en are assumed to have the same standard deviation, σ.

• Often e1, e2, …, en are assumed to be normally distributed.

Page 14: I. Simple Linear Regression Review

Least Squares

• The regression line (line of best fit) is based on “least squares.”

• The regression line is the line that minimizes the sum of squared vertical deviations of the data from the line.

• The least squares line has certain optimality properties.

• The least squares line is denoted Ŷi = β̂0 + β̂1Xi

Page 15: I. Simple Linear Regression Review

Residuals

• The residuals represent the difference between the data and the least squares line:

êi = Yi − Ŷi

[Figure: Scatterplot of Y vs. X with the least squares line; the residuals are the vertical distances from the points to the line]

Page 16: I. Simple Linear Regression Review

Checking assumptions

• Residuals are the main tool for checking model assumptions, including linearity and constant variance.

• Plotting the residuals versus the fitted values is always a good idea, to check linearity and constant variance.

• Histograms and Q-Q plots (normal probability plots) of residuals can help to check the normality assumption.
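A rough R analogue of these diagnostic plots, reusing the hypothetical fit object from earlier:

    par(mfrow = c(2, 2))
    plot(fit)             # includes residuals vs. fitted and a normal Q-Q plot
    hist(residuals(fit))  # histogram of residuals, for the normality check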

Pages 17-20: I. Simple Linear Regression Review

[No slide text was extracted for pages 17-20; figures only]

Page 21: I. Simple Linear Regression Review

[Figure: Residual Plots for SaveRate, Minitab's "four in one" plot: normal probability plot of residuals, residuals versus fits, histogram of residuals, and residuals versus observation order]

Page 22: I. Simple Linear Regression Review

Coefficient of determination (R2)

Residual sum of squares, aka sum of squares for error:

RSS = SSE = Σi êi²

Total sum of squares:

TSS = SST = Σi (yi − ȳ)²

Coefficient of determination:

R² = 1 − RSS/TSS
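These quantities are easy to compute directly; a sketch continuing the earlier hypothetical savings example:

    e   <- residuals(fit)
    RSS <- sum(e^2)                                            # residual sum of squares
    TSS <- sum((savings$SaveRate - mean(savings$SaveRate))^2)  # total sum of squares
    1 - RSS / TSS                                              # R-squared; matches summary(fit)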

Page 23: I. Simple Linear Regression Review

R2

• The coefficient of determination, R2, measures the proportion of the variability in Y that is explained by the linear relationship with X.

• It’s also the square of the Pearson correlation coefficient between X and Y.

Page 24: II. Multiple regression: Adding predictors

Adding a predictor

• Recall: The fitted model was SaveRate = 7.152 + 1.099 pop>75 (the p-value for the test of whether pop>75 is significant was 0.025).

• Another predictor: DPI (per-capita disposable income)

• Fitted model: SaveRate = 8.57 + 0.000996 DPI (p-value for DPI: 0.124)

Page 25: II. Multiple regression: Adding predictors

Adding a predictor (2)

• Model with both pop>75 and DPI is SaveRate = 7.06 + 1.30 pop>75 - 0.00034 DPI

• p-values are 0.100 and 0.738 for pop>75 and DPI

• The sign of the coefficient of DPI has changed!

• pop>75 was significant alone, but neither it nor DPI is significant when both are in the model!
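A sketch of the three fits in R, assuming the hypothetical savings data frame also has a DPI column:

    summary(lm(SaveRate ~ pop75, data = savings))        # pop75 alone: p = 0.025
    summary(lm(SaveRate ~ DPI, data = savings))          # DPI alone: p = 0.124
    summary(lm(SaveRate ~ pop75 + DPI, data = savings))  # together: neither significant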

Page 26: II. Multiple regression: Adding predictors

Adding a predictor (3)

[Figure: Fitted line plot, pop>75 = 1.158 + 0.001025 DPI; S = 0.804599, R-Sq = 61.9%, R-Sq(adj) = 61.1%]

• What happened??

• The predictors pop>75 and DPI are highly correlated.

Page 27: II. Multiple regression: Adding predictors

Added variable plots and partial correlation

1. Residuals from a fit of SaveRate versus pop>75 give the variability in SaveRate that’s not explained by pop>75.

2. Residuals from a fit of DPI versus pop>75 give the variability in DPI that’s not explained by pop>75.

3. A fit of the residuals from (1) versus the residuals from (2) gives the relationship between SaveRate and DPI after adjusting for pop>75. This is called an “added variable plot.”

4. The correlation between the residuals from (1) and the residuals from (2) is the “partial correlation” between SaveRate and DPI adjusted for pop>75.
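The four steps translate directly into R (same hypothetical savings data frame as before):

    r1 <- residuals(lm(SaveRate ~ pop75, data = savings))  # step 1
    r2 <- residuals(lm(DPI ~ pop75, data = savings))       # step 2
    plot(r2, r1)          # step 3: the added variable plot
    coef(lm(r1 ~ r2))[2]  # slope equals DPI's coefficient in the two-predictor model
    cor(r1, r2)           # step 4: partial correlation of SaveRate and DPI given pop>75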

Page 28: II. Multiple regression: Adding predictors

Added variable plot

[Figure: Added variable plot (fitted line plot), RESSRvspop>75 = 0.0000 − 0.000341 RESDPIvspop>75; S = 4.28891, R-Sq = 0.2%, R-Sq(adj) = 0.0%]

Note that the slope term, −0.000341, is the same as the slope term for DPI in the two-predictor model.

Page 29: II. Multiple regression: Adding predictors

Scatterplot matrices (Matrix Plots)

• With one predictor X, a scatterplot of Y vs. X is very informative.

• With more than one predictor, scatterplots of Y vs. each of the predictors, and of each of the predictors vs. each other, are needed.

• A scatterplot matrix (or matrix plot) is just an organized display of these plots.
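In R a single call draws the whole matrix (column names are hypothetical):

    pairs(savings[, c("SaveRate", "pop15", "pop75", "DPI", "changeDPI")])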

Page 30: II. Multiple regression: Adding predictors

[Figure: Matrix plot (scatterplot matrix) of SaveRate, pop<15, pop>75, DPI, and changeDPI]

Page 31: II. Multiple regression: Adding predictors

Changes in R2

• Consider adding a predictor X2 to a model that already contains the predictor X1

• Let R2,1 be the R2 value for the fit of Y vs. X1, and let R2,2 be the R2 value for the fit of Y vs. X2

Page 32: II. Multiple regression: Adding predictors

Changes in R2 (2)

• The R2 value for the multiple regression fit is always at least as large as R2,1 and R2,2

• The R2 value for the multiple regression fit of Y versus X1 and X2 may be:
  – less than R2,1 + R2,2 (if the two predictors explain the same variation)
  – equal to R2,1 + R2,2 (if the two predictors measure different things)
  – more than R2,1 + R2,2 (e.g., the response is the area of a rectangle, and the two predictors are length and width)

Page 33: II. Multiple regression: Adding predictors

Multiple regression model

• Response variable Y

• Predictors X1, X2, …, Xp

Yi = β0 + β1Xi1 + β2Xi2 + … + βpXip + ei

• Same assumptions on the errors ei (independent, constant variance, normality)

Page 34: III. Inference in regression

Inference in regression

• Most inference procedures assume independence, constant variance, and normality of the errors.

• Most are “robust” to departures from normality, meaning that the p-values, confidence levels, etc. are approximately correct even if normality does not hold.

• In general, techniques like the bootstrap can be used when normality is suspect.

Page 35: III. Inference in regression

New data set

• Response variable:
  – Fuel = per-capita fuel consumption (times 1000)

• Predictors:
  – Dlic = proportion of the population who are licensed drivers (times 1000)
  – Tax = gasoline tax rate
  – Income = per-person income in thousands of dollars
  – logMiles = base 2 log of federal-aid highway miles in the state

Page 36: III. Inference in regression

t tests

Regression Analysis: Fuel versus Tax, Dlic, Income, logMiles

The regression equation is
Fuel = 154 - 4.23 Tax + 0.472 Dlic - 6.14 Income + 18.5 logMiles

Predictor      Coef  SE Coef      T      P
Constant      154.2    194.9   0.79  0.433
Tax          -4.228    2.030  -2.08  0.043
Dlic         0.4719   0.1285   3.67  0.001
Income       -6.135    2.194  -2.80  0.008
logMiles     18.545    6.472   2.87  0.006

(The T column gives the t statistics; the P column gives the p-values.)
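A hedged R version of the same fit, assuming a data frame fuel with these column names:

    m.full <- lm(Fuel ~ Tax + Dlic + Income + logMiles, data = fuel)
    summary(m.full)  # each T is Coef / SE Coef; each P is the two-sided p-value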

Page 37: III. Inference in regression

t tests (2)

• The t statistic tests the hypothesis that a particular slope parameter is zero.

• The formula is

t = (coefficient estimate)/(standard error)

• degrees of freedom are n-(p+1)

• p-values given are for the two-sided alternative

• This is like simple linear regression

Page 38: III. Inference in regression

F tests

• General structure:
  – Ha: Large model
  – H0: Smaller model, obtained by setting some parameters in the large model to zero, or equal to each other, or equal to a constant
  – RSSAH = resid. sum of squares after fitting the large (alt. hypothesis) model
  – RSSNH = resid. sum of squares after fitting the smaller (null hypothesis) model
  – dfNH and dfAH are the corresponding degrees of freedom

Page 39: III. Inference in regression

F tests (2)

• Test statistic:

F = [(RSSNH − RSSAH) / (dfNH − dfAH)] / (RSSAH / dfAH)

• Null distribution: F distribution with dfNH − dfAH numerator and dfAH denominator degrees of freedom

Page 40: III. Inference in regression

F test example

• Can the “economic” variables Tax and Income be dropped from the model with all four predictors?

• AH model includes all predictors

• NH model includes only Dlic and logMiles

• Fit both models and get RSS and df values

Page 41: III. Inference in regression

F test example (2)

• RSSAH = 193700; dfAH = 46

• RSSNH = 243006; dfNH = 48

F = [(243006 − 193700) / (48 − 46)] / (193700 / 46) = 5.85

• P-value is the area to the right of 5.85 under an F(2,46) distribution, approx. 0.0054.

• There’s pretty strong evidence that removing both Tax and Income is unwise.
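In R, the same F test comes from fitting both models and comparing them with anova(), continuing the hypothetical fuel example:

    m.NH <- lm(Fuel ~ Dlic + logMiles, data = fuel)  # null-hypothesis model
    anova(m.NH, m.full)  # F = 5.85 on (2, 46) df, p ≈ 0.005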

Page 42: III. Inference in regression

Another F test example

• Question: Does it make sense that the two “economic” predictors should have the same coefficient?

• Ha: Y = β0 + β1 Tax + β2 Dlic + β3 Income + β4 logMiles + error

• H0: Y = β0 + β1 Tax + β2 Dlic + β1 Income + β4 logMiles + error

• Note: H0 can be rewritten as Y = β0 + β1 (Tax + Income) + β2 Dlic + β4 logMiles + error

Page 43: III. Inference in regression

Another F test example (2)

• Fit the full model (AH).

• Create a new predictor “TI” by adding Tax and Income, and fit a model with TI, Dlic, and logMiles (NH).

F = [(195487 − 193700) / (47 − 46)] / (193700 / 46) = 0.424

• P-value is the area to the right of 0.424 under an F(1,46) distribution, approx. 0.518.

• This suggests that the simpler model with the same coefficient for Tax and Income fits well.
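In R, the equal-coefficient constraint can be imposed directly in the formula with I(), under the same assumptions:

    m.TI <- lm(Fuel ~ I(Tax + Income) + Dlic + logMiles, data = fuel)
    anova(m.TI, m.full)  # F ≈ 0.42 on (1, 46) df, p ≈ 0.52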

Page 44: III. Inference in regression

Removing one predictor

• We have two ways to test whether one predictor can be removed from the model:– t test– F test

• The tests are equivalent, in the sense that t2 = F, and that the p-values will be equivalent.

Page 45: III. Inference in regression

Confidence regions

• Confidence intervals for one parameter use the familiar t-interval.

• For example, to form a 95% confidence interval for the parameter of Income in the context of the full (four predictor) model:

• -6.135 ± (2.013)(2.194) = -6.135 ± 4.417.

(The estimate −6.135 and standard error 2.194 come from the Minitab output; the multiplier 2.013 comes from the t distribution with 46 df.)
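A sketch of the same interval in R:

    qt(0.975, df = 46)         # ≈ 2.013, the t multiplier used above
    confint(m.full, "Income")  # the 95% interval computed directly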

Page 46: III. Inference in regression

Joint confidence regions

• Joint confidence regions for two or more parameters are more complex, and use the F distribution in place of the t distribution.

• Minitab (and SPSS, and …) can’t draw these easily

• On the next page is a joint confidence region for the parameters of Dlic and Tax, drawn in R.

Page 47: III. Inference in regression

[Figure: Joint confidence region for the parameters of Dlic and Tax, drawn in R, with dotted lines indicating the individual confidence intervals for the two; the point (0,0) and the boundary of the confidence region are marked]
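One way to draw such a region is with confidenceEllipse() from R's car package (a sketch; the package and its defaults are assumptions):

    library(car)
    confidenceEllipse(m.full, which.coef = c("Tax", "Dlic"), levels = 0.95)
    abline(v = confint(m.full)["Tax", ], lty = 3)   # dotted: individual CI for Tax
    abline(h = confint(m.full)["Dlic", ], lty = 3)  # dotted: individual CI for Dlic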

Page 48: III. Inference in regression

Prediction

• Given a new set of predictor values x1, x2, …, xp, what’s the predicted response?

• It’s easy to answer this: Just plug the new predictors into the fitted regression model:

Ŷ = β̂0 + β̂1x1 + … + β̂pxp

• But how do we assess the uncertainty in the prediction? How do we form a confidence interval?

Page 49: III. Inference in regression

Predicted Values for New Observations

New Obs     Fit  SE Fit            95% CI            95% PI
      1  613.39   12.44  (588.34, 638.44)  (480.39, 746.39)

Values of Predictors for New Observations

New Obs  Dlic  Income  logMiles   Tax
      1   900    28.0      15.0  17.0

Prediction interval for the fuel consumption for a state with Dlic=900, Income = 28, logMiles=15, and Tax = 17

Confidence interval for the average fuel consumption for states with Dlic = 900, Income = 28, logMiles=15, and Tax = 17
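In R both intervals come from predict(), continuing the hypothetical fuel model:

    new <- data.frame(Dlic = 900, Income = 28, logMiles = 15, Tax = 17)
    predict(m.full, new, interval = "confidence")  # CI for the mean response
    predict(m.full, new, interval = "prediction")  # PI for one new state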

Page 50: IV. Regression Diagnostics

Diagnostics

• Want to look for points that have a large influence on the fitted model

• Want to look for evidence that one or more model assumptions are untrue.

• Tools:
  – Residuals
  – Leverage
  – Influence and Cook’s Distance

Page 51: IV. Regression Diagnostics

Leverage

• A point whose predictor values are far from the “typical” predictor values has high leverage.

• For a high leverage point, the fitted value Ŷi will be close to the data value Yi.

• A rule of thumb: Any point with leverage larger than 2(p+1)/n is interesting.

• Most statistical packages can compute leverages.

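In R, leverages are the hat values; a sketch using the hypothetical fuel model:

    h <- hatvalues(m.full)
    p <- length(coef(m.full)) - 1  # number of predictors
    n <- nobs(m.full)
    which(h > 2 * (p + 1) / n)     # the "interesting" high-leverage points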

Page 52: IV. Regression Diagnostics

[Figure: Scatterplot of y3 vs. x1 with each point labeled by its leverage; values range from 0.090909 to 0.318182]

Page 53: IV. Regression Diagnostics

[Figure: Scatterplot of leverage vs. observation index for the 50 countries in the savings data, each point labeled by country, with a reference line at leverage 0.2]

Page 54: IV. Regression Diagnostics

Influential Observations

• A data point is influential if it has a large effect on the fitted model.

• Put another way, an observation is influential if the fitted model will change a lot if the observation is deleted.

• Cook’s Distance is a measure of the influence of an observation.

• It may make sense to refit the model after removing a few of the most influential observations.
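A sketch in R, again with the hypothetical fuel model:

    d <- cooks.distance(m.full)
    head(sort(d, decreasing = TRUE))                 # most influential observations
    m.sub <- update(m.full, subset = -which.max(d))  # refit without the worst one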

Page 55: IV. Regression Diagnostics

[Figure: Scatterplot of y3 vs. x1 with each point labeled by its Cook's Distance (a measure of influence); one labeled point has high leverage but low influence, while the point with Cook's Distance 1.39285 has high influence]

Page 56: IV. Regression Diagnostics

[Figure: Scatterplot of Cook's Distance vs. observation index for the 50 countries in the savings data, each point labeled by country]

Page 57: V. Model Selection

Model Selection

• Question: With a large number of potential predictors, how do we choose the predictors to include in the model?

• Want good prediction, but parsimony: Occam’s Razor.

• Also can be thought of as a bias-variance tradeoff.

Page 58: V. Model Selection

Model Selection Example

• Data on all 50 states, from the 1970s:
  – Life.Exp = Life expectancy (response)
  – Population (in thousands)
  – Income = per-capita income
  – Illiteracy (in percent of population)
  – Murder = murder rate per 100,000
  – HS.Grad (in percent of population)
  – Frost = mean # days with min. temp < 32F
  – Area = land area in square miles

Page 59: V. Model Selection

Forward Selection

• Choose a cutoff α

• Start with no predictors

• At each step, add the predictor with the lowest p-value less than α

• Continue until there are no unused predictors with p-values less than α
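R's built-in step() selects by AIC rather than a p-value cutoff, but it gives the flavor of forward selection; a sketch using a data frame built from R's state.x77, whose converted column names match the slide:

    states <- data.frame(state.x77)  # names become Life.Exp, HS.Grad, etc.
    null <- lm(Life.Exp ~ 1, data = states)
    full <- lm(Life.Exp ~ ., data = states)
    step(null, scope = formula(full), direction = "forward")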

Page 60: V. Model Selection

Stepwise Regression: Life.Exp versus Population, Income, ...

Forward selection. Alpha-to-Enter: 0.25

Response is Life.Exp on 7 predictors, with N = 50

Step             1        2        3        4
Constant     72.97    70.30    71.04    71.03

Murder      -0.284   -0.237   -0.283   -0.300
T-Value      -8.66    -6.72    -7.71    -8.20
P-Value      0.000    0.000    0.000    0.000

HS.Grad               0.044    0.050    0.047
T-Value                2.72     3.29     3.14
P-Value               0.009    0.002    0.003

Frost                        -0.0069  -0.0059
T-Value                        -2.82    -2.46
P-Value                        0.007    0.018

Population                            0.00005
T-Value                                  2.00
P-Value                                 0.052

S            0.847    0.796    0.743    0.720
R-Sq         60.97    66.28    71.27    73.60
R-Sq(adj)    60.16    64.85    69.39    71.26
Mallows Cp    16.1      9.7      3.7      2.0

Page 61: V. Model Selection

Variations on FS

• Backward elimination:
  – Choose cutoff α
  – Start with all predictors in the model
  – Eliminate the predictor with the highest p-value that is greater than α
  – Etc.

• Stepwise: Allow addition or elimination at each step (hybrid of FS and BE)
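The same step() call covers these variations, reusing the fits above:

    step(full, direction = "backward")                     # backward elimination
    step(null, scope = formula(full), direction = "both")  # stepwise (hybrid)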

Page 62: V. Model Selection

All subsets

• Fit all possible models.

• Based on a “goodness” criterion, choose the model that fits best.

• Goodness criteria include AIC, BIC, Adjusted R2, and Mallows' Cp

• Some of the criteria will be described next
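In R, an all-subsets search is available through regsubsets() in the leaps package (a sketch; the package is an assumption), continuing with the states data frame:

    library(leaps)
    all7 <- regsubsets(Life.Exp ~ ., data = states, nvmax = 7)
    s <- summary(all7)
    cbind(s$which, adjR2 = s$adjr2, Cp = s$cp, BIC = s$bic)  # criteria by model size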

Page 63: V. Model Selection

Notation

• RSS* = Resid. Sum of Squares for the current model

• p* = Number of terms (including intercept) in the current model

• n = number of observations

• s2 = RSS/(n-(p+1)) = Estimate of σ2 from model with all predictors and intercept term.

Page 64: V. Model Selection

Goodness criteria

• Smaller is better for AIC, BIC, Cp*. Larger is better for adjR2

• AIC = n log(RSS*/n) + 2p*

• BIC = n log(RSS*/n) + p* log(n)

• Cp* = RSS*/s2 + 2p* − n

• adjR2 = 1 − [(n − 1)/(n − p*)] (1 − R2)

Page 65: V. Model Selection

Best Subsets Regression: Life.Exp versus Population, Income, ...

Response is Life.Exp

Vars   R-Sq   R-Sq(adj)   Mallows Cp        S
   1   61.0        60.2         16.1  0.84732
   2   66.3        64.8          9.7  0.79587
   3   71.3        69.4          3.7  0.74267
   4   73.6        71.3          2.0  0.71969
   5   73.6        70.6          4.0  0.72773
   6   73.6        69.9          6.0  0.73608
   7   73.6        69.2          8.0  0.74478

(In the original output, an X column for each of Population, Income, Illiteracy, Murder, HS.Grad, Frost, and Area marks which predictors are in each model; the best 1- to 4-variable models follow the forward-selection path Murder, HS.Grad, Frost, Population.)

Page 66: V. Model Selection

Model selection can overstate significance

• Generate Y and X1, X2, …, X50

• All are independent and standard normal.

• So none of the predictors are related to the response.

• Fit the full model and look at the overall F test.

• Use model selection to choose a “good” smaller model, and look at its overall F test
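A sketch of the simulation in R (the exact numbers depend on the seed):

    set.seed(1)
    n <- 100
    dat <- data.frame(y = rnorm(n), matrix(rnorm(n * 50), n, 50))  # X1, ..., X50
    m50 <- lm(y ~ ., data = dat)
    summary(m50)  # overall F test: not significant
    # Forward selection at alpha = 0.05 then picks a few predictors by chance;
    # the chosen small model's overall F test typically looks highly significant.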

Page 67: V. Model Selection

The full model

• Results from fitting model with all 50 predictors

• Note that the F test is not significant

• S = 0.915237 R-Sq = 57.6% R-Sq(adj) = 14.3%

Analysis of Variance

Source           DF       SS      MS     F      P
Regression       50  55.7093  1.1142  1.33  0.160
Residual Error   49  41.0453  0.8377
Total            99  96.7546

Page 68: V. Model Selection

The “good” small model

• Run FS with α = 0.05.
• Predictors x38, x41, and x24 are chosen.
• Fit that three-predictor model. Now the F test is highly significant.

Analysis of Variance

Source           DF       SS      MS     F      P
Regression        3  20.9038  6.9679  8.82  0.000
Residual Error   96  75.8508  0.7901
Total            99  96.7546

Page 69:

What’s left?

• Weighted least squares

• Tests for lack of fit

• Transformations of response and predictors

• Analysis of Covariance

• Etc.