
Page 1:

Applied Linear Regression

CSTAT Workshop, March 16, 2007

Vince Melfi

Page 2:

References

• “Applied Linear Regression,” Third Edition by Sanford Weisberg.

• “Linear Models with R,” by Julian Faraway.

• Countless other books on Linear Regression, statistical software, etc.

Page 3:

Statistical Packages

• Minitab (we’ll use this today)

• SPSS

• SAS

• R

• S-PLUS

• JMP

• ETC!!

Page 4:

Outline

I. Simple linear regression review

II. Multiple Regression: Adding predictors

III. Inference in Regression

IV. Regression Diagnostics

V. Model Selection

Page 5: I. Simple Linear Regression Review

Savings Rate Data

Data on Savings Rate and other variables for 50 countries. Want to explore the effect of variables on savings rate.

• SaveRate: Aggregate Personal Savings divided by disposable personal income. (Response variable.)

• Pop>75: Percent of the population over 75 years old. (One of the predictors.)

Page 6: I. Simple Linear Regression Review

[Figure: Scatterplot of SaveRate vs. pop>75]

Page 7: I. Simple Linear Regression Review

Regression Output

The regression equation is
SaveRate = 7.152 + 1.099 pop>75

S = 4.29409 R-Sq = 10.0% R-Sq(adj) = 8.1%

Analysis of Variance

Source      DF       SS       MS     F      P
Regression   1   98.545  98.5454  5.34  0.025
Error       48  885.083  18.4392
Total       49  983.628

(Slide callouts: the regression equation is the fitted model; R-Sq is R2, the coefficient of determination; the ANOVA F test is testing the model.)
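The workshop uses Minitab, but the same fit is easy to reproduce elsewhere. A minimal R sketch, assuming a hypothetical data frame savings with columns SaveRate and pop75 (R variable names can't contain ">"):

    fit <- lm(SaveRate ~ pop75, data = savings)  # hypothetical data frame
    summary(fit)  # coefficients, S (residual standard error), R-Sq
    anova(fit)    # the analysis of variance table shown above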

Page 8:

Importance of Plots

• Four data sets

• All have:
  – Regression line Y = 3 + 0.5 x
  – R2 = 66.7%
  – S = 1.24
  – Same t statistics, etc.

• Without looking at plots, the four data sets would seem similar.
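These four data sets appear to be Anscombe's quartet, which ships with R as the built-in anscombe data frame, so the near-identical fits are easy to verify:

    data(anscombe)
    for (i in 1:4) {
      f <- lm(as.formula(paste0("y", i, " ~ x", i)), data = anscombe)
      print(coef(f))  # each fit is approximately intercept 3, slope 0.5
    }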

Page 9: I. Simple Linear Regression Review

Importance of Plots (1)

[Figure: Fitted line plot, y1 = 3.000 + 0.5001 x1; S = 1.23660, R-Sq = 66.7%, R-Sq(adj) = 62.9%]

Page 10: I. Simple Linear Regression Review

Importance of Plots (2)

[Figure: Fitted line plot, y2 = 3.001 + 0.5000 x1; S = 1.23721, R-Sq = 66.6%, R-Sq(adj) = 62.9%]

Page 11: I. Simple Linear Regression Review

Importance of Plots (3)

[Figure: Fitted line plot, y3 = 3.002 + 0.4997 x1; S = 1.23631, R-Sq = 66.6%, R-Sq(adj) = 62.9%]

Page 12: I. Simple Linear Regression Review

Importance of Plots (4)

[Figure: Fitted line plot, y4 = 3.002 + 0.4999 x2; S = 1.23570, R-Sq = 66.7%, R-Sq(adj) = 63.0%]

Page 13: I. Simple Linear Regression Review

The model

• Yi = β0 + β1xi + ei, for i = 1, 2, …, n

• “Errors” e1, e2, …, en are assumed to be independent.

• Usually e1, e2, …, en are assumed to have the same standard deviation, σ.

• Often e1, e2, …, en are assumed to be normally distributed.

Page 14: I. Simple Linear Regression Review

Least Squares

• The regression line (line of best fit) is based on “least squares.”

• The regression line is the line that minimizes the sum of squared vertical deviations of the data from the line.

• The least squares line has certain optimality properties.

• The least squares line is denoted Ŷi = β̂0 + β̂1Xi

Page 15: I. Simple Linear Regression Review

Residuals

• The residuals represent the difference between the data and the least squares line:

êi = Yi − Ŷi

[Figure: Scatterplot of Y vs. X with the least squares line; the residuals are the vertical distances from the points to the line]

Page 16: I. Simple Linear Regression Review

Checking assumptions

• Residuals are the main tool for checking model assumptions, including linearity and constant variance.

• Plotting the residuals versus the fitted values is always a good idea, to check linearity and constant variance.

• Histograms and Q-Q plots (normal probability plots) of residuals can help to check the normality assumption.
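A rough R analogue of these diagnostic plots, reusing the hypothetical fit object from earlier:

    par(mfrow = c(2, 2))
    plot(fit)             # includes residuals vs. fitted and a normal Q-Q plot
    hist(residuals(fit))  # histogram of residuals, for the normality check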

Pages 17-20: I. Simple Linear Regression Review

[No slide text was extracted for pages 17-20; figures only]

Page 21: I. Simple Linear Regression Review

[Figure: Residual Plots for SaveRate, Minitab's "four in one" plot: normal probability plot of residuals, residuals versus fits, histogram of residuals, and residuals versus observation order]

Page 22: I. Simple Linear Regression Review

Coefficient of determination (R2)

Residual sum of squares, aka sum of squares for error:

RSS = SSE = Σi êi²

Total sum of squares:

TSS = SST = Σi (yi − ȳ)²

Coefficient of determination:

R² = 1 − RSS/TSS
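These quantities are easy to compute directly; a sketch continuing the earlier hypothetical savings example:

    e   <- residuals(fit)
    RSS <- sum(e^2)                                            # residual sum of squares
    TSS <- sum((savings$SaveRate - mean(savings$SaveRate))^2)  # total sum of squares
    1 - RSS / TSS                                              # R-squared; matches summary(fit)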

Page 23: I. Simple Linear Regression Review

R2

• The coefficient of determination, R2, measures the proportion of the variability in Y that is explained by the linear relationship with X.

• It’s also the square of the Pearson correlation coefficient between X and Y.

Page 24: II. Multiple regression: Adding predictors

Adding a predictor

• Recall: The fitted model was SaveRate = 7.152 + 1.099 pop>75 (the p-value for the test of whether pop>75 is significant was 0.025).

• Another predictor: DPI (per-capita disposable income)

• Fitted model: SaveRate = 8.57 + 0.000996 DPI (p-value for DPI: 0.124)

Page 25: II. Multiple regression: Adding predictors

Adding a predictor (2)

• Model with both pop>75 and DPI is SaveRate = 7.06 + 1.30 pop>75 - 0.00034 DPI

• p-values are 0.100 and 0.738 for pop>75 and DPI

• The sign of the coefficient of DPI has changed!

• pop>75 was significant alone, but neither it nor DPI is significant when both are in the model!
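A sketch of the three fits in R, assuming the hypothetical savings data frame also has a DPI column:

    summary(lm(SaveRate ~ pop75, data = savings))        # pop75 alone: p = 0.025
    summary(lm(SaveRate ~ DPI, data = savings))          # DPI alone: p = 0.124
    summary(lm(SaveRate ~ pop75 + DPI, data = savings))  # together: neither significant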

Page 26: II. Multiple regression: Adding predictors

Adding a predictor (3)

[Figure: Fitted line plot, pop>75 = 1.158 + 0.001025 DPI; S = 0.804599, R-Sq = 61.9%, R-Sq(adj) = 61.1%]

• What happened??

• The predictors pop>75 and DPI are highly correlated.

Page 27: II. Multiple regression: Adding predictors

Added variable plots and partial correlation

1. Residuals from a fit of SaveRate versus pop>75 give the variability in SaveRate that’s not explained by pop>75.

2. Residuals from a fit of DPI versus pop>75 give the variability in DPI that’s not explained by pop>75.

3. A fit of the residuals from (1) versus the residuals from (2) gives the relationship between SaveRate and DPI after adjusting for pop>75. This is called an “added variable plot.”

4. The correlation between the residuals from (1) and the residuals from (2) is the “partial correlation” between SaveRate and DPI adjusted for pop>75.
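The four steps translate directly into R (same hypothetical savings data frame as before):

    r1 <- residuals(lm(SaveRate ~ pop75, data = savings))  # step 1
    r2 <- residuals(lm(DPI ~ pop75, data = savings))       # step 2
    plot(r2, r1)          # step 3: the added variable plot
    coef(lm(r1 ~ r2))[2]  # slope equals DPI's coefficient in the two-predictor model
    cor(r1, r2)           # step 4: partial correlation of SaveRate and DPI given pop>75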

Page 28: II. Multiple regression: Adding predictors

Added variable plot

[Figure: Added variable plot (fitted line plot), RESSRvspop>75 = 0.0000 − 0.000341 RESDPIvspop>75; S = 4.28891, R-Sq = 0.2%, R-Sq(adj) = 0.0%]

Note that the slope term, −0.000341, is the same as the slope term for DPI in the two-predictor model.

Page 29: II. Multiple regression: Adding predictors

Scatterplot matrices (Matrix Plots)

• With one predictor X, a scatterplot of Y vs. X is very informative.

• With more than one predictor, scatterplots of Y vs. each of the predictors, and of each of the predictors vs. each other, are needed.

• A scatterplot matrix (or matrix plot) is just an organized display of these plots.
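In R a single call draws the whole matrix (column names are hypothetical):

    pairs(savings[, c("SaveRate", "pop15", "pop75", "DPI", "changeDPI")])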

Page 30: II. Multiple regression: Adding predictors

[Figure: Matrix plot (scatterplot matrix) of SaveRate, pop<15, pop>75, DPI, and changeDPI]

Page 31: II. Multiple regression: Adding predictors

Changes in R2

• Consider adding a predictor X2 to a model that already contains the predictor X1

• Let R2,1 be the R2 value for the fit of Y vs. X1, and let R2,2 be the R2 value for the fit of Y vs. X2

Page 32: II. Multiple regression: Adding predictors

Changes in R2 (2)

• The R2 value for the multiple regression fit is always at least as large as R2,1 and R2,2

• The R2 value for the multiple regression fit of Y versus X1 and X2 may be:
  – less than R2,1 + R2,2 (if the two predictors explain the same variation)
  – equal to R2,1 + R2,2 (if the two predictors measure different things)
  – more than R2,1 + R2,2 (e.g., the response is the area of a rectangle, and the two predictors are length and width)

Page 33: II. Multiple regression: Adding predictors

Multiple regression model

• Response variable Y

• Predictors X1, X2, …, Xp

Yi = β0 + β1Xi1 + β2Xi2 + … + βpXip + ei

• Same assumptions on the errors ei (independent, constant variance, normality)

Page 34: III. Inference in regression

Inference in regression

• Most inference procedures assume independence, constant variance, and normality of the errors.

• Most are “robust” to departures from normality, meaning that the p-values, confidence levels, etc. are approximately correct even if normality does not hold.

• In general, techniques like the bootstrap can be used when normality is suspect.

Page 35: III. Inference in regression

New data set

• Response variable:
  – Fuel = per-capita fuel consumption (times 1000)

• Predictors:
  – Dlic = proportion of the population who are licensed drivers (times 1000)
  – Tax = gasoline tax rate
  – Income = per-person income in thousands of dollars
  – logMiles = base 2 log of federal-aid highway miles in the state

Page 36: III. Inference in regression

t tests

Regression Analysis: Fuel versus Tax, Dlic, Income, logMiles

The regression equation is
Fuel = 154 - 4.23 Tax + 0.472 Dlic - 6.14 Income + 18.5 logMiles

Predictor      Coef  SE Coef      T      P
Constant      154.2    194.9   0.79  0.433
Tax          -4.228    2.030  -2.08  0.043
Dlic         0.4719   0.1285   3.67  0.001
Income       -6.135    2.194  -2.80  0.008
logMiles     18.545    6.472   2.87  0.006

(The T column gives the t statistics; the P column gives the p-values.)
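A hedged R version of the same fit, assuming a data frame fuel with these column names:

    m.full <- lm(Fuel ~ Tax + Dlic + Income + logMiles, data = fuel)
    summary(m.full)  # each T is Coef / SE Coef; each P is the two-sided p-value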

Page 37: III. Inference in regression

t tests (2)

• The t statistic tests the hypothesis that a particular slope parameter is zero.

• The formula is

t = (coefficient estimate)/(standard error)

• degrees of freedom are n-(p+1)

• p-values given are for the two-sided alternative

• This is like simple linear regression

Page 38: III. Inference in regression

F tests

• General structure:
  – Ha: Large model
  – H0: Smaller model, obtained by setting some parameters in the large model to zero, or equal to each other, or equal to a constant
  – RSSAH = resid. sum of squares after fitting the large (alt. hypothesis) model
  – RSSNH = resid. sum of squares after fitting the smaller (null hypothesis) model
  – dfNH and dfAH are the corresponding degrees of freedom

Page 39: III. Inference in regression

F tests (2)

• Test statistic:

F = [(RSSNH − RSSAH) / (dfNH − dfAH)] / (RSSAH / dfAH)

• Null distribution: F distribution with dfNH − dfAH numerator and dfAH denominator degrees of freedom

Page 40: III. Inference in regression

F test example

• Can the “economic” variables Tax and Income be dropped from the model with all four predictors?

• AH model includes all predictors

• NH model includes only Dlic and logMiles

• Fit both models and get RSS and df values

Page 41: III. Inference in regression

F test example (2)

• RSSAH = 193700; dfAH = 46

• RSSNH = 243006; dfNH = 48

F = [(243006 − 193700) / (48 − 46)] / (193700 / 46) = 5.85

• P-value is the area to the right of 5.85 under an F(2,46) distribution, approx. 0.0054.

• There’s pretty strong evidence that removing both Tax and Income is unwise.
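In R, the same F test comes from fitting both models and comparing them with anova(), continuing the hypothetical fuel example:

    m.NH <- lm(Fuel ~ Dlic + logMiles, data = fuel)  # null-hypothesis model
    anova(m.NH, m.full)  # F = 5.85 on (2, 46) df, p ≈ 0.005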

Page 42: III. Inference in regression

Another F test example

• Question: Does it make sense that the two “economic” predictors should have the same coefficient?

• Ha: Y = β0 + β1 Tax + β2 Dlic + β3 Income + β4 logMiles + error

• H0: Y = β0 + β1 Tax + β2 Dlic + β1 Income + β4 logMiles + error

• Note: H0 can be rewritten as Y = β0 + β1 (Tax + Income) + β2 Dlic + β4 logMiles + error

Page 43: III. Inference in regression

Another F test example (2)

• Fit the full model (AH).

• Create a new predictor “TI” by adding Tax and Income, and fit a model with TI, Dlic, and logMiles (NH).

F = [(195487 − 193700) / (47 − 46)] / (193700 / 46) = 0.424

• P-value is the area to the right of 0.424 under an F(1,46) distribution, approx. 0.518.

• This suggests that the simpler model with the same coefficient for Tax and Income fits well.
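In R, the equal-coefficient constraint can be imposed directly in the formula with I(), under the same assumptions:

    m.TI <- lm(Fuel ~ I(Tax + Income) + Dlic + logMiles, data = fuel)
    anova(m.TI, m.full)  # F ≈ 0.42 on (1, 46) df, p ≈ 0.52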

Page 44: III. Inference in regression

Removing one predictor

• We have two ways to test whether one predictor can be removed from the model:– t test– F test

• The tests are equivalent, in the sense that t2 = F, and that the p-values will be equivalent.

Page 45: III. Inference in regression

Confidence regions

• Confidence intervals for one parameter use the familiar t-interval.

• For example, to form a 95% confidence interval for the parameter of Income in the context of the full (four predictor) model:

• -6.135 ± (2.013)(2.194) = -6.135 ± 4.417.

(The estimate −6.135 and standard error 2.194 come from the Minitab output; the multiplier 2.013 comes from the t distribution with 46 df.)
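A sketch of the same interval in R:

    qt(0.975, df = 46)         # ≈ 2.013, the t multiplier used above
    confint(m.full, "Income")  # the 95% interval computed directly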

Page 46: III. Inference in regression

Joint confidence regions

• Joint confidence regions for two or more parameters are more complex, and use the F distribution in place of the t distribution.

• Minitab (and SPSS, and …) can’t draw these easily

• On the next page is a joint confidence region for the parameters of Dlic and Tax, drawn in R.

Page 47: III. Inference in regression

[Figure: Joint confidence region for the parameters of Dlic and Tax, drawn in R, with dotted lines indicating the individual confidence intervals for the two; the point (0,0) and the boundary of the confidence region are marked]
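One way to draw such a region is with confidenceEllipse() from R's car package (a sketch; the package and its defaults are assumptions):

    library(car)
    confidenceEllipse(m.full, which.coef = c("Tax", "Dlic"), levels = 0.95)
    abline(v = confint(m.full)["Tax", ], lty = 3)   # dotted: individual CI for Tax
    abline(h = confint(m.full)["Dlic", ], lty = 3)  # dotted: individual CI for Dlic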

Page 48: III. Inference in regression

Prediction

• Given a new set of predictor values x1, x2, …, xp, what’s the predicted response?

• It’s easy to answer this: Just plug the new predictors into the fitted regression model:

Ŷ = β̂0 + β̂1x1 + … + β̂pxp

• But how do we assess the uncertainty in the prediction? How do we form a confidence interval?

Page 49: III. Inference in regression

Predicted Values for New Observations

New Obs     Fit  SE Fit            95% CI            95% PI
      1  613.39   12.44  (588.34, 638.44)  (480.39, 746.39)

Values of Predictors for New Observations

New Obs  Dlic  Income  logMiles   Tax
      1   900    28.0      15.0  17.0

Prediction interval for the fuel consumption for a state with Dlic=900, Income = 28, logMiles=15, and Tax = 17

Confidence interval for the average fuel consumption for states with Dlic = 900, Income = 28, logMiles=15, and Tax = 17
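In R both intervals come from predict(), continuing the hypothetical fuel model:

    new <- data.frame(Dlic = 900, Income = 28, logMiles = 15, Tax = 17)
    predict(m.full, new, interval = "confidence")  # CI for the mean response
    predict(m.full, new, interval = "prediction")  # PI for one new state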

Page 50: IV. Regression Diagnostics

Diagnostics

• Want to look for points that have a large influence on the fitted model

• Want to look for evidence that one or more model assumptions are untrue.

• Tools:
  – Residuals
  – Leverage
  – Influence and Cook’s Distance

Page 51: IV. Regression Diagnostics

Leverage

• A point whose predictor values are far from the “typical” predictor values has high leverage.

• For a high leverage point, the fitted value Ŷi will be close to the data value Yi.

• A rule of thumb: Any point with leverage larger than 2(p+1)/n is interesting.

• Most statistical packages can compute leverages.

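In R, leverages are the hat values; a sketch using the hypothetical fuel model:

    h <- hatvalues(m.full)
    p <- length(coef(m.full)) - 1  # number of predictors
    n <- nobs(m.full)
    which(h > 2 * (p + 1) / n)     # the "interesting" high-leverage points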

Page 52: IV. Regression Diagnostics

[Figure: Scatterplot of y3 vs. x1 with each point labeled by its leverage; values range from 0.090909 to 0.318182]

Page 53: IV. Regression Diagnostics

[Figure: Scatterplot of leverage vs. observation index for the 50 countries in the savings data, each point labeled by country, with a reference line at leverage 0.2]

Page 54: IV. Regression Diagnostics

Influential Observations

• A data point is influential if it has a large effect on the fitted model.

• Put another way, an observation is influential if the fitted model will change a lot if the observation is deleted.

• Cook’s Distance is a measure of the influence of an observation.

• It may make sense to refit the model after removing a few of the most influential observations.
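A sketch in R, again with the hypothetical fuel model:

    d <- cooks.distance(m.full)
    head(sort(d, decreasing = TRUE))                 # most influential observations
    m.sub <- update(m.full, subset = -which.max(d))  # refit without the worst one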

Page 55: IV. Regression Diagnostics

[Figure: Scatterplot of y3 vs. x1 with each point labeled by its Cook's Distance (a measure of influence); one labeled point has high leverage but low influence, while the point with Cook's Distance 1.39285 has high influence]

Page 56: IV. Regression Diagnostics

[Figure: Scatterplot of Cook's Distance vs. observation index for the 50 countries in the savings data, each point labeled by country]

Page 57: V. Model Selection

Model Selection

• Question: With a large number of potential predictors, how do we choose the predictors to include in the model?

• Want good prediction, but parsimony: Occam’s Razor.

• Also can be thought of as a bias-variance tradeoff.

Page 58: V. Model Selection

Model Selection Example

• Data on all 50 states, from the 1970s:
  – Life.Exp = Life expectancy (response)
  – Population (in thousands)
  – Income = per-capita income
  – Illiteracy (in percent of population)
  – Murder = murder rate per 100,000
  – HS.Grad (in percent of population)
  – Frost = mean # days with min. temp < 32F
  – Area = land area in square miles

Page 59: V. Model Selection

Forward Selection

• Choose a cutoff α

• Start with no predictors

• At each step, add the predictor with the lowest p-value less than α

• Continue until there are no unused predictors with p-values less than α
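R's built-in step() selects by AIC rather than a p-value cutoff, but it gives the flavor of forward selection; a sketch using a data frame built from R's state.x77, whose converted column names match the slide:

    states <- data.frame(state.x77)  # names become Life.Exp, HS.Grad, etc.
    null <- lm(Life.Exp ~ 1, data = states)
    full <- lm(Life.Exp ~ ., data = states)
    step(null, scope = formula(full), direction = "forward")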

Page 60: V. Model Selection

Stepwise Regression: Life.Exp versus Population, Income, ...

Forward selection. Alpha-to-Enter: 0.25

Response is Life.Exp on 7 predictors, with N = 50

Step             1        2        3        4
Constant     72.97    70.30    71.04    71.03

Murder      -0.284   -0.237   -0.283   -0.300
T-Value      -8.66    -6.72    -7.71    -8.20
P-Value      0.000    0.000    0.000    0.000

HS.Grad               0.044    0.050    0.047
T-Value                2.72     3.29     3.14
P-Value               0.009    0.002    0.003

Frost                        -0.0069  -0.0059
T-Value                        -2.82    -2.46
P-Value                        0.007    0.018

Population                            0.00005
T-Value                                  2.00
P-Value                                 0.052

S            0.847    0.796    0.743    0.720
R-Sq         60.97    66.28    71.27    73.60
R-Sq(adj)    60.16    64.85    69.39    71.26
Mallows Cp    16.1      9.7      3.7      2.0

Page 61: V. Model Selection

Variations on FS

• Backward elimination:
  – Choose cutoff α
  – Start with all predictors in the model
  – Eliminate the predictor with the highest p-value that is greater than α
  – Etc.

• Stepwise: Allow addition or elimination at each step (hybrid of FS and BE)
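The same step() call covers these variations, reusing the fits above:

    step(full, direction = "backward")                     # backward elimination
    step(null, scope = formula(full), direction = "both")  # stepwise (hybrid)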

Page 62: V. Model Selection

All subsets

• Fit all possible models.

• Based on a “goodness” criterion, choose the model that fits best.

• Goodness criteria include AIC, BIC, Adjusted R2, and Mallows' Cp

• Some of the criteria will be described next
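In R, an all-subsets search is available through regsubsets() in the leaps package (a sketch; the package is an assumption), continuing with the states data frame:

    library(leaps)
    all7 <- regsubsets(Life.Exp ~ ., data = states, nvmax = 7)
    s <- summary(all7)
    cbind(s$which, adjR2 = s$adjr2, Cp = s$cp, BIC = s$bic)  # criteria by model size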

Page 63: V. Model Selection

Notation

• RSS* = Resid. Sum of Squares for the current model

• p* = Number of terms (including intercept) in the current model

• n = number of observations

• s2 = RSS/(n-(p+1)) = Estimate of σ2 from model with all predictors and intercept term.

Page 64: V. Model Selection

Goodness criteria

• Smaller is better for AIC, BIC, Cp*. Larger is better for adjR2

• AIC = n log(RSS*/n) + 2p*

• BIC = n log(RSS*/n) + p* log(n)

• Cp* = RSS*/s2 + 2p* − n

• adjR2 = 1 − [(n − 1)/(n − p*)] (1 − R2)

Page 65: V. Model Selection

Best Subsets Regression: Life.Exp versus Population, Income, ...

Response is Life.Exp

Vars   R-Sq   R-Sq(adj)   Mallows Cp        S
   1   61.0        60.2         16.1  0.84732
   2   66.3        64.8          9.7  0.79587
   3   71.3        69.4          3.7  0.74267
   4   73.6        71.3          2.0  0.71969
   5   73.6        70.6          4.0  0.72773
   6   73.6        69.9          6.0  0.73608
   7   73.6        69.2          8.0  0.74478

(In the original output, an X column for each of Population, Income, Illiteracy, Murder, HS.Grad, Frost, and Area marks which predictors are in each model; the best 1- to 4-variable models follow the forward-selection path Murder, HS.Grad, Frost, Population.)

Page 66: V. Model Selection

Model selection can overstate significance

• Generate Y and X1, X2, …, X50

• All are independent and standard normal.

• So none of the predictors are related to the response.

• Fit the full model and look at the overall F test.

• Use model selection to choose a “good” smaller model, and look at its overall F test
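A sketch of the simulation in R (the exact numbers depend on the seed):

    set.seed(1)
    n <- 100
    dat <- data.frame(y = rnorm(n), matrix(rnorm(n * 50), n, 50))  # X1, ..., X50
    m50 <- lm(y ~ ., data = dat)
    summary(m50)  # overall F test: not significant
    # Forward selection at alpha = 0.05 then picks a few predictors by chance;
    # the chosen small model's overall F test typically looks highly significant.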

Page 67: V. Model Selection

The full model

• Results from fitting model with all 50 predictors

• Note that the F test is not significant

• S = 0.915237 R-Sq = 57.6% R-Sq(adj) = 14.3%

Analysis of Variance

Source           DF       SS      MS     F      P
Regression       50  55.7093  1.1142  1.33  0.160
Residual Error   49  41.0453  0.8377
Total            99  96.7546

Page 68: V. Model Selection

The “good” small model

• Run FS with α = 0.05.
• Predictors x38, x41, and x24 are chosen.
• Fit that three-predictor model. Now the F test is highly significant.

Analysis of Variance

Source           DF       SS      MS     F      P
Regression        3  20.9038  6.9679  8.82  0.000
Residual Error   96  75.8508  0.7901
Total            99  96.7546

Page 69:

What’s left?

• Weighted least squares

• Tests for lack of fit

• Transformations of response and predictors

• Analysis of Covariance

• Etc.