new chapter 11: simple linear regression (slr) and...

Chapter 11: SIMPLE LINEAR REGRESSION (SLR) AND CORRELATION

Part 2: Properties, Hypothesis tests, Model adequacy and assumptions

Sections 11-3, 11-4.1, 11-7

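• Recall the SLR estimates for $\beta_0$, $\beta_1$ and $\sigma^2$:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$$

$$\hat{\sigma}^2 = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2} = MSE$$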

• These estimators are unbiased estimators (a nice characteristic):

* $E[\hat{\beta}_0] = \beta_0$

* $E[\hat{\beta}_1] = \beta_1$

* $E[\hat{\sigma}^2] = \sigma^2$
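Unbiasedness can be illustrated with a small simulation: generate many data sets from a known line, fit each one, and average the estimates. The sketch below is only an illustration. The true values (β0 = 2, β1 = 3, σ² = 1), the x-values, and the helper names `fit_slr` and `simulate_means` are all assumptions made for this example.

```python
import random

def fit_slr(x, y):
    """Return (b0_hat, b1_hat, sigma2_hat) from the least squares formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return b0, b1, sse / (n - 2)

def simulate_means(beta0, beta1, sigma, x, reps, seed=0):
    """Average the three estimates over `reps` simulated data sets."""
    rng = random.Random(seed)
    totals = [0.0, 0.0, 0.0]
    for _ in range(reps):
        # Y_i = beta0 + beta1 * x_i + normal error with sd sigma
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        for k, est in enumerate(fit_slr(x, y)):
            totals[k] += est
    return tuple(t / reps for t in totals)

# Hypothetical truth: beta0 = 2, beta1 = 3, sigma^2 = 1, x = 0, 1, ..., 9
x_values = [float(i) for i in range(10)]
mean_b0, mean_b1, mean_s2 = simulate_means(2.0, 3.0, 1.0, x_values, reps=2000)
print(mean_b0, mean_b1, mean_s2)
```

The three averages should land close to 2, 3 and 1. Any single fit can miss, but the estimators are centered on the true values.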

What kind of variability do these least squares estimators have?

• Variance of $\hat{\beta}_1$ (a random variable):

$$Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sigma^2}{S_{xx}}$$

The variance of the estimated slope depends on... the variability of the errors $\sigma^2$, the amount of data $n$, the spread of the x-values.

Since we don't know $\sigma^2$, we'll plug in the estimate $\hat{\sigma}^2$ to get a usable value of the...

• Estimated standard error for $\hat{\beta}_1$:

$$se(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
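For readers who want to see where the slope variance comes from, here is a short derivation sketch. It assumes the $x_i$ are treated as fixed and the $Y_i$ are independent with common variance $\sigma^2$. The weights $c_i$ are notation introduced only for this sketch.

```latex
\begin{align*}
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n}(x_i-\bar{x})(Y_i-\bar{Y})}{S_{xx}}
              = \sum_{i=1}^{n} c_i Y_i,
              \qquad c_i = \frac{x_i-\bar{x}}{S_{xx}} \\
Var(\hat{\beta}_1) &= \sum_{i=1}^{n} c_i^2 \, Var(Y_i)
              = \sigma^2 \sum_{i=1}^{n} \frac{(x_i-\bar{x})^2}{S_{xx}^2}
              = \frac{\sigma^2}{S_{xx}}
\end{align*}
```

The first line uses $\sum_{i=1}^{n}(x_i-\bar{x})\bar{Y} = 0$, so the $\bar{Y}$ term drops out.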

• Variance of $\hat{\beta}_0$ (a random variable):

$$Var(\hat{\beta}_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)$$

The variance of the estimated intercept depends on... the error variability $\sigma^2$, the amount of data $n$, the spread of the x-values, AND how far the center of the x-values (i.e. $\bar{x}$) is from $x = 0$.

We have more precision (lower variability) for estimating $\beta_0$ when the data are near $x = 0$ (compared to being far from $x = 0$).

We'll plug in the estimate $\hat{\sigma}^2$ to get the...

• Estimated standard error for $\hat{\beta}_0$:

$$se(\hat{\beta}_0) = \sqrt{\hat{\sigma}^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)}$$
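The effect of the distance between $\bar{x}$ and 0 can be seen by evaluating the standard error formula for a few values of $\bar{x}$ while holding everything else fixed. Shifting every x-value by the same constant changes $\bar{x}$ but not $S_{xx}$. The numbers below are hypothetical, and `se_intercept` is a helper name made up for this sketch.

```python
import math

def se_intercept(sigma2_hat, n, x_bar, s_xx):
    """Estimated standard error of the intercept:
    sqrt(sigma2_hat * (1/n + x_bar^2 / S_xx))."""
    return math.sqrt(sigma2_hat * (1.0 / n + x_bar ** 2 / s_xx))

# Hypothetical values chosen only for illustration
sigma2_hat, n, s_xx = 4.0, 20, 50.0

# Same spread of x-values, but centered farther and farther from x = 0
for x_bar in (0.0, 5.0, 10.0):
    print(x_bar, round(se_intercept(sigma2_hat, n, x_bar, s_xx), 4))
```

The standard error grows as the center of the data moves away from 0, matching the remark above about precision near $x = 0$.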

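Hypothesis tests for $\beta_0$ and $\beta_1$

• For SLR, a common hypothesis test is the test for a linear relationship between X and Y.

$H_0: \beta_1 = 0$ (no linear relationship)
$H_1: \beta_1 \neq 0$

• Under the assumption $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, we have

$$\hat{\beta}_0 \sim N\left(\beta_0,\; \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)\right)$$

$$\hat{\beta}_1 \sim N\left(\beta_1,\; \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)$$

• Test of interest for the slope:

$H_0: \beta_1 = 0$ (no linear relationship)
$H_1: \beta_1 \neq 0$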

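• Since we will be estimating $\sigma^2$, we will use a t-statistic:

$$T_0 = \frac{\hat{\beta}_1 - 0}{se(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{\sqrt{\hat{\sigma}^2 / \sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

Under $H_0$ true, $T_0 \sim t_{n-2}$.

From our test statistic, we can compute a p-value for our hypothesis test on the slope.

• Test of interest for the intercept:

$H_0: \beta_0 = 0$ vs. $H_1: \beta_0 \neq 0$

The test statistic:

$$T_0 = \frac{\hat{\beta}_0 - 0}{se(\hat{\beta}_0)} = \frac{\hat{\beta}_0}{\sqrt{\hat{\sigma}^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)}}$$

Under $H_0$ true, $T_0 \sim t_{n-2}$.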
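The step from the normal distribution to the t distribution can be written side by side for the slope. This is a sketch of the standard argument under the normal-errors assumption stated earlier, not a full proof.

```latex
\begin{align*}
Z &= \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\sigma^2 / S_{xx}}} \sim N(0, 1)
  && \text{($\sigma^2$ known)} \\
T &= \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\hat{\sigma}^2 / S_{xx}}} \sim t_{n-2}
  && \text{($\sigma^2$ replaced by $\hat{\sigma}^2 = MSE$)}
\end{align*}
```

The $n - 2$ degrees of freedom match the divisor in $\hat{\sigma}^2 = SSE/(n-2)$. Setting $\beta_1 = 0$ in $T$ gives the statistic $T_0$ used for the slope test.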

• Example: Chloride concentration in Streams vs. Roadway area in watersheds (Problem 11-10 in book)

An article in the Journal of Environmental Engineering reported the results of a study on the occurrence of sodium and chloride in surface streams in central Rhode Island.

They found that watersheds with a larger percentage of the land in roadways tended to have higher chloride concentrations (mg/liter) in the streams.

[Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area in watershed (%)]

The data:

obs  PercRoadways  ChlorConc
  1          0.15        6.6
  2          0.19        4.4
  3          0.47       11.8
  4          0.57        9.7
  5          0.60       14.3
  6          0.63       10.9
  7          0.67       10.8
  8          0.69       19.2
  9          0.70       10.6
 10          0.70       12.1
 11          0.78       14.7
 12          0.78       17.3
 13          0.81       15.0
 14          1.05       27.4
 15          1.06       27.7
 16          1.30       23.1
 17          1.62       39.5
 18          1.74       31.8

n = 18

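Summary statistics:

$\sum_{i=1}^{n} x_i = 14.51$, $\bar{x} = 0.8061$
$\sum_{i=1}^{n} y_i = 306.9$, $\bar{y} = 17.05$
$\sum_{i=1}^{n} x_i^2 = 14.7073$
$\sum_{i=1}^{n} y_i^2 = 6727.13$
$\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x}) = 61.9205$
$\sum_{i=1}^{n}(x_i - \bar{x})^2 = 3.0106$

The regression coefficient estimates:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{61.9205}{3.0106} = 20.5675$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 17.05 - 20.5675(0.8061) = 0.4705$$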
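The summary statistics and coefficient estimates above can be reproduced directly from the 18 observations. This is a minimal sketch in plain Python, with variable names chosen for this example. Because the slides round intermediate values, the last digit of the slope may differ slightly (about 20.567 here).

```python
# Data from the table above:
# percent of watershed in roadways (x) and chloride concentration in mg/liter (y)
perc_roadways = [0.15, 0.19, 0.47, 0.57, 0.60, 0.63, 0.67, 0.69, 0.70,
                 0.70, 0.78, 0.78, 0.81, 1.05, 1.06, 1.30, 1.62, 1.74]
chlor_conc = [6.6, 4.4, 11.8, 9.7, 14.3, 10.9, 10.8, 19.2, 10.6,
              12.1, 14.7, 17.3, 15.0, 27.4, 27.7, 23.1, 39.5, 31.8]

n = len(perc_roadways)
x_bar = sum(perc_roadways) / n
y_bar = sum(chlor_conc) / n

# Corrected sums of squares and cross products
s_xx = sum((x - x_bar) ** 2 for x in perc_roadways)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(perc_roadways, chlor_conc))

b1_hat = s_xy / s_xx             # estimated slope
b0_hat = y_bar - b1_hat * x_bar  # estimated intercept

print(f"S_xx = {s_xx:.4f}, S_xy = {s_xy:.4f}")
print(f"b1_hat = {b1_hat:.4f}, b0_hat = {b0_hat:.4f}")
```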

To estimate $\sigma^2$, we need the residuals, which are denoted as $e_i = y_i - \hat{y}_i$.

To get the residuals $e_i = y_i - \hat{y}_i$, we first need the fitted values (or predicted values), denoted as $\hat{y}_i$...

$$\hat{y}_i = 0.4705 + 20.5675\,x_i$$

Above is the fitted model or fitted line.

Below, we add the residuals (RESI1) and fitted values (FITS1) to our data set...

$$\hat{\sigma}^2 = MSE = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2} = \frac{220.9472}{16} = 13.8092$$

and $\hat{\sigma} = \sqrt{13.8092} = 3.7161$

The fitted model: $\hat{y}_i = 0.4705 + 20.5675\,x_i$
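The fitted values, residuals, SSE and MSE above can be checked with a few lines of Python. The sketch plugs in the rounded fitted line from the slides, so results agree with the slide values up to rounding. The list names `fits` and `resid` are chosen here to mirror the FITS1 and RESI1 worksheet columns.

```python
import math

# Data from the chloride example
perc_roadways = [0.15, 0.19, 0.47, 0.57, 0.60, 0.63, 0.67, 0.69, 0.70,
                 0.70, 0.78, 0.78, 0.81, 1.05, 1.06, 1.30, 1.62, 1.74]
chlor_conc = [6.6, 4.4, 11.8, 9.7, 14.3, 10.9, 10.8, 19.2, 10.6,
              12.1, 14.7, 17.3, 15.0, 27.4, 27.7, 23.1, 39.5, 31.8]

def fitted(x):
    """Fitted line from the slides: y_hat = 0.4705 + 20.5675 x."""
    return 0.4705 + 20.5675 * x

fits = [fitted(x) for x in perc_roadways]            # fitted values (FITS1)
resid = [y - f for y, f in zip(chlor_conc, fits)]    # residuals e_i (RESI1)

n = len(resid)
sse = sum(e ** 2 for e in resid)   # error sum of squares
mse = sse / (n - 2)                # estimate of sigma^2
sigma_hat = math.sqrt(mse)

print(f"SSE = {sse:.4f}, MSE = {mse:.4f}, sigma_hat = {sigma_hat:.4f}")
```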

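Interpretation of regression coefficients:

For any SLR analysis, $\hat{\beta}_1$ is the estimated slope. It represents the expected change in Y for a 1 unit change in X.

$$\hat{\beta}_1 = \frac{\text{rise}}{\text{run}} = \frac{\Delta Y}{\Delta X} = \frac{\hat{\beta}_1 \text{ units of } Y}{1 \text{ unit of } X}$$

[Figure: line on x-y axes, with a run of 1 unit in x and the corresponding rise of $\hat{\beta}_1$ in y marked]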


Slope: $\hat{\beta}_1 = 20.5675 = \frac{20.5675}{1} = \frac{\text{rise}}{\text{run}} = \frac{\Delta Y}{\Delta X}$

A 1 percentage point increase in the amount of land in roadways is associated with an increase of 20.5675 mg/liter in the mean chloride concentration.

[Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area, with a run of 1 and a rise of 20.5675 marked]


Intercept: $\hat{\beta}_0 = 0.4705$

When 0% of the watershed is in roadways, the expected chloride concentration is 0.4705 mg/liter (see how this relates to the hypothesis test for $\beta_0$ in the next slides).

[Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area, with the axes extended to include x = 0]

• When providing regression COEFFICIENT INTERPRETATION, YOU MUST include the relevant units for X and Y, and put it in the context of the problem.

MINITAB output from this example:

Regression Analysis: ChlorConc vs PercRoadways

Regression Equation

ChlorConc = 0.47 + 20.6 PercRoadways

Coefficients

Term            Coef  SE Coef  T-Value  P-Value
Constant       0.470     1.94     0.24    0.811
PercRoadways  20.567     2.14     9.60    0.000

Model Summary

S = 3.71607 R-sq = 85.22%

• Testing for a linear relationship between chloride concentration (Y) and % of watershed in roadways (X).

$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$

Slope estimate and standard error:

$\hat{\beta}_1 = 20.567$, $se(\hat{\beta}_1) = \sqrt{\frac{13.8092}{3.0106}} = 2.1417$

Test statistic:

$$t_0 = \frac{\hat{\beta}_1 - 0}{se(\hat{\beta}_1)} = \frac{20.567}{2.1417} = 9.603$$

Under $H_0$ true, $T_0 \sim t_{16}$

P-value: $2 \times P(T_0 > 9.603) = 4.81 \times 10^{-8}$ {very small}

Reject $H_0$.

There IS statistically significant evidence that the slope is not 0, so there is evidence of a linear relationship between chloride concentration and % of watershed in roadways.

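• Similarly, we can run a hypothesis test that the intercept equals 0...

$H_0: \beta_0 = 0$
$H_1: \beta_0 \neq 0$

Estimates: $\hat{\beta}_0 = 0.4705$

$$se(\hat{\beta}_0) = \sqrt{13.8092\left(\frac{1}{18} + \frac{0.8061^2}{3.0106}\right)} = 1.9358$$

Test statistic:

$$t_0 = \frac{\hat{\beta}_0 - 0}{se(\hat{\beta}_0)} = \frac{0.4705}{1.9358} = 0.2431$$

Under $H_0$ true, $T_0 \sim t_{16}$

P-value: $2 \times P(T_0 > 0.2431) = 0.8110$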

Fail to reject $H_0$.

The intercept estimate $\hat{\beta}_0$ is not significantly different from zero, suggesting that when there are no roadways in a watershed, there's no real evidence against the chloride concentration in the streams being zero.

We do not have evidence to suggest the intercept is anything other than zero. (So the data are consistent with a watershed with no roadways having a chloride concentration of essentially 0 mg/liter.)
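Both tests above can be reproduced from the summary values on the slides. The sketch below computes the two standard errors and t-statistics, then approximates each two-sided p-value by numerically integrating the $t_{16}$ density with Simpson's rule. That integration is a stand-in for a t-table or statistical software, chosen here only to keep the example free of third-party libraries. The helper names `t_pdf` and `two_sided_p_value` are made up for this example.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p_value(t0, df, steps=20000):
    """Approximate 2 * P(T > |t0|) via Simpson's rule on [0, |t0|].
    `steps` must be even."""
    b = abs(t0)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3          # P(0 < T < |t0|)
    return 2 * (0.5 - area)       # both tails

# Summary values taken from the slides
mse, s_xx, n, x_bar = 13.8092, 3.0106, 18, 0.8061
b0_hat, b1_hat = 0.4705, 20.567

se_b1 = math.sqrt(mse / s_xx)
se_b0 = math.sqrt(mse * (1 / n + x_bar ** 2 / s_xx))

t_slope = b1_hat / se_b1
t_intercept = b0_hat / se_b0

p_slope = two_sided_p_value(t_slope, n - 2)
p_intercept = two_sided_p_value(t_intercept, n - 2)

print(f"slope:     se = {se_b1:.4f}, t0 = {t_slope:.3f}, p = {p_slope:.3g}")
print(f"intercept: se = {se_b0:.4f}, t0 = {t_intercept:.4f}, p = {p_intercept:.4f}")
```

The slope p-value comes out on the order of $10^{-8}$ and the intercept p-value near 0.81, in line with the MINITAB output.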

[Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area, with the axes extended to include x = 0]

Adequacy of the regression model and Checking assumptions

• Is a linear model the correct model? (Is simple linear regression complex enough to capture the relationship between X & Y?)

• Are the assumptions we're making for our model reasonable, or are they violated?

• To answer these questions, we will use the residuals of the model.

The residual for observation i:

$$e_i = y_i - \hat{y}_i$$

Residuals are informative

Consider the Price vs. Age of clock data:

[Figure: scatterplot of Price Sold at Auction vs. Age of Clock (yrs), with a legend for number of Bidders]

Use a plot of residuals vs. fitted values $\hat{y}$ (below) to check adequacy of the model AND the constant variance assumption.

• If this plot is a random scatter of points above and below the horizontal reference line, then the linear model is reasonable and adequate.

• If not (i.e. if there is a non-random pattern in the residual plot), then there may be issues with our linearity assumption or perhaps other assumptions in our model, and the model may not be adequate.

• Example showing inadequacy: Kentucky Derby data set on year of race and speed of horse.

The form of the scatterplot looks a bit nonlinear, but we'll go ahead and fit a straight line model first to get the following residual plot...

Residual Plot of ‘residuals vs. fitted values’

• Residuals have a bit of a pattern (e.g. below the line, above the line, below the line), not randomly scattered above and below the horizontal line.

• Linear form may not be reasonable or adequate.

⇒ Quadratic may fit better.
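The "below, above, below" pattern is what a straight-line fit produces when the true relationship curves and levels off. The sketch below shows this on a small made-up data set (NOT the Kentucky Derby data). The helper `fit_line` and the quadratic used to generate `y` are assumptions for illustration only.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a straight-line fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

# Hypothetical curved data: y rises quickly, then levels off
x = list(range(9))
y = [16 * xi - xi ** 2 for xi in x]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Sign of each residual, in order of increasing x (and of increasing fitted value)
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(signs)  # --+++++--
```

The residuals are negative at both ends and positive in the middle, which is a systematic pattern and not a random scatter around zero.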

Beyond Adequacy

• Besides checking that our model fits the general (linear) relationship between X and Y, we also need to consider the assumptions we made in our model.

• The basic model

$$Y_i = \underbrace{\beta_0 + \beta_1 x_i}_{\text{linear relationship}} + \underbrace{\varepsilon_i}_{\text{random error term}}$$

with $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$

– Constant variance of errors (only one $\sigma^2$ for all errors)

– Normality of errors

– Independence of errors

Constant Variance Assumption

• We'll check this assumption by plotting the residuals vs. the fitted values (or vs. the explanatory variable in SLR).

• Look for a constant ‘spread’ above and below the horizontal reference line.

• NOTE: This same residual plot was also used to check linearity.

• Constant Variance and Adequacy are both checked with the same residual plot in SLR.

• Plot residuals vs. $\hat{y}$ (or in SLR, against x).

Normality Assumption

• Use normal probability plot of residuals to check normality of errors (see section 6-6 for non-normal patterns like those below).
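The coordinates behind a normal probability plot can be computed by pairing each sorted residual with a standard normal quantile. The sketch below uses the plotting position $(j - 0.5)/n$, which is one common convention. Software packages may use slightly different positions. The residuals listed are hypothetical, and `normal_plot_points` is a name made up for this example.

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Pair the j-th smallest residual with the standard normal
    quantile at cumulative probability (j - 0.5) / n."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [(std_normal.inv_cdf((j - 0.5) / n), e)
            for j, e in enumerate(ordered, start=1)]

# Hypothetical residuals, for illustration only
example_residuals = [-1.8, 0.4, 2.1, -0.6, 0.9, -0.2, -1.1, 0.3]

for z, e in normal_plot_points(example_residuals):
    print(f"{z:7.3f}  {e:5.1f}")
```

If the errors are roughly normal, plotting the residuals against these quantiles gives points that fall close to a straight line.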

Independence Assumption

• Verify that the observations are independent.

• Check how the data was collected (talk to the researcher or client).

• If data was collected over time, plot residuals against time to make sure there isn't a dependence (or trend) across time.

• Predictions and Extrapolation

– We can use our fitted model to make predictions.

– e.g. What is the expected longevity in days of a fruitfly with a thorax of length 0.80 mm?

[Figure: scatterplot of ff.data$Longevity vs. ff.data$Thorax with fitted line $\hat{Y} = -61.05 + 144.33x$]

Prediction:

$$\hat{Y}_{x=0.80} = -61.05 + 144.33(0.80) = 54.414 \text{ days}$$

– If we try to predict Y outside of the range of observed x-values, we are using the model to extrapolate (predict outside the range of the observed data).

– You should be very careful when using extrapolation. In general it should be avoided as we don't have a feel for what is going on outside the observed range.

– Predicting Y for x = 1.50 mm (which is not a value near the observed x-values) would be an extrapolation in this fruitfly example.
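A prediction helper can guard against accidental extrapolation by checking the requested x against the observed range. In the sketch below, the fitted line is the one from the slide, but the range limits 0.65 to 0.95 mm are an assumption read off the plot axis, not the exact minimum and maximum of the data. The function names are made up for this example.

```python
def predict_longevity(thorax_mm):
    """Fitted line from the slide: Y_hat = -61.05 + 144.33 x (days)."""
    return -61.05 + 144.33 * thorax_mm

# Assumed observed range of thorax lengths (mm), approximated from the plot axis
X_MIN, X_MAX = 0.65, 0.95

def is_extrapolation(thorax_mm, lo=X_MIN, hi=X_MAX):
    """True when x lies outside the (assumed) range of observed x-values."""
    return thorax_mm < lo or thorax_mm > hi

for x in (0.80, 1.50):
    label = "extrapolation" if is_extrapolation(x) else "within observed range"
    print(f"x = {x:.2f} mm -> {predict_longevity(x):.3f} days ({label})")
```

The line still returns a number at x = 1.50 mm, but nothing in the data supports the fit that far from the observed thorax lengths.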
