# new chapter 11: simple linear regression (slr) and...

Author: others

Post on 11-Oct-2020

• Chapter 11: SIMPLE LINEAR REGRESSION (SLR) AND CORRELATION

Part 2: Properties, Hypothesis tests, Model adequacy and assumptions

Sections 11-3, 11-4.1, 11-7

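• Recall the SLR estimates for $\beta_0$, $\beta_1$ and $\sigma^2$:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$$

$$\hat{\sigma}^2 = \frac{SS_E}{n-2} = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2} = MSE$$

• These estimators are unbiased estimators (a nice characteristic):

* $E[\hat{\beta}_0] = \beta_0$

* $E[\hat{\beta}_1] = \beta_1$

* $E[\hat{\sigma}^2] = \sigma^2$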

• What kind of variability do these least squares estimators have?

• Variance of $\hat{\beta}_1$ (a random variable):

$$Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sigma^2}{S_{xx}}$$

The variance of the estimated slope depends on the variability of the errors $\sigma^2$, the amount of data $n$, and the spread of the x-values.

Since we don't know $\sigma^2$, we'll plug in the estimate $\hat{\sigma}^2$ to get a usable value of the...

• Estimated standard error for $\hat{\beta}_1$:

$$se(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
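To make the role of the spread of the x-values concrete, here is a minimal Python sketch of $Var(\hat{\beta}_1) = \sigma^2 / S_{xx}$. The function name `slope_variance` and the two small designs are hypothetical illustrations, not data from the slides.

```python
def slope_variance(x, sigma2):
    """Var(beta1_hat) = sigma^2 / Sxx for fixed x-values and error variance sigma2."""
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return sigma2 / s_xx

# Hypothetical designs (not from the slides): same n, same sigma^2, different spread.
narrow = [0.9, 1.0, 1.1]   # x-values bunched together, Sxx is small
wide = [0.0, 1.0, 2.0]     # x-values spread out, Sxx is large

var_narrow = slope_variance(narrow, sigma2=1.0)
var_wide = slope_variance(wide, sigma2=1.0)
print(var_narrow, var_wide)
```

With the same error variance and sample size, the wider design gives the smaller slope variance, which is the point of the formula above.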

• Variance of $\hat{\beta}_0$ (a random variable):

$$Var(\hat{\beta}_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \right)$$

The variance of the estimated intercept depends on the error variability $\sigma^2$, the amount of data $n$, the spread of the x-values, AND how far the center of the x-values (i.e. $\bar{x}$) is from $x = 0$.

We have more precision (lower variability) for estimating $\beta_0$ when the data are near $x = 0$ (compared to being far from $x = 0$).

We'll plug in the estimate $\hat{\sigma}^2$ to get the...

• Estimated standard error for $\hat{\beta}_0$:

$$se(\hat{\beta}_0) = \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \right)}$$
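The effect of the distance between $\bar{x}$ and 0 can be sketched the same way. The helper `intercept_variance` and the two designs below are hypothetical, chosen so that both have the same $n$ and the same $S_{xx}$ and differ only in where the x-values are centered.

```python
def intercept_variance(x, sigma2):
    """Var(beta0_hat) = sigma^2 * (1/n + x_bar^2 / Sxx)."""
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return sigma2 * (1 / n + x_bar ** 2 / s_xx)

# Hypothetical designs (not from the slides): identical spread, different centers.
near_zero = [-1.0, 0.0, 1.0]      # x_bar = 0
far_from_zero = [9.0, 10.0, 11.0] # x_bar = 10

var_near = intercept_variance(near_zero, sigma2=1.0)
var_far = intercept_variance(far_from_zero, sigma2=1.0)
print(var_near, var_far)
```

Shifting the same design away from $x = 0$ leaves the slope variance unchanged but inflates the intercept variance through the $\bar{x}^2 / S_{xx}$ term.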

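• Hypothesis tests for $\beta_0$ and $\beta_1$

• For SLR, a common hypothesis test is the test for a linear relationship between $X$ and $Y$.

$H_0: \beta_1 = 0$ (no linear relationship)

$H_1: \beta_1 \neq 0$

• Under the assumption $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, we have

$$\hat{\beta}_0 \sim N\left(\beta_0,\ \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \right)\right)$$

$$\hat{\beta}_1 \sim N\left(\beta_1,\ \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)$$

• Test of interest for the slope:

$H_0: \beta_1 = 0$ (no linear relationship)

$H_1: \beta_1 \neq 0$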

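• Since we will be estimating $\sigma^2$, we will use a t-statistic:

$$T_0 = \frac{\hat{\beta}_1 - 0}{se(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{\sqrt{\dfrac{\hat{\sigma}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}}$$

Under $H_0$ true, $T_0 \sim t_{n-2}$.

From our test statistic, we can compute a p-value for our hypothesis test on the slope.

• Test of interest for the intercept:

$H_0: \beta_0 = 0$ vs. $H_1: \beta_0 \neq 0$

The test statistic:

$$T_0 = \frac{\hat{\beta}_0 - 0}{se(\hat{\beta}_0)} = \frac{\hat{\beta}_0}{\sqrt{\hat{\sigma}^2 \left( \dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \right)}}$$

Under $H_0$ true, $T_0 \sim t_{n-2}$.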

• Example: Chloride concentration in streams vs. roadway area in watersheds (Problem 11-10 in book)

An article in the Journal of Environmental Engineering reported the results of a study on the occurrence of sodium and chloride in surface streams in central Rhode Island.

They found that watersheds with a larger percentage of the land in roadways tended to have higher chloride concentrations (mg/liter) in the streams.

[Figure: scatterplot of chloride concentration (mg/liter) vs. percentage of watershed in roadways.]

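• The data:

| Obs | % of watershed in roadways (x) | Chloride concentration, mg/liter (y) |
|---|---|---|
| 1 | 0.15 | 6.6 |
| 2 | 0.19 | 4.4 |
| 3 | 0.47 | 11.8 |
| 4 | 0.57 | 9.7 |
| 5 | 0.60 | 14.3 |
| 6 | 0.63 | 10.9 |
| 7 | 0.67 | 10.8 |
| 8 | 0.69 | 19.2 |
| 9 | 0.70 | 10.6 |
| 10 | 0.70 | 12.1 |
| 11 | 0.78 | 14.7 |
| 12 | 0.78 | 17.3 |
| 13 | 0.81 | 15.0 |
| 14 | 1.05 | 27.4 |
| 15 | 1.06 | 27.7 |
| 16 | 1.30 | 23.1 |
| 17 | 1.62 | 39.5 |
| 18 | 1.74 | 31.8 |

n = 18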

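• Summary statistics:

$$\sum_{i=1}^{n} x_i = 14.51 \qquad \bar{x} = 0.8061$$

$$\sum_{i=1}^{n} y_i = 306.9 \qquad \bar{y} = 17.05$$

$$\sum_{i=1}^{n} x_i^2 = 14.7073 \qquad \sum_{i=1}^{n} y_i^2 = 6727.13$$

$$\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x}) = 61.9205 \qquad \sum_{i=1}^{n}(x_i - \bar{x})^2 = 3.0106$$

The regression coefficient estimates:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{61.9205}{3.0106} = 20.5675$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 17.05 - 20.5675(0.8061) = 0.4705$$

To estimate $\sigma^2$, we need the residuals, which are denoted as $e_i = y_i - \hat{y}_i$.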
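The arithmetic above can be reproduced from the raw data. The following is a minimal Python sketch using only the standard library; the function name `fit_slr` is a hypothetical helper, and the data are the 18 observations listed earlier.

```python
# Chloride example: x = % of watershed in roadways, y = chloride concentration (mg/liter).
x = [0.15, 0.19, 0.47, 0.57, 0.60, 0.63, 0.67, 0.69, 0.70,
     0.70, 0.78, 0.78, 0.81, 1.05, 1.06, 1.30, 1.62, 1.74]
y = [6.6, 4.4, 11.8, 9.7, 14.3, 10.9, 10.8, 19.2, 10.6,
     12.1, 14.7, 17.3, 15.0, 27.4, 27.7, 23.1, 39.5, 31.8]

def fit_slr(x, y):
    """Least squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx           # slope = Sxy / Sxx
    b0 = y_bar - b1 * x_bar    # intercept = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = fit_slr(x, y)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```

Because the code works from unrounded sums, the printed estimates may differ from the slide values in the last decimal place or so.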

• To get the residuals $e_i = y_i - \hat{y}_i$, we first need the fitted values (or predicted values), denoted as $\hat{y}_i$...

$$\hat{y}_i = 0.4705 + 20.5675\,x_i$$

Above is the fitted model or fitted line.

Below, we add the residuals (RESI1) and fitted values (FITS1) to our data set...

• $$\hat{\sigma}^2 = MSE = \frac{SS_E}{n-2} = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2} = \frac{220.9472}{16} = 13.8092$$

and $\hat{\sigma} = \sqrt{13.8092} = 3.7161$

The fitted model: $\hat{y}_i = 0.4705 + 20.5675\,x_i$
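As a check on the value of $\hat{\sigma}^2$, here is a short Python sketch that computes the fitted values, the residuals, $SS_E$ and $MSE$ directly from the data. Variable names such as `fitted` and `residuals` are illustrative choices (they play the role of the FITS1 and RESI1 columns); nothing here is MINITAB's own code.

```python
import math

# Chloride example: x = % of watershed in roadways, y = chloride concentration (mg/liter).
x = [0.15, 0.19, 0.47, 0.57, 0.60, 0.63, 0.67, 0.69, 0.70,
     0.70, 0.78, 0.78, 0.81, 1.05, 1.06, 1.30, 1.62, 1.74]
y = [6.6, 4.4, 11.8, 9.7, 14.3, 10.9, 10.8, 19.2, 10.6,
     12.1, 14.7, 17.3, 15.0, 27.4, 27.7, 23.1, 39.5, 31.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]               # y_hat_i
residuals = [yi - fi for yi, fi in zip(y, fitted)]  # e_i = y_i - y_hat_i

sse = sum(e ** 2 for e in residuals)  # SS_E
mse = sse / (n - 2)                   # sigma_hat^2, with n - 2 degrees of freedom
sigma_hat = math.sqrt(mse)
print(f"SSE = {sse:.4f}, MSE = {mse:.4f}, sigma_hat = {sigma_hat:.4f}")
```

The divisor is $n - 2$ rather than $n$ because two parameters ($\beta_0$ and $\beta_1$) were estimated before the residuals were formed.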

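• Interpretation of regression coefficients:

For any SLR analysis, $\hat{\beta}_1$ is the estimated slope. It represents the expected change in $Y$ for a 1 unit change in $X$.

$$\hat{\beta}_1 = \frac{\text{rise}}{\text{run}} = \frac{\Delta Y}{\Delta X} = \frac{\hat{\beta}_1 \text{ units of } Y}{1 \text{ unit of } X}$$

[Figure: fitted line on x-y axes, with a run of 1 and a rise of $\hat{\beta}_1$ marked.]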

• Interpretation of regression coefficients:

slope: $\hat{\beta}_1 = 20.5675 = \dfrac{20.5675}{1} = \dfrac{\text{rise}}{\text{run}} = \dfrac{\Delta Y}{\Delta X}$

A 1 percentage point increase in the amount of land in roadways is associated with an increase of 20.5675 mg/liter in the mean chloride concentration.

[Figure: scatterplot of chloride concentration (mg/liter) with the fitted line, showing a run of 1 and a rise of 20.5675.]

• Interpretation of regression coefficients:

Intercept: $\hat{\beta}_0 = 0.4705$

When 0% of the watershed is in roadways, the expected chloride concentration is 0.4705 mg/liter (see how this relates to the hypothesis test for $\beta_0$ in the next slides).

[Figure: scatterplot of chloride concentration (mg/liter), with the x-axis extended to 0.0 to show the intercept.]

• When providing regression COEFFICIENT INTERPRETATION, YOU MUST include the relevant units for $X$ and $Y$, and put it in the context of the problem.

MINITAB output from this example:

Regression Equation

ChlorConc = 0.47 + 20.6 PercRoadways

Coefficients

Term Coef SE Coef T-Value P-Value

Constant 0.470 1.94 0.24 0.811

Model Summary

S = 3.71607 R-sq = 85.22%

• Testing for a linear relationship between chloride concentration ($Y$) and % of watershed in roadways ($X$).

$H_0: \beta_1 = 0$

$H_1: \beta_1 \neq 0$

• Slope estimate and standard error:

$$\hat{\beta}_1 = 20.567 \qquad se(\hat{\beta}_1) = \sqrt{\frac{13.8092}{3.0106}} = 2.1417$$

Test statistic:

$$t_0 = \frac{\hat{\beta}_1 - 0}{se(\hat{\beta}_1)} = \frac{20.567}{2.1417} = 9.603$$

Under $H_0$ true, $T_0 \sim t_{16}$

P-value: $2 \times P(T_0 > 9.603) = 4.81 \times 10^{-8}$ {very small}

Reject $H_0$.

There IS statistically significant evidence that the slope is not 0, so there is evidence of a linear relationship between chloride concentration and % of watershed in roadways.
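The slope test can be sketched in a few lines of Python from the summary values on the slides. One assumption is added here: instead of a p-value, the sketch compares $|t_0|$ with the two-sided 5% critical value $t_{0.025,16} = 2.120$, taken from a standard t table rather than from the slides.

```python
import math

# Summary values from the example (rounded as on the slides).
mse = 13.8092   # sigma_hat^2
s_xx = 3.0106   # sum of (x_i - x_bar)^2
b1 = 20.5675    # estimated slope
n = 18

se_b1 = math.sqrt(mse / s_xx)  # estimated standard error of the slope
t0 = (b1 - 0) / se_b1          # test statistic for H0: beta1 = 0

# Assumption: alpha = 0.05, two-sided; 2.120 is the tabled t_{0.025, n-2} for n - 2 = 16.
t_crit = 2.120
reject_h0 = abs(t0) > t_crit
print(f"se = {se_b1:.4f}, t0 = {t0:.3f}, reject H0: {reject_h0}")
```

The critical-value comparison and the p-value approach lead to the same decision at the same significance level; the p-value simply reports how extreme $t_0$ is.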

• Similarly, we can run a hypothesis test that the intercept equals 0...

$H_0: \beta_0 = 0$

$H_1: \beta_0 \neq 0$

Estimates:

$$\hat{\beta}_0 = 0.4705$$

$$se(\hat{\beta}_0) = \sqrt{13.8092 \left( \frac{1}{18} + \frac{0.8061^2}{3.0106} \right)} = 1.9358$$

Test statistic:

$$t_0 = \frac{\hat{\beta}_0 - 0}{se(\hat{\beta}_0)} = \frac{0.4705}{1.9358} = 0.2431$$

Under $H_0$ true, $T_0 \sim t_{16}$

P-value: $2 \times P(T_0 > 0.2431) = 0.8110$
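For the intercept test, the p-value itself can be approximated without a statistics package. The sketch below integrates the $t_{16}$ density numerically with composite Simpson's rule; the helpers `t_pdf` and `two_sided_p_value` are hypothetical names, and this is a stand-in for whatever routine MINITAB uses, not a description of it.

```python
import math

def t_pdf(u, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)

def two_sided_p_value(t0, df, steps=2000):
    """2 * P(T > |t0|) = 1 - 2 * integral of the density over [0, |t0|].

    The integral is approximated with composite Simpson's rule (steps must be even).
    """
    b = abs(t0)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 1 - 2 * (total * h / 3)

# Summary values from the example (rounded as on the slides).
mse, n, x_bar, s_xx, b0 = 13.8092, 18, 0.8061, 3.0106, 0.4705

se_b0 = math.sqrt(mse * (1 / n + x_bar ** 2 / s_xx))
t0 = (b0 - 0) / se_b0
p_value = two_sided_p_value(t0, df=n - 2)
print(f"se = {se_b0:.4f}, t0 = {t0:.4f}, p-value = {p_value:.3f}")
```

The result should land close to the P-value of 0.811 shown in the MINITAB output. For very large $|t_0|$, subtracting from 1 loses precision, so a library routine is the better choice in practice.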

• Fail to reject $H_0$.

The intercept $\beta_0$ is not significantly different from zero, suggesting that when there are no roadways in a watershed, there's no real evidence against the chloride concentration in the streams being zero.

We do not have evidence to suggest the intercept is anything other than zero. (So, the data are consistent with a watershed with no roadways having a chloride concentration of 0 mg/liter. Failing to reject $H_0$ does not prove that the concentration is exactly zero.)

[Figure: scatterplot of chloride concentration (mg/liter), with the x-axis extended to 0.0 to show the intercept.]

• Adequacy of the regression model and Checking assumptions

• Is a linear model the correct model? (Is simple linear regression complex enough to capture the relationship between $X$ and $Y$?)

• Are the assumptions we're making for our model reasonable, or are they violated?

• To answer these questions, we will use the residuals of the model.

The residual for observation $i$:

$$e_i = y_i - \hat{y}_i$$

• Residuals are informative

Consider the Price vs. Age of clock data:

[Figure: scatterplot of price sold at auction vs. age of clock (yrs), with a legend for number of bidders.]

Use plot of residuals vs. $\hat{y}$ fitted values (below) to check adequacy of model AND constant variance assumption.

• If this plot is a random scatter of points above and below the horizontal reference line, then the linear model is reasonable and adequate.

• If not (i.e. if there is a non-random pattern in the residual plot), then there may be issues with our linearity assumption or perhaps other assumptions in our model, and the model may not be adequate.

• Example showing inadequacy: Kentucky Derby data set on year of race and speed of horse.

The form of the scatterplot looks a bit non-linear, but we'll go ahead and fit a straight line model first to get the following residual plot...

• Residual Plot of 'residuals vs. fitted values'

• Residuals have a bit of a pattern (e.g. below the line, above the line, below the line), not randomly scattered above and below the horizontal line.

• Linear form may not be reasonable or adequate.

• Besides checking that our model fits the general (linear) relationship between $X$ and $Y$, we also need to consider the assumptions we made in our model.

• The basic model

$$Y_i = \underbrace{\beta_0 + \beta_1 x_i}_{\text{linear relationship}} + \underbrace{\epsilon_i}_{\text{random error term}}$$

with $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$

– Constant variance of errors (only one $\sigma^2$ for all errors)

– Normality of errors

– Independence of errors

• Constant Variance Assumption

• We'll check this assumption by plotting the residuals vs. the fitted values (or vs. the explanatory variable in SLR).

• Look for a constant 'spread' above and below the horizontal reference line.

• NOTE: This same residual plot was also used to check linearity.

• Constant Variance and Adequacy are both checked with the same residual plot in SLR

• Plot residuals vs. $\hat{y}$ (or in SLR, against $x$).

• Normality Assumption

• Use normal probability plot of residuals to check normality of errors (see section 6-6 for non-normal patterns like those below).
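The coordinates behind a normal probability plot can be sketched with the Python standard library. The sketch below pairs the sorted residuals from the chloride example with standard normal scores; the plotting positions $(j - 0.5)/n$ are an assumption (other conventions exist), and the output is a list of points to plot rather than a finished graph.

```python
from statistics import NormalDist

# Chloride example: x = % of watershed in roadways, y = chloride concentration (mg/liter).
x = [0.15, 0.19, 0.47, 0.57, 0.60, 0.63, 0.67, 0.69, 0.70,
     0.70, 0.78, 0.78, 0.81, 1.05, 1.06, 1.30, 1.62, 1.74]
y = [6.6, 4.4, 11.8, 9.7, 14.3, 10.9, 10.8, 19.2, 10.6,
     12.1, 14.7, 17.3, 15.0, 27.4, 27.7, 23.1, 39.5, 31.8]

# Refit the line so the block is self-contained.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Assumed plotting positions (j - 0.5) / n, mapped to standard normal quantiles.
scores = [NormalDist().inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]

# Each point is (normal score, ordered residual); roughly linear points support normality.
points = list(zip(scores, sorted(residuals)))
for z, e in points:
    print(f"{z:7.3f} {e:8.3f}")
```

If the points fall close to a straight line, the normality assumption looks reasonable; strong curvature or a few points far off the line would suggest otherwise.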

• Independence Assumption

• Verify that the observations are independent.

• Check how the data was collected (talk to the researcher or client).

• If data was collected over time, plot residuals against time to make sure there isn't a dependence (or trend) across time.

• Predictions and Extrapolation

– We can use our fitted model to make predictions.

– e.g. What is the expected longevity in days of a fruitfly with a thorax of length 0.80 mm?

[Figure: scatterplot of longevity (`ff.data$Longevity`) vs. thorax length (`ff.data$Thorax`) with the fitted line.]

$$\hat{Y} = -61.05 + 144.33\,x$$

• Prediction:

$$\hat{Y}_{x=0.80} = -61.05 + 144.33(0.80) = 54.414 \text{ days}$$

– If we try to predict $Y$ outside of the range of observed x-values, we are using the model to extrapolate (predict outside the range of the observed data).

– You should be very careful when using extrapolation. In general it should be avoided, as we don't have a feel for what is going on outside the observed range.

– Predicting $\hat{Y}$ for $x = 1.50$ mm (which is not a value near the observed x-values) would be an extrapolation in this fruitfly example.
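A prediction helper can carry the extrapolation warning with it. This is a minimal sketch: the function name `predict_longevity` is hypothetical, and the observed range of 0.65 to 0.95 mm is an assumption read off the x-axis ticks of the scatterplot, not a value stated on the slides.

```python
# Fitted fruitfly model from the slides: Y_hat = -61.05 + 144.33 * x
B0, B1 = -61.05, 144.33

# Assumption: approximate range of observed thorax lengths (mm), read off the plot axis.
X_MIN, X_MAX = 0.65, 0.95

def predict_longevity(thorax_mm):
    """Return (predicted longevity in days, True if thorax_mm is outside the observed range)."""
    y_hat = B0 + B1 * thorax_mm
    is_extrapolation = not (X_MIN <= thorax_mm <= X_MAX)
    return y_hat, is_extrapolation

for length in (0.80, 1.50):
    y_hat, flagged = predict_longevity(length)
    note = " (extrapolation, interpret with caution)" if flagged else ""
    print(f"x = {length:.2f} mm -> {y_hat:.3f} days{note}")
```

Returning the flag alongside the prediction keeps the decision with the analyst: the line can still be evaluated at $x = 1.50$, but the result is marked as lying outside the data that support the model.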