
Chapter 11: SIMPLE LINEAR REGRESSION (SLR) AND CORRELATION
Part 2: Properties, Hypothesis tests, Model adequacy and assumptions
Sections 11-3, 11-4.1, 11-7
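• Recall the SLR estimates for $\beta_0$, $\beta_1$ and $\sigma^2$:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$$

$$\hat{\sigma}^2 = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2} = MSE$$

• These estimators are unbiased estimators (a nice characteristic):
* $E[\hat{\beta}_0] = \beta_0$
* $E[\hat{\beta}_1] = \beta_1$
* $E[\hat{\sigma}^2] = \sigma^2$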
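One informal way to see what unbiasedness means is a small simulation. The sketch below is not from the slides: the true values (β0 = 2, β1 = 0.5, σ = 1), the x-values 0 to 9, the seed and the number of replications are all arbitrary assumptions chosen for illustration. It repeatedly generates data from the SLR model, refits the line, and averages the three estimates.

```python
import random

def fit_slr(x, y):
    """Return (b0, b1, mse) from the least squares formulas."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return b0, b1, sse / (n - 2)

def simulate_means(beta0=2.0, beta1=0.5, sigma=1.0, reps=2000, seed=1):
    """Average the estimates over many simulated data sets (assumed settings)."""
    rng = random.Random(seed)
    x = [float(i) for i in range(10)]  # fixed, hypothetical x-values
    totals = [0.0, 0.0, 0.0]
    for _ in range(reps):
        # Generate Y_i = beta0 + beta1 * x_i + error, with N(0, sigma^2) errors
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        for k, est in enumerate(fit_slr(x, y)):
            totals[k] += est
    return tuple(t / reps for t in totals)

mean_b0, mean_b1, mean_mse = simulate_means()
print(mean_b0, mean_b1, mean_mse)
```

Under these assumed settings the three averages should land close to 2, 0.5 and 1 (the true β0, β1 and σ²), which is what the unbiasedness statements predict.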

What kind of variability do these least squares estimators have?
• Variance of $\hat{\beta}_1$ (a random variable):

$$Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sigma^2}{S_{xx}}$$

The variance of the estimated slope depends on... the variability of the errors $\sigma^2$, the amount of data $n$, and the spread of the x-values.
Since we don’t know $\sigma^2$, we’ll plug in the estimate $\hat{\sigma}^2$ to get a usable value of the...
• Estimated standard error for $\hat{\beta}_1$:

$$se(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

• Variance of $\hat{\beta}_0$ (a random variable):

$$Var(\hat{\beta}_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)$$

The variance of the estimated intercept depends on... the error variability $\sigma^2$, the amount of data $n$, the spread of the x-values, AND how far the center of the x-values (i.e. $\bar{x}$) is from $x = 0$.
We have more precision (lower variability) for estimating $\beta_0$ when the data are near $x = 0$ (compared to being far from $x = 0$).
We’ll plug in the estimate $\hat{\sigma}^2$ to get the...
• Estimated standard error for $\hat{\beta}_0$:

$$se(\hat{\beta}_0) = \sqrt{\hat{\sigma}^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)}$$
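To make the role of $\bar{x}$ concrete, here is a minimal sketch comparing two hypothetical designs with the same spread but different centers. The x-values and the value $\hat{\sigma}^2 = 1$ are assumptions made up for illustration; the function names are not from the slides.

```python
import math

def se_slope(x, sigma2_hat):
    """Estimated standard error of the slope: sqrt(sigma2_hat / Sxx)."""
    xbar = sum(x) / len(x)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return math.sqrt(sigma2_hat / sxx)

def se_intercept(x, sigma2_hat):
    """Estimated standard error of the intercept: sqrt(sigma2_hat * (1/n + xbar^2 / Sxx))."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return math.sqrt(sigma2_hat * (1 / n + xbar ** 2 / sxx))

x_near = [-2.0, -1.0, 0.0, 1.0, 2.0]    # hypothetical x-values centered at 0
x_far = [xi + 10.0 for xi in x_near]    # same spread, centered at 10
sigma2_hat = 1.0                        # assumed error variance estimate

print(se_slope(x_near, sigma2_hat), se_slope(x_far, sigma2_hat))
print(se_intercept(x_near, sigma2_hat), se_intercept(x_far, sigma2_hat))
```

Shifting the x-values leaves the slope's standard error unchanged (same $S_{xx}$) but inflates the intercept's standard error, because the $\bar{x}^2 / S_{xx}$ term grows as the data move away from $x = 0$.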

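Hypothesis tests for $\beta_0$ and $\beta_1$
• For SLR, a common hypothesis test is the test for a linear relationship between X and Y.
$H_0: \beta_1 = 0$ (no linear relationship)
$H_1: \beta_1 \neq 0$
• Under the assumption $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, we have

$$\hat{\beta}_0 \sim N\left(\beta_0,\ \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)\right)$$

$$\hat{\beta}_1 \sim N\left(\beta_1,\ \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)$$

• Test of interest for the slope:
$H_0: \beta_1 = 0$ (no linear relationship)
$H_1: \beta_1 \neq 0$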

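• Since we will be estimating $\sigma^2$, we will use a t-statistic:

$$T_0 = \frac{\hat{\beta}_1 - 0}{se(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{\sqrt{\dfrac{\hat{\sigma}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}}$$

Under $H_0$ true, $T_0 \sim t_{n-2}$.
From our test statistic, we can compute a p-value for our hypothesis test on the slope.
• Test of interest for the intercept:
$H_0: \beta_0 = 0$ vs. $H_1: \beta_0 \neq 0$
The test statistic:

$$T_0 = \frac{\hat{\beta}_0 - 0}{se(\hat{\beta}_0)} = \frac{\hat{\beta}_0}{\sqrt{\hat{\sigma}^2\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)}}$$

Under $H_0$ true, $T_0 \sim t_{n-2}$.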

• Example: Chloride concentration in Streams vs. Roadway area in watersheds (Problem 11-10 in book)
An article in the Journal of Environmental Engineering reported the results of a study on the occurrence of sodium and chloride in surface streams in central Rhode Island.
They found that watersheds with a larger percentage of the land in roadways tended to have higher chloride concentrations (mg/liter) in the streams.
[Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area in watershed (%)]

The data:
obs PercRoadways ChlorConc
1 0.15 6.6
2 0.19 4.4
3 0.47 11.8
4 0.57 9.7
5 0.60 14.3
6 0.63 10.9
7 0.67 10.8
8 0.69 19.2
9 0.70 10.6
10 0.70 12.1
11 0.78 14.7
12 0.78 17.3
13 0.81 15.0
14 1.05 27.4
15 1.06 27.7
16 1.30 23.1
17 1.62 39.5
18 1.74 31.8
n = 18

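Summary statistics:
$\sum_{i=1}^{n} x_i = 14.51$, $\bar{x} = 0.8061$
$\sum_{i=1}^{n} y_i = 306.9$, $\bar{y} = 17.05$
$\sum_{i=1}^{n} x_i^2 = 14.7073$
$\sum_{i=1}^{n} y_i^2 = 6727.13$
$\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) = 61.9205$
$\sum_{i=1}^{n} (x_i - \bar{x})^2 = 3.0106$
The regression coefficient estimates:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{61.9205}{3.0106} = 20.5675$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 17.05 - 20.5675(0.8061) = 0.4705$$

To estimate $\sigma^2$, we need the residuals, which are denoted as $e_i = y_i - \hat{y}_i$.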
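The hand calculation can be checked with a few lines of code. This is a minimal sketch using only the data table from the example; the variable names are illustrative choices, not anything from MINITAB or the book.

```python
# Data transcribed from the table in the example (18 watersheds)
perc_roadways = [0.15, 0.19, 0.47, 0.57, 0.60, 0.63, 0.67, 0.69, 0.70,
                 0.70, 0.78, 0.78, 0.81, 1.05, 1.06, 1.30, 1.62, 1.74]
chlor_conc = [6.6, 4.4, 11.8, 9.7, 14.3, 10.9, 10.8, 19.2, 10.6,
              12.1, 14.7, 17.3, 15.0, 27.4, 27.7, 23.1, 39.5, 31.8]

n = len(perc_roadways)
xbar = sum(perc_roadways) / n
ybar = sum(chlor_conc) / n

# Sxx and Sxy as defined in the least squares formulas
sxx = sum((x - xbar) ** 2 for x in perc_roadways)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(perc_roadways, chlor_conc))

b1 = sxy / sxx          # estimated slope
b0 = ybar - b1 * xbar   # estimated intercept

print(n, round(xbar, 4), round(ybar, 2))
print(round(sxx, 4), round(sxy, 4))
print(round(b1, 3), round(b0, 3))
```

Because the code carries unrounded intermediate values, the slope comes out near 20.567, in line with the MINITAB output; the last digit of the hand-calculated 20.5675 reflects rounding of $S_{xx}$ to 3.0106.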

To get the residuals $e_i = y_i - \hat{y}_i$, we first need the fitted values (or predicted values), denoted as $\hat{y}_i$...

$$\hat{y}_i = 0.4705 + 20.5675\,x_i$$

Above is the fitted model or fitted line.
Below, we add the residuals (RESI1) and fitted values (FITS1) to our data set...

$$\hat{\sigma}^2 = MSE = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2} = \frac{220.9472}{16} = 13.8092$$

and $\hat{\sigma} = \sqrt{13.8092} = 3.7161$
The fitted model: $\hat{y}_i = 0.4705 + 20.5675\,x_i$
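The fitted values, residuals and variance estimate can be reproduced from the data table. The sketch below assumes nothing beyond the formulas already given; the names `fits` and `resid` are illustrative stand-ins for the FITS1 and RESI1 columns.

```python
import math

# Data transcribed from the table in the example
x = [0.15, 0.19, 0.47, 0.57, 0.60, 0.63, 0.67, 0.69, 0.70,
     0.70, 0.78, 0.78, 0.81, 1.05, 1.06, 1.30, 1.62, 1.74]
y = [6.6, 4.4, 11.8, 9.7, 14.3, 10.9, 10.8, 19.2, 10.6,
     12.1, 14.7, 17.3, 15.0, 27.4, 27.7, 23.1, 39.5, 31.8]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

fits = [b0 + b1 * xi for xi in x]               # fitted values, y-hat_i
resid = [yi - fi for yi, fi in zip(y, fits)]    # residuals, e_i = y_i - y-hat_i

sse = sum(e ** 2 for e in resid)   # error sum of squares
mse = sse / (n - 2)                # sigma-hat squared
sigma_hat = math.sqrt(mse)

print(round(sse, 3), round(mse, 3), round(sigma_hat, 3))
```

The printed values should agree with SSE = 220.9472, MSE = 13.8092 and $\hat{\sigma}$ = 3.7161 up to rounding. As a side check, least squares residuals from a model with an intercept always sum to zero (up to floating-point error).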

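Interpretation of regression coefficients:
For any SLR analysis, $\hat{\beta}_1$ is the estimated slope. It represents the expected change in Y for a 1-unit increase in X.

$$\hat{\beta}_1 = \frac{\text{rise}}{\text{run}} = \frac{\Delta Y}{\Delta X} = \frac{\hat{\beta}_1 \text{ units of } Y}{1 \text{ unit of } X}$$

[Figure: fitted line of y vs. x, with a run of 1 and a rise of $\hat{\beta}_1$ marked]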

Interpretation of regression coefficients:
slope: $\hat{\beta}_1 = 20.5675 = \frac{20.5675}{1} = \frac{\text{rise}}{\text{run}} = \frac{\Delta Y}{\Delta X}$
A 1 percentage point increase in the amount of land in roadways is associated with an increase of 20.5675 mg/liter in the mean chloride concentration.
[Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area in watershed (%), with a run of 1 and a rise of 20.5675 marked]
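The slope interpretation can be verified directly from the fitted line: the difference between the fitted means at $x + 1$ and at $x$ does not depend on $x$. A short worked version, using only the fitted coefficients from this example:

```latex
\begin{aligned}
\hat{y}(x+1) - \hat{y}(x)
  &= \left[0.4705 + 20.5675\,(x+1)\right] - \left[0.4705 + 20.5675\,x\right] \\
  &= 20.5675 \text{ mg/liter}
\end{aligned}
```

By the same arithmetic, a change of 0.5 percentage points corresponds to $0.5 \times 20.5675 \approx 10.28$ mg/liter in the fitted mean, which is a more natural scale here since the observed roadway percentages only span about 0.15% to 1.74%.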

Interpretation of regression coefficients:
Intercept: $\hat{\beta}_0 = 0.4705$
When 0% of the watershed is in roadways, the expected chloride concentration is 0.4705 mg/liter (see how this relates to the hypothesis test for $\beta_0$ in the next slides).
[Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area in watershed (%), with the x-axis extended to 0]

• When providing a regression COEFFICIENT INTERPRETATION, YOU MUST include the relevant units for X and Y, and put it in the context of the problem.
MINITAB output from this example:
Regression Analysis: ChlorConc vs PercRoadways
Regression Equation
ChlorConc = 0.47 + 20.6 PercRoadways
Coefficients
Term          Coef    SE Coef  T-Value  P-Value
Constant      0.470   1.94     0.24     0.811
PercRoadways  20.567  2.14     9.60     0.000
Model Summary
S = 3.71607   R-sq = 85.22%
• Testing for a linear relationship between chloride concentration (Y) and % of watershed in roadways (X).
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$

Slope estimate and standard error:

$$\hat{\beta}_1 = 20.567 \qquad se(\hat{\beta}_1) = \sqrt{\frac{13.8092}{3.0106}} = 2.1417$$

Test statistic:

$$t_0 = \frac{\hat{\beta}_1 - 0}{se(\hat{\beta}_1)} = \frac{20.567}{2.1417} = 9.603$$

Under $H_0$ true, $T_0 \sim t_{16}$
P-value: $2 \times P(T_0 > 9.603) = 4.81 \times 10^{-8}$ {very small}
Reject $H_0$.
There IS statistically significant evidence that the slope is not 0, so there is evidence of a linear relationship between chloride concentration and % of watershed in roadways.
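The slope test can be reproduced numerically. The sketch below takes MSE, $S_{xx}$, $n$ and $\hat{\beta}_1$ from the slides and, to stay within the standard library, computes the two-sided p-value by integrating the t density with Simpson's rule. That integration routine is an assumption of this sketch, not how MINITAB computes p-values; a statistics library would normally supply the t distribution directly.

```python
import math

def t_pdf(u, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)

def two_sided_p(t0, df, steps=20000):
    """2 * P(T > |t0|), using composite Simpson's rule on [0, |t0|] (steps must be even)."""
    b = abs(t0)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3        # approximates P(0 < T < |t0|)
    return 1.0 - 2.0 * area     # by symmetry of the t density

mse, sxx, n = 13.8092, 3.0106, 18   # values from the slides
b1 = 20.567

se_b1 = math.sqrt(mse / sxx)
t0 = (b1 - 0) / se_b1
p_value = two_sided_p(t0, n - 2)

print(round(se_b1, 4), round(t0, 3), p_value)
```

The standard error and test statistic should match 2.1417 and 9.603, and the p-value should be on the order of $10^{-8}$, consistent with the value reported above.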

• Similarly, we can run a hypothesis test that the intercept equals 0...
$H_0: \beta_0 = 0$
$H_1: \beta_0 \neq 0$
Estimates: $\hat{\beta}_0 = 0.4705$

$$se(\hat{\beta}_0) = \sqrt{13.8092\left(\frac{1}{18} + \frac{0.8061^2}{3.0106}\right)} = 1.9358$$

Test statistic:

$$t_0 = \frac{\hat{\beta}_0 - 0}{se(\hat{\beta}_0)} = \frac{0.4705}{1.9358} = 0.2431$$

Under $H_0$ true, $T_0 \sim t_{16}$
P-value: $2 \times P(T_0 > 0.2431) = 0.8110$

Fail to reject $H_0$.
The intercept $\beta_0$ is not significantly different from zero: when there are no roadways in a watershed, there is no real evidence against the chloride concentration in the streams being zero.
We do not have evidence to suggest the intercept is anything other than zero. (So the data are consistent with a watershed with no roadways having a chloride concentration of essentially 0 mg/liter, although failing to reject $H_0$ does not prove that it is exactly zero.)
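The same decision can be reached by comparing the test statistic to a critical value instead of computing a p-value. The sketch below uses the summary values from the slides; the significance level α = 0.05 is an assumption (the slides do not state one), and 2.120 is the tabulated upper 2.5% point of the t distribution with 16 degrees of freedom.

```python
import math

mse, sxx, n, xbar = 13.8092, 3.0106, 18, 0.8061   # values from the slides
b0 = 0.4705

# Estimated standard error of the intercept
se_b0 = math.sqrt(mse * (1 / n + xbar ** 2 / sxx))
t0 = (b0 - 0) / se_b0

# Assumption: two-sided test at alpha = 0.05, so the cutoff is t_{0.025, 16}
t_crit = 2.120
reject_h0 = abs(t0) > t_crit

print(round(se_b0, 4), round(t0, 4), reject_h0)
```

Since $|t_0|$ is far below the cutoff, the sketch reports that $H_0$ is not rejected, matching the conclusion drawn from the p-value of 0.8110.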
[Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area in watershed (%), with the x-axis extended to 0]

Adequacy of the regression model and Checking assumptions
• Is a linear model the correct model? (Is simple linear regression complex enough to capture the relationship between X & Y?)
• Are the assumptions we’re making for our model reasonable, or are they violated?
• To answer these questions, we will use the residuals of the model.
The residual for observation i:
$e_i = y_i - \hat{y}_i$

Residuals are informative
Consider the Price vs. Age of clock data:
[Figure: scatterplot of Price Sold at Auction vs. Age of Clock (yrs), with a legend for Bidders (5.0 to 15.0)]
Use plot of residuals vs. fitted values $\hat{y}$ (below) to check adequacy of model AND constant variance assumption.

• If this plot is a random scatter of points above and below the horizontal reference line, then the linear model is reasonable and adequate.
• If not (i.e. if there is a nonrandom pattern in the residual plot), then there may be issues with our linearity assumption or perhaps other assumptions in our model, and the model may not be adequate.

• Example showing inadequacy: Kentucky Derby data set on year of race and speed of horse.
The form of the scatterplot looks a bit nonlinear, but we’ll go ahead and fit a straight line model first to get the following residual plot...

Residual Plot of ‘residuals vs. fitted values’
• Residuals have a bit of a pattern (e.g. below the line, above the line, below the line), not randomly scattered above and below the horizontal line.
• Linear form may not be reasonable or adequate.
⇒ Quadratic may fit better.
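The "below, above, below" pattern is what a straight-line fit produces when the true relationship curves downward. The sketch below illustrates this with synthetic data; it is NOT the Kentucky Derby data, and the curve $y = 20x - x^2$ on $x = 0, \dots, 10$ is an arbitrary assumption chosen so the arithmetic is easy to follow.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a straight-line fit."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

# Synthetic, noise-free concave data (hypothetical; for illustration only)
x = [float(i) for i in range(11)]
y = [20 * xi - xi ** 2 for xi in x]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Record whether each residual falls above (+) or below (-) the zero line
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(b0, b1)
print(signs)   # → --+++++++--
```

The residuals run negative, then positive, then negative again as x increases, which is the systematic pattern a residual plot would reveal. With real, noisy data the pattern is less clean, so the plot is still the primary tool.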

Beyond Adequacy
• Besides checking that our model fits the general (linear) relationship between X and Y, we also need to consider the assumptions we made in our model.
• The basic model

$$Y_i = \underbrace{\beta_0 + \beta_1 x_i}_{\text{linear relationship}} + \underbrace{\epsilon_i}_{\text{random error term}}$$

with $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
– Constant variance of errors (only one $\sigma^2$ for all errors)
– Normality of errors
– Independence of errors

Constant Variance Assumption
• We’ll check this assumption by plotting the residuals vs. the fitted values (or vs. the explanatory variable in SLR)
• Look for a constant ‘spread’ above and below the horizontal reference line.
• NOTE: This same residual plot was also used to check linearity.

• Constant Variance and Adequacy are both checked with the same residual plot in SLR
• Plot residuals vs. $\hat{y}$ (or in SLR, against x).

Normality Assumption
• Use normal probability plot of residuals to check normality of errors (see section 6-6 for nonnormal patterns like those below).
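The coordinates behind a normal probability plot are simple to compute. The sketch below is one common construction, not necessarily the one MINITAB uses: it pairs each sorted residual with a standard normal quantile at plotting position $(i - 0.5)/n$. The residual values are hypothetical numbers made up for illustration, and the correlation at the end is only an informal summary of how straight the plot is.

```python
from statistics import NormalDist, mean

def normal_scores(n):
    """Standard normal quantiles at plotting positions (i - 0.5) / n, i = 1..n."""
    return [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

def probability_plot_points(residuals):
    """Pair each sorted residual with its normal score."""
    ordered = sorted(residuals)
    return list(zip(ordered, normal_scores(len(ordered))))

def correlation(u, v):
    """Sample correlation between two equal-length lists."""
    ubar, vbar = mean(u), mean(v)
    suv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    suu = sum((a - ubar) ** 2 for a in u)
    svv = sum((b - vbar) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

# Hypothetical residuals (not from any data set in these slides)
residuals = [-2.1, 0.4, 1.3, -0.7, 0.1, 2.5, -1.2, 0.8, -0.3, -0.8]

points = probability_plot_points(residuals)
r = correlation([p[0] for p in points], [p[1] for p in points])

for resid, score in points:
    print(f"{resid:6.2f}  {score:7.3f}")
print(r)
```

Plotting the pairs and looking for an approximately straight line is the actual check; points bending away from the line at the ends suggest heavy or light tails.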

Independence Assumption
• Verify that the observations are independent.
• Check how the data was collected (talk to the researcher or client).
• If data was collected over time, plot residuals against time to make sure there isn’t a dependence (or trend) across time.

• Predictions and Extrapolation
– We can use our fitted model to make predictions.
– e.g. What is the expected longevity in days of a fruitfly with a thorax of length 0.80 mm?
[Figure: scatterplot of ff.data$Longevity vs. ff.data$Thorax with the fitted line]
Fitted line: $\hat{Y} = -61.05 + 144.33\,x$

Prediction:

$$\hat{Y}_{x=0.80} = -61.05 + 144.33(0.80) = 54.414 \text{ days}$$

– If we try to predict Y outside of the range of observed x-values, we are using the model to extrapolate (predict outside the range of the observed data).
– You should be very careful when using extrapolation. In general it should be avoided as we don’t have a feel for what is going on outside the observed range.
– Predicting $\hat{Y}$ for x = 1.50 mm (which is not a value near the observed x-values) would be an extrapolation in this fruitfly example.
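A prediction helper can guard against accidental extrapolation by flagging x-values outside the observed range. In the sketch below the function name is hypothetical, and the range limits 0.65 to 0.95 mm are an assumption read off the plot's axis ticks rather than the exact minimum and maximum of the data.

```python
def predict_longevity(thorax_mm, x_min=0.65, x_max=0.95):
    """Return (predicted longevity in days, extrapolation flag).

    Uses the fitted line Y-hat = -61.05 + 144.33 x from the slides.
    x_min and x_max are ASSUMED from the plot's axis, not the exact data range.
    """
    y_hat = -61.05 + 144.33 * thorax_mm
    is_extrapolation = not (x_min <= thorax_mm <= x_max)
    return y_hat, is_extrapolation

inside = predict_longevity(0.80)    # within the observed thorax lengths
outside = predict_longevity(1.50)   # well beyond the observed thorax lengths

print(inside)
print(outside)
```

The first call reproduces the 54.414-day prediction with no flag. The second still returns a number, but the flag signals that the line is being used where no data support it.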