Chapter 11: SIMPLE LINEAR REGRESSION (SLR) AND CORRELATION
Part 2: Properties, Hypothesis tests, Model adequacy and assumptions
Sections 11-3, 11-4.1, 11-7
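• Recall the SLR estimates for $\beta_0$, $\beta_1$ and $\sigma^2$:
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$$
$$\hat{\sigma}^2 = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2} = MSE$$
• These estimators are unbiased estimators (a nice characteristic):
* $E[\hat{\beta}_0] = \beta_0$
* $E[\hat{\beta}_1] = \beta_1$
* $E[\hat{\sigma}^2] = \sigma^2$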
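Unbiasedness can be illustrated with a small simulation: generate many data sets from a known line, fit each one, and average the estimates. The sketch below is only an illustration. The true values (beta0 = 2, beta1 = 0.5, sigma = 1), the x-values, the number of replications, and the seed are all arbitrary assumptions, not values from the slides.

```python
import random

def fit_slr(x, y):
    """Return (b0, b1, mse) using the least squares formulas above."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return b0, b1, sse / (n - 2)

def simulate_means(beta0=2.0, beta1=0.5, sigma=1.0, reps=2000, seed=1):
    """Average the three estimates over many simulated data sets."""
    rng = random.Random(seed)
    x = [float(i) for i in range(20)]  # fixed x-values, reused in every replication
    totals = [0.0, 0.0, 0.0]
    for _ in range(reps):
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        for k, estimate in enumerate(fit_slr(x, y)):
            totals[k] += estimate
    return tuple(t / reps for t in totals)

mean_b0, mean_b1, mean_mse = simulate_means()
print(mean_b0, mean_b1, mean_mse)
```

The three averages should land close to 2, 0.5 and 1 (the true beta0, beta1 and sigma squared), which is what unbiasedness predicts.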
-
What kind of variability do these least squares estimators have?
• Variance of $\hat{\beta}_1$ (a random variable):
$$Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sigma^2}{S_{xx}}$$
The variance of the estimated slope depends on... the variability of the errors $\sigma^2$, the amount of data $n$, and the spread of the x-values.
Since we don’t know $\sigma^2$, we’ll plug in the estimate $\hat{\sigma}^2$ to get a usable value of the...
• Estimated standard error for $\hat{\beta}_1$:
$$se(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
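For readers who want to see where the slope variance comes from, here is a short derivation sketch. It assumes the x-values are fixed and the errors are uncorrelated with common variance $\sigma^2$ (the model assumptions listed later in these slides). The weights $c_i$ are notation introduced here for convenience.

$$
\begin{aligned}
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n}(x_i-\bar{x})(Y_i-\bar{Y})}{S_{xx}} = \sum_{i=1}^{n} c_i Y_i, \qquad c_i = \frac{x_i-\bar{x}}{S_{xx}} \\
Var(\hat{\beta}_1) &= \sum_{i=1}^{n} c_i^2\, Var(Y_i) = \sigma^2 \sum_{i=1}^{n} \frac{(x_i-\bar{x})^2}{S_{xx}^2} = \frac{\sigma^2}{S_{xx}}
\end{aligned}
$$

The first line uses $\sum_{i=1}^{n}(x_i-\bar{x})\bar{Y} = 0$, so the $\bar{Y}$ term drops out.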
-
• Variance of $\hat{\beta}_0$ (a random variable):
$$Var(\hat{\beta}_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \right)$$
The variance of the estimated intercept depends on... the error variability $\sigma^2$, the amount of data $n$, the spread of the x-values, AND how far the center of the x-values (i.e. $\bar{x}$) is from $x = 0$.
We have more precision (lower variability) for estimating $\beta_0$ when the data are near $x = 0$ (compared to being far from $x = 0$).
We’ll plug in the estimate $\hat{\sigma}^2$ to get the...
• Estimated standard error for $\hat{\beta}_0$:
$$se(\hat{\beta}_0) = \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \right)}$$
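The effect of the center of the x-values on the intercept's precision can be checked numerically. The sketch below evaluates both standard error formulas for two hypothetical sets of x-values with the same spread, one centered at 0 and one centered at 10. The x-values and the error-variance estimate of 4.0 are made up for illustration.

```python
import math

def standard_errors(x, sigma2_hat):
    """Estimated standard errors (se_b0, se_b1) for the SLR intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    se_b1 = math.sqrt(sigma2_hat / sxx)
    se_b0 = math.sqrt(sigma2_hat * (1.0 / n + x_bar ** 2 / sxx))
    return se_b0, se_b1

# Hypothetical x-values and error-variance estimate, chosen only for illustration.
x_near = [-2.0, -1.0, 0.0, 1.0, 2.0]      # centered at x = 0
x_far = [xi + 10.0 for xi in x_near]      # same spread, centered at x = 10
sigma2_hat = 4.0

se_b0_near, se_b1_near = standard_errors(x_near, sigma2_hat)
se_b0_far, se_b1_far = standard_errors(x_far, sigma2_hat)
print(se_b0_near, se_b1_near)
print(se_b0_far, se_b1_far)
```

Shifting the x-values leaves the slope's standard error unchanged, because $S_{xx}$ is the same, but inflates the intercept's standard error through the $\bar{x}^2$ term.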
-
Hypothesis tests for β0 and β1
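• For SLR, a common hypothesis test is the test for a linear relationship between X and Y.
$H_0: \beta_1 = 0$ (no linear relationship)
$H_1: \beta_1 \neq 0$
• Under the assumption $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, we have
$$\hat{\beta}_0 \sim N\left(\beta_0,\ \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)\right)$$
$$\hat{\beta}_1 \sim N\left(\beta_1,\ \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)$$
• Test of interest for the slope:
$H_0: \beta_1 = 0$ (no linear relationship)
$H_1: \beta_1 \neq 0$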
-
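• Since we will be estimating $\sigma^2$, we will use a t-statistic:
$$T_0 = \frac{\hat{\beta}_1 - 0}{se(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{\sqrt{\dfrac{\hat{\sigma}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}}$$
Under H0 true, $T_0 \sim t_{n-2}$.
From our test statistic, we can compute a p-value for our hypothesis test on the slope.
• Test of interest for the intercept:
$H_0: \beta_0 = 0$ vs. $H_1: \beta_0 \neq 0$
The test statistic:
$$T_0 = \frac{\hat{\beta}_0 - 0}{se(\hat{\beta}_0)} = \frac{\hat{\beta}_0}{\sqrt{\hat{\sigma}^2\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)}}$$
Under H0 true, $T_0 \sim t_{n-2}$.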
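The two-sided p-value and the equivalent fixed-level decision rule can be written out explicitly. This is a sketch in the notation above: $t_0$ denotes the observed value of $T_0$ and $T_{n-2}$ a t random variable with $n-2$ degrees of freedom. The significance level $\alpha$ and the critical value $t_{\alpha/2,\,n-2}$ are symbols introduced here, not taken from the slides.

$$
\text{p-value} = 2\,P\left(T_{n-2} > |t_0|\right), \qquad \text{reject } H_0 \text{ at level } \alpha \text{ if } |t_0| > t_{\alpha/2,\,n-2}
$$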
-
• Example: Chloride concentration in Streams vs. Roadway area in watersheds (Problem 11-10 in book)
An article in the Journal of Environmental Engineering reported the results of a study on the occurrence of sodium and chloride in surface streams in central Rhode Island.
They found that watersheds with a larger percentage of the land in roadways tended to have higher chloride concentrations (mg/liter) in the streams.
[Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area in watershed (%)]
-
The data:
obs PercRoadways ChlorConc
1 0.15 6.6
2 0.19 4.4
3 0.47 11.8
4 0.57 9.7
5 0.60 14.3
6 0.63 10.9
7 0.67 10.8
8 0.69 19.2
9 0.70 10.6
10 0.70 12.1
11 0.78 14.7
12 0.78 17.3
13 0.81 15.0
14 1.05 27.4
15 1.06 27.7
16 1.30 23.1
17 1.62 39.5
18 1.74 31.8
n = 18
-
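Summary statistics:
$\sum_{i=1}^{n} x_i = 14.51$, $\bar{x} = 0.8061$
$\sum_{i=1}^{n} y_i = 306.9$, $\bar{y} = 17.05$
$\sum_{i=1}^{n} x_i^2 = 14.7073$
$\sum_{i=1}^{n} y_i^2 = 6727.13$
$\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x}) = 61.9205$
$\sum_{i=1}^{n}(x_i - \bar{x})^2 = 3.0106$
The regression coefficient estimates:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{61.9205}{3.0106} = 20.5675$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = 17.05 - 20.5675(0.8061) = 0.4705$$
To estimate $\sigma^2$, we need the residuals, which are denoted as $e_i = y_i - \hat{y}_i$.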
-
To get the residuals $e_i = y_i - \hat{y}_i$, we first need the fitted values (or predicted values), denoted as $\hat{y}_i$...
$$\hat{y}_i = 0.4705 + 20.5675\,x_i$$
Above is the fitted model or fitted line.
Below, we add the residuals (RESI1) and fitted values (FITS1) to our data set...
-
$$\hat{\sigma}^2 = MSE = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2} = \frac{220.9472}{16} = 13.8092$$
and $\hat{\sigma} = \sqrt{13.8092} = 3.7161$
The fitted model: $\hat{y}_i = 0.4705 + 20.5675\,x_i$
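The coefficient estimates, fitted values, residuals, and MSE can be reproduced directly from the 18 observations in the data table. The following is a minimal sketch using only the formulas from these slides. The variable names are chosen here for illustration, and small differences in the last digit can arise from rounding in the hand calculations.

```python
import math

# Data from the table above: % of watershed in roadways (x) and
# chloride concentration in mg/liter (y), n = 18.
perc_roadways = [0.15, 0.19, 0.47, 0.57, 0.60, 0.63, 0.67, 0.69, 0.70,
                 0.70, 0.78, 0.78, 0.81, 1.05, 1.06, 1.30, 1.62, 1.74]
chlor_conc = [6.6, 4.4, 11.8, 9.7, 14.3, 10.9, 10.8, 19.2, 10.6,
              12.1, 14.7, 17.3, 15.0, 27.4, 27.7, 23.1, 39.5, 31.8]

n = len(perc_roadways)
x_bar = sum(perc_roadways) / n
y_bar = sum(chlor_conc) / n
sxx = sum((x - x_bar) ** 2 for x in perc_roadways)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(perc_roadways, chlor_conc))

b1 = sxy / sxx             # estimated slope
b0 = y_bar - b1 * x_bar    # estimated intercept

fitted = [b0 + b1 * x for x in perc_roadways]            # y-hat values
residuals = [y - f for y, f in zip(chlor_conc, fitted)]  # e_i = y_i - y-hat_i
sse = sum(e ** 2 for e in residuals)
mse = sse / (n - 2)        # estimate of sigma squared
sigma_hat = math.sqrt(mse)

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"SSE = {sse:.4f}, MSE = {mse:.4f}, sigma_hat = {sigma_hat:.4f}")
```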
-
Interpretation of regression coefficients:
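For any SLR analysis, $\hat{\beta}_1$ is the estimated slope. It represents the expected change in Y for a 1 unit change in X.
$$\hat{\beta}_1 = \frac{\text{rise}}{\text{run}} = \frac{\Delta Y}{\Delta X} = \frac{\hat{\beta}_1 \text{ units of } Y}{1 \text{ unit of } X}$$
[Figure: plot of y vs. x with a run of 1 unit and the corresponding rise of $\hat{\beta}_1$ marked]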
-
Interpretation of regression coefficients:
slope: $\hat{\beta}_1 = 20.5675 = \dfrac{20.5675}{1} = \dfrac{\text{rise}}{\text{run}} = \dfrac{\Delta Y}{\Delta X}$
A 1 percentage point increase in the amount of land in roadways is associated with an increase of 20.5675 mg/liter in the mean chloride concentration.
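The same reading scales to other changes in X. As a worked illustration (the 0.5 percentage point change is a hypothetical value chosen here, not one from the slides):

$$
\Delta \hat{y} = \hat{\beta}_1\,\Delta x = 20.5675 \times 0.5 \approx 10.28 \text{ mg/liter}
$$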
[Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area in watershed (%), with a run of 1 and a rise of 20.5675 marked]
-
Interpretation of regression coefficients:
Intercept: β̂0 = 0.4705
When 0% of the watershed is in roadways, the expected chloride concentration is 0.4705 mg/liter (see how this relates to the hypothesis test for β0 in the next slides).
[Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area in watershed (%), with the x-axis extended to 0]
-
• When providing a regression COEFFICIENT INTERPRETATION, YOU MUST include the relevant units for X and Y, and put it in the context of the problem.
MINITAB output from this example:
Regression Analysis: ChlorConc vs PercRoadways
Regression Equation
ChlorConc = 0.47 + 20.6 PercRoadways
Coefficients
Term Coef SE Coef T-Value P-Value
Constant 0.470 1.94 0.24 0.811
PercRoadways 20.567 2.14 9.60 0.000
Model Summary
S = 3.71607 R-sq = 85.22%
• Testing for a linear relationship between chloride concentration (Y) and % of watershed in roadways (X).
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
-
Slope estimate and standard error:
$\hat{\beta}_1 = 20.567$, $se(\hat{\beta}_1) = \sqrt{\dfrac{13.8092}{3.0106}} = 2.1417$
Test statistic:
$$t_0 = \frac{\hat{\beta}_1 - 0}{se(\hat{\beta}_1)} = \frac{20.567}{2.1417} = 9.603$$
Under H0 true, $T_0 \sim t_{16}$
P-value: $2 \times P(T_0 > 9.603) = 4.81 \times 10^{-8}$ {very small}
Reject H0.
There IS statistically significant evidence that the slope is not 0, so there is evidence of a linear relationship between chloride concentration and % of watershed in roadways.
-
• Similarly, we can run a hypothesis test that the intercept equals 0...
$H_0: \beta_0 = 0$
$H_1: \beta_0 \neq 0$
Estimates:
$\hat{\beta}_0 = 0.4705$
$$se(\hat{\beta}_0) = \sqrt{13.8092\left(\frac{1}{18} + \frac{0.8061^2}{3.0106}\right)} = 1.9358$$
Test statistic:
$$t_0 = \frac{\hat{\beta}_0 - 0}{se(\hat{\beta}_0)} = \frac{0.4705}{1.9358} = 0.2431$$
Under H0 true, $T_0 \sim t_{16}$
P-value: $2 \times P(T_0 > 0.2431) = 0.8110$
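The p-values for the slope and intercept tests can be approximated without statistical software by numerically integrating the t density. The sketch below uses composite Simpson's rule, which is a generic numerical method swapped in here and not how MINITAB computes its output. The function names and the number of integration steps are choices made for this illustration.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p_value(t0, df, steps=10000):
    """Approximate 2 * P(T > |t0|) with composite Simpson's rule on [0, |t0|]."""
    b = abs(t0)
    if b == 0:
        return 1.0
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_area = total * h / 3  # approximates P(0 < T < |t0|)
    return 2 * (0.5 - central_area)

df = 18 - 2                      # n - 2 degrees of freedom
t_slope = 20.567 / 2.1417        # estimate / standard error, from the slides
t_intercept = 0.4705 / 1.9358

p_slope = two_sided_p_value(t_slope, df)
p_intercept = two_sided_p_value(t_intercept, df)
print(f"slope:     t0 = {t_slope:.3f}, p-value = {p_slope:.3g}")
print(f"intercept: t0 = {t_intercept:.4f}, p-value = {p_intercept:.4f}")
```

The results should agree closely with the p-values reported on the slides: a very small p-value for the slope and roughly 0.81 for the intercept.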
-
Fail to reject H0.
This intercept or β0 is not significantly different from zero, suggesting that when there are no roadways in a watershed, there’s no real evidence against the chloride concentration in the streams being zero.
We do not have evidence to suggest the intercept is anything other than zero. (So, the data are consistent with a watershed with no roadways having a chloride concentration of essentially 0 mg/liter.)
[Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area in watershed (%), with the x-axis extended to 0]
-
Adequacy of the regression model and Checking assumptions
• Is a linear model the correct model? (Is simple linear regression complex enough to capture the relationship between X & Y?)
• Are the assumptions we’re making for our model reasonable, or are they violated?
• To answer these questions, we will use the residuals of the model.
The residual for observation i:
$$e_i = y_i - \hat{y}_i$$
-
Residuals are informative
Consider the Price vs. Age of clock data:
[Figure: scatterplot of Price Sold at Auction vs. Age of Clock (yrs), with a legend for the number of Bidders]
Use a plot of residuals vs. fitted values $\hat{y}$ (below) to check the adequacy of the model AND the constant variance assumption.
-
• If this plot is a random scatter of points above and below the horizontal reference line, then the linear model is reasonable and adequate.
• If not (i.e. if there is a non-random pattern in the residual plot), then there may be issues with our linearity assumption or perhaps other assumptions in our model, and the model may not be adequate.
-
• Example showing inadequacy: Kentucky Derby data set on year of race and speed of horse.
The form of the scatterplot looks a bit non-linear, but we’ll go ahead and fit a straight line model first to get the following residual plot...
-
Residual Plot of ‘residuals vs. fitted values’
• Residuals have a bit of a pattern (e.g. below the line, above the line, below the line), not randomly scattered above and below the horizontal line.
• Linear form may not be reasonable or adequate.
⇒ Quadratic may fit better.
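The "below, above, below" pattern is what a straight-line fit produces when the true trend bends downward. The sketch below demonstrates this on synthetic, made-up data with a concave quadratic trend. It does not use the Kentucky Derby data, which is not reproduced in these slides, and the coefficients are arbitrary.

```python
def fit_line(x, y):
    """Least squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Synthetic data with a concave quadratic trend and no noise (NOT the Kentucky Derby data).
x = [float(i) for i in range(11)]
y = [2.0 + 3.0 * xi - 0.5 * xi ** 2 for xi in x]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# One character per observation, in order of x: '+' above the line, '-' below it.
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(signs)  # prints --+++++++--
```

A run of negative residuals, then positive, then negative again is a systematic pattern rather than random scatter, which is the signal that a curved (for example quadratic) model may fit better.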
-
Beyond Adequacy
• Besides checking that our model fits the general (linear) relationship between X and Y, we also need to consider the assumptions we made in our model.
• The basic model
$$Y_i = \underbrace{\beta_0 + \beta_1 x_i}_{\text{linear relationship}} + \underbrace{\epsilon_i}_{\text{random error term}}$$
with $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
– Constant variance of errors (only one $\sigma^2$ for all errors)
– Normality of errors
– Independence of errors
-
Constant Variance Assumption
• We’ll check this assumption by plotting the residuals vs. the fitted values (or vs. the explanatory variable in SLR).
• Look for a constant ‘spread’ above and below the horizontal reference line.
• NOTE: This same residual plot was also used to check linearity.
-
• Constant Variance and Adequacy are both checked with the same residual plot in SLR
• Plot residuals vs. ŷ (or in SLR, against x).
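As a rough numeric companion to the visual check, one can compare the typical size of the residuals at small and large x-values. The sketch below does this on two synthetic data sets, one with constant error variance and one whose error standard deviation grows with x. It is an informal illustration, not a formal test and not a procedure from these slides. All data, parameter values, and function names are made up.

```python
import math
import random

def fit_line(x, y):
    """Least squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def rms(values):
    """Root mean square, used here as a simple measure of residual spread."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def spread_ratio(x, y):
    """Residual spread in the upper half of x divided by that in the lower half."""
    b0, b1 = fit_line(x, y)
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    half = len(x) // 2  # assumes x is sorted in increasing order
    return rms(residuals[half:]) / rms(residuals[:half])

rng = random.Random(0)
x = [1.0 + 9.0 * i / 999 for i in range(1000)]  # sorted x-values from 1 to 10

# Synthetic responses: same line, different error behavior.
y_constant = [1.0 + 2.0 * xi + rng.gauss(0.0, 2.0) for xi in x]      # constant variance
y_growing = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.5 * xi) for xi in x]  # spread grows with x

ratio_constant = spread_ratio(x, y_constant)
ratio_growing = spread_ratio(x, y_growing)
print(round(ratio_constant, 2), round(ratio_growing, 2))
```

A ratio near 1 matches a residual plot with an even band of points, while a ratio well above 1 corresponds to the funnel shape that signals non-constant variance.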
-
Normality Assumption
• Use a normal probability plot of residuals to check normality of errors (see section 6-6 for non-normal patterns like those below).
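The coordinates of a normal probability plot can be computed by pairing the sorted residuals with standard normal quantiles. The sketch below assumes the plotting positions (j - 0.5)/n, which is one common convention; software packages may use slightly different positions. The residual values are hypothetical and the function name is chosen here for illustration.

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Pair each sorted residual with a standard normal quantile.

    Uses plotting positions (j - 0.5) / n for j = 1, ..., n.
    """
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [(std_normal.inv_cdf((j - 0.5) / n), e)
            for j, e in enumerate(ordered, start=1)]

# Hypothetical residuals, for illustration only.
example_residuals = [1.2, -0.4, 0.3, -1.9, 0.8, -0.1, 2.1, -0.9, 0.05]
points = normal_plot_points(example_residuals)
for z, e in points:
    print(f"{z:7.3f}  {e:6.2f}")
```

If the errors are approximately normal, plotting the second column against the first should give points that fall close to a straight line.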
-
Independence Assumption
• Verify that the observations are independent.
• Check how the data was collected (talk to the researcher or client).
• If data was collected over time, plot residuals against time to make sure there isn’t a dependence (or trend) across time.
-
• Predictions and Extrapolation
– We can use our fitted model to make predictions.
– e.g. What is the expected longevity in days of a fruitfly with a thorax of length 0.80 mm?
[Figure: scatterplot of fruitfly longevity (ff.data$Longevity) vs. thorax length (ff.data$Thorax), annotated with the fitted model]
$\hat{Y} = -61.05 + 144.33\,x$
-
Prediction:
$$\hat{Y}_{x=0.80} = -61.05 + 144.33(0.80) = 54.414 \text{ days}$$
– If we try to predict Y outside of the range of observed x-values, we are using the model to extrapolate (predict outside the range of the observed data).
– You should be very careful when using extrapolation. In general it should be avoided, as we don’t have a feel for what is going on outside the observed range.
– Predicting $\hat{Y}$ for x = 1.50 mm (which is not a value near the observed x-values) would be an extrapolation in this fruitfly example.
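A prediction helper can make the extrapolation warning concrete by flagging x-values outside the observed range. The sketch below uses the fitted coefficients from the slide. The range limits 0.65 and 0.95 mm are an assumption read off the plot's x-axis ticks, since the exact minimum and maximum thorax lengths are not given, and the function name is hypothetical.

```python
B0, B1 = -61.05, 144.33  # fitted intercept and slope from the slide

# Assumed observed thorax range in mm, taken from the plot's x-axis ticks;
# the exact minimum and maximum of the data are not given in the slides.
X_MIN, X_MAX = 0.65, 0.95

def predict_longevity(thorax_mm):
    """Return (predicted longevity in days, True if the x-value is an extrapolation)."""
    is_extrapolation = not (X_MIN <= thorax_mm <= X_MAX)
    return B0 + B1 * thorax_mm, is_extrapolation

print(predict_longevity(0.80))  # inside the assumed range
print(predict_longevity(1.50))  # flagged as an extrapolation
```

The model will happily return a number for x = 1.50 mm, which is exactly why the flag is useful: the arithmetic works, but nothing in the data supports the line that far out.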