
Author: others

Post on 11-Oct-2020


  • Chapter 11: SIMPLE LINEAR REGRESSION (SLR) AND CORRELATION

    Part 2: Properties, Hypothesis tests, Model adequacy and assumptions

    Sections 11-3, 11-4.1, 11-7

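    • Recall the SLR estimates for $\beta_0$, $\beta_1$ and $\sigma^2$:

    $$\hat\beta_0 = \bar y - \hat\beta_1 \bar x$$

    $$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{S_{xy}}{S_{xx}}$$

    $$\hat\sigma^2 = \frac{SS_E}{n-2} = \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n-2} = MSE$$

    • These estimators are unbiased estimators (a nice characteristic):

    * $E[\hat\beta_0] = \beta_0$
    * $E[\hat\beta_1] = \beta_1$
    * $E[\hat\sigma^2] = \sigma^2$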
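Unbiasedness can be illustrated with a small simulation. The sketch below is not part of the original slides: the true parameters (β0 = 1, β1 = 2, σ = 1), the x-values and the helper names `fit_slr` and `average_estimates` are all assumptions chosen for illustration. It averages the three estimates over many simulated data sets; the averages should land close to the true values.

```python
import random


def fit_slr(x, y):
    """Return (b0, b1, mse) for the least squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return b0, b1, sse / (n - 2)


def average_estimates(beta0, beta1, sigma, x, reps, seed=0):
    """Average the SLR estimates over `reps` simulated data sets."""
    rng = random.Random(seed)
    totals = [0.0, 0.0, 0.0]
    for _ in range(reps):
        # Simulate Y_i = beta0 + beta1 * x_i + error, with iid N(0, sigma^2) errors.
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        for k, estimate in enumerate(fit_slr(x, y)):
            totals[k] += estimate
    return tuple(total / reps for total in totals)


# Hypothetical design: x = 0, 1, ..., 9 with true beta0 = 1, beta1 = 2, sigma = 1.
x_values = [float(i) for i in range(10)]
avg_b0, avg_b1, avg_mse = average_estimates(1.0, 2.0, 1.0, x_values, reps=2000)
print(f"average b0 = {avg_b0:.3f}, average b1 = {avg_b1:.3f}, average MSE = {avg_mse:.3f}")
```

Any single simulated data set gives estimates that miss the true values; unbiasedness is a statement about the average over repeated samples.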

  • What kind of variability do these least squares estimators have?

    • Variance of $\hat\beta_1$ (a random variable):

    $$\mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{\sigma^2}{S_{xx}}$$

    The variance of the estimated slope depends on the variability of the errors $\sigma^2$, the amount of data $n$, and the spread of the x-values.

    Since we don't know $\sigma^2$, we'll plug in the estimate $\hat\sigma^2$ to get a usable value of the...

    • Estimated standard error for $\hat\beta_1$:

    $$se(\hat\beta_1) = \sqrt{\frac{\hat\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}}$$
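The variance formula for the slope follows from writing the estimator as a linear combination of the responses. A short sketch of that step, assuming independent errors with common variance $\sigma^2$ (the weights $c_i$ are notation introduced here, not in the slides):

```latex
\begin{aligned}
\hat\beta_1 &= \frac{S_{xy}}{S_{xx}} = \sum_{i=1}^n c_i Y_i,
  \qquad c_i = \frac{x_i - \bar x}{S_{xx}}, \\
\mathrm{Var}(\hat\beta_1) &= \sum_{i=1}^n c_i^2 \,\mathrm{Var}(Y_i)
  = \sigma^2 \,\frac{\sum_{i=1}^n (x_i - \bar x)^2}{S_{xx}^2}
  = \frac{\sigma^2}{S_{xx}}.
\end{aligned}
```

The first line uses $\sum_{i=1}^n (x_i - \bar x)\,\bar Y = 0$, so subtracting $\bar Y$ in $S_{xy}$ does not change the sum.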

  • • Variance of $\hat\beta_0$ (a random variable):

    $$\mathrm{Var}(\hat\beta_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right)$$

    The variance of the estimated intercept depends on the error variability $\sigma^2$, the amount of data $n$, the spread of the x-values, AND how far the center of the x-values (i.e. $\bar x$) is from $x = 0$.

    We have more precision (lower variability) for estimating $\beta_0$ when the data are near $x = 0$ (compared to being far from $x = 0$).

    We'll plug in the estimate $\hat\sigma^2$ to get the...

    • Estimated standard error for $\hat\beta_0$:

    $$se(\hat\beta_0) = \sqrt{\hat\sigma^2 \left( \frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right)}$$
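The effect of the distance between $\bar x$ and 0 can be seen numerically. The sketch below uses two hypothetical sets of x-values with the same spread but different centers; the function name `intercept_variance` and the choice σ² = 1 are assumptions for illustration only.

```python
def intercept_variance(x, sigma2):
    """Var(b0) = sigma2 * (1/n + x_bar^2 / S_xx) for fixed x-values."""
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return sigma2 * (1.0 / n + x_bar ** 2 / s_xx)


# Hypothetical x-values: identical spread (S_xx = 10), different centers.
centered_at_zero = [-2.0, -1.0, 0.0, 1.0, 2.0]
centered_at_ten = [8.0, 9.0, 10.0, 11.0, 12.0]

var_near = intercept_variance(centered_at_zero, sigma2=1.0)
var_far = intercept_variance(centered_at_ten, sigma2=1.0)
print(f"Var(b0) with x centered at 0:  {var_near:.2f}")
print(f"Var(b0) with x centered at 10: {var_far:.2f}")
```

With the same n, σ² and spread, moving the x-values away from 0 inflates the intercept variance through the $\bar x^2 / S_{xx}$ term.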

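  • Hypothesis tests for β0 and β1

    • For SLR, a common hypothesis test is the test for a linear relationship between X and Y.

    $H_0: \beta_1 = 0$ (no linear relationship)
    $H_1: \beta_1 \neq 0$

    • Under the assumption $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, we have

    $$\hat\beta_0 \sim N\left( \beta_0,\ \sigma^2 \left( \frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right) \right)$$

    $$\hat\beta_1 \sim N\left( \beta_1,\ \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right)$$

    • Test of interest for the slope:

    $H_0: \beta_1 = 0$ (no linear relationship)
    $H_1: \beta_1 \neq 0$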

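  • • Since we will be estimating $\sigma^2$, we will use a t-statistic:

    $$T_0 = \frac{\hat\beta_1 - 0}{se(\hat\beta_1)} = \frac{\hat\beta_1}{\sqrt{\hat\sigma^2 / \sum_{i=1}^n (x_i - \bar x)^2}}$$

    Under H0 true, $T_0 \sim t_{n-2}$.

    From our test statistic, we can compute a p-value for our hypothesis test on the slope.

    • Test of interest for the intercept:

    $H_0: \beta_0 = 0$ vs. $H_1: \beta_0 \neq 0$

    The test statistic:

    $$T_0 = \frac{\hat\beta_0 - 0}{se(\hat\beta_0)} = \frac{\hat\beta_0}{\sqrt{\hat\sigma^2 \left( \frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right)}}$$

    Under H0 true, $T_0 \sim t_{n-2}$.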

  • • Example: Chloride concentration in streams vs. roadway area in watersheds (Problem 11-10 in book)

    An article in the Journal of Environmental Engineering reported the results of a study on the occurrence of sodium and chloride in surface streams in central Rhode Island.

    They found that watersheds with a larger percentage of the land in roadways tended to have higher chloride concentrations (mg/liter) in the streams.

    [Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area in watershed (%)]

  • The data:

    obs  PercRoadways  ChlorConc
      1          0.15        6.6
      2          0.19        4.4
      3          0.47       11.8
      4          0.57        9.7
      5          0.60       14.3
      6          0.63       10.9
      7          0.67       10.8
      8          0.69       19.2
      9          0.70       10.6
     10          0.70       12.1
     11          0.78       14.7
     12          0.78       17.3
     13          0.81       15.0
     14          1.05       27.4
     15          1.06       27.7
     16          1.30       23.1
     17          1.62       39.5
     18          1.74       31.8

    n = 18

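  • Summary statistics:

    $\sum_{i=1}^n x_i = 14.51$, $\bar x = 0.8061$
    $\sum_{i=1}^n y_i = 306.9$, $\bar y = 17.05$
    $\sum_{i=1}^n x_i^2 = 14.7073$
    $\sum_{i=1}^n y_i^2 = 6727.13$
    $\sum_{i=1}^n (y_i - \bar y)(x_i - \bar x) = 61.9205$
    $\sum_{i=1}^n (x_i - \bar x)^2 = 3.0106$

    The regression coefficient estimates:

    $$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{61.9205}{3.0106} = 20.5675$$

    $$\hat\beta_0 = \bar y - \hat\beta_1 \bar x = 17.05 - 20.5675(0.8061) = 0.4705$$

    To estimate $\sigma^2$, we need the residuals, which are denoted as $e_i = y_i - \hat y_i$.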
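The summary statistics and coefficient estimates can be reproduced directly from the 18 observations. This is a minimal sketch in plain Python rather than the software used in the slides; the variable names are illustrative assumptions.

```python
# Data from the example: % of watershed in roadways (x) and chloride concentration (y).
percent_roadways = [0.15, 0.19, 0.47, 0.57, 0.60, 0.63, 0.67, 0.69, 0.70,
                    0.70, 0.78, 0.78, 0.81, 1.05, 1.06, 1.30, 1.62, 1.74]
chloride = [6.6, 4.4, 11.8, 9.7, 14.3, 10.9, 10.8, 19.2, 10.6,
            12.1, 14.7, 17.3, 15.0, 27.4, 27.7, 23.1, 39.5, 31.8]

n = len(percent_roadways)
x_bar = sum(percent_roadways) / n
y_bar = sum(chloride) / n

# Corrected sums of squares and cross-products.
s_xx = sum((x - x_bar) ** 2 for x in percent_roadways)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(percent_roadways, chloride))

slope = s_xy / s_xx                 # estimate of beta_1
intercept = y_bar - slope * x_bar   # estimate of beta_0

print(f"n = {n}, x_bar = {x_bar:.4f}, y_bar = {y_bar:.2f}")
print(f"S_xx = {s_xx:.4f}, S_xy = {s_xy:.4f}")
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

Working at full precision, the slope can differ from the hand calculation in the fourth decimal place, because the hand calculation divides by the rounded value 3.0106.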

  • To get the residuals $e_i = y_i - \hat y_i$, we first need the fitted values (or predicted values), denoted as $\hat y_i$...

    $$\hat y_i = 0.4705 + 20.5675\,x_i$$

    Above is the fitted model or fitted line.

    Below, we add the residuals (RESI1) and fitted values (FITS1) to our data set...

  • $$\hat\sigma^2 = MSE = \frac{SS_E}{n-2} = \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n-2} = \frac{220.9472}{16} = 13.8092$$

    and $\hat\sigma = \sqrt{13.8092} = 3.7161$

    The fitted model: $\hat y_i = 0.4705 + 20.5675\,x_i$
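The fitted values, residuals and error-variance estimate can be computed from the data as follows. This is a sketch in plain Python; the column labels FITS1 and RESI1 follow the slide, but the printed values come from this code, not from the original software output.

```python
import math

percent_roadways = [0.15, 0.19, 0.47, 0.57, 0.60, 0.63, 0.67, 0.69, 0.70,
                    0.70, 0.78, 0.78, 0.81, 1.05, 1.06, 1.30, 1.62, 1.74]
chloride = [6.6, 4.4, 11.8, 9.7, 14.3, 10.9, 10.8, 19.2, 10.6,
            12.1, 14.7, 17.3, 15.0, 27.4, 27.7, 23.1, 39.5, 31.8]

n = len(percent_roadways)
x_bar = sum(percent_roadways) / n
y_bar = sum(chloride) / n
s_xx = sum((x - x_bar) ** 2 for x in percent_roadways)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(percent_roadways, chloride))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Fitted values y_hat_i and residuals e_i = y_i - y_hat_i.
fitted = [intercept + slope * x for x in percent_roadways]
residuals = [y - y_hat for y, y_hat in zip(chloride, fitted)]

sse = sum(e ** 2 for e in residuals)   # SS_E
mse = sse / (n - 2)                    # estimate of sigma^2
sigma_hat = math.sqrt(mse)

print("obs      x      y    FITS1    RESI1")
for i, (x, y, y_hat, e) in enumerate(zip(percent_roadways, chloride, fitted, residuals), 1):
    print(f"{i:3d}  {x:5.2f}  {y:5.1f}  {y_hat:7.3f}  {e:7.3f}")
print(f"SSE = {sse:.4f}, MSE = {mse:.4f}, sigma_hat = {sigma_hat:.4f}")
```

The residuals of a least squares fit with an intercept sum to zero (up to floating-point error), which is a quick sanity check on the calculation.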

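  • Interpretation of regression coefficients:

    For any SLR analysis, $\hat\beta_1$ is the estimated slope. It represents the expected change in Y for a 1 unit change in X.

    $$\hat\beta_1 = \frac{\text{rise}}{\text{run}} = \frac{\Delta Y}{\Delta X} = \frac{\hat\beta_1 \text{ units of } Y}{1 \text{ unit of } X}$$

    [Figure: fitted line on x-y axes, showing a rise of $\hat\beta_1$ for a run of 1]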

  • Interpretation of regression coefficients:

    slope: $\hat\beta_1 = 20.5675 = \frac{20.5675}{1} = \frac{\text{rise}}{\text{run}} = \frac{\Delta Y}{\Delta X}$

    A 1 percentage point increase in the amount of land in roadways is associated with an increase of 20.5675 mg/liter in the mean chloride concentration.

    [Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area in watershed (%) with the fitted line, showing a rise of 20.5675 for a run of 1]
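Because the fitted model is a straight line, the same slope gives the estimated change in mean chloride concentration for any step in X within the observed range. A worked example, assuming a hypothetical increase of 0.5 percentage points (a value chosen only for illustration):

```latex
\Delta \hat y = \hat\beta_1 \,\Delta x = 20.5675 \times 0.5 \approx 10.28 \ \text{mg/liter}
```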

  • Interpretation of regression coefficients:

    Intercept: $\hat\beta_0 = 0.4705$

    When 0% of the watershed is in roadways, the expected chloride concentration is 0.4705 mg/liter (see how this relates to the hypothesis test for $\beta_0$ in the next slides).

    [Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area in watershed (%), with the x-axis extended to 0]

  • • When providing regression COEFFICIENT INTERPRETATION, YOU MUST include the relevant units for X and Y, and put it in the context of the problem.

    MINITAB output from this example:

    Regression Analysis: ChlorConc vs PercRoadways

    Regression Equation

    ChlorConc = 0.47 + 20.6 PercRoadways

    Coefficients

    Term            Coef  SE Coef  T-Value  P-Value
    Constant       0.470     1.94     0.24    0.811
    PercRoadways  20.567     2.14     9.60    0.000

    Model Summary

    S = 3.71607   R-sq = 85.22%

    • Testing for a linear relationship between chloride concentration (Y) and % of watershed in roadways (X).

    $H_0: \beta_1 = 0$
    $H_1: \beta_1 \neq 0$

  • Slope estimate and standard error:

    $\hat\beta_1 = 20.567$, $se(\hat\beta_1) = \sqrt{\frac{13.8092}{3.0106}} = 2.1417$

    Test statistic:

    $$t_0 = \frac{\hat\beta_1 - 0}{se(\hat\beta_1)} = \frac{20.567}{2.1417} = 9.603$$

    Under H0 true, $T_0 \sim t_{16}$

    P-value: $2 \times P(T_0 > 9.603) = 4.81 \times 10^{-8}$ (very small)

    Reject H0.

    There IS statistically significant evidence that the slope is not 0, so there is evidence of a linear relationship between chloride concentration and % of watershed in roadways.

  • • Similarly, we can run a hypothesis test that the intercept equals 0...

    $H_0: \beta_0 = 0$
    $H_1: \beta_0 \neq 0$

    Estimates: $\hat\beta_0 = 0.4705$

    $$se(\hat\beta_0) = \sqrt{13.8092 \left( \frac{1}{18} + \frac{0.8061^2}{3.0106} \right)} = 1.9358$$

    Test statistic:

    $$t_0 = \frac{\hat\beta_0 - 0}{se(\hat\beta_0)} = \frac{0.4705}{1.9358} = 0.2431$$

    Under H0 true, $T_0 \sim t_{16}$

    P-value: $2 \times P(T_0 > 0.2431) = 0.8110$

  • Fail to reject H0.

    The intercept $\beta_0$ is not significantly different from zero: when there are no roadways in a watershed, there is no real evidence against the chloride concentration in the streams being zero.

    We do not have evidence to suggest the intercept is anything other than zero. (So a chloride concentration of 0 mg/liter is plausible for a watershed with no roadways; failing to reject H0 does not show that the intercept is exactly zero.)

    [Figure: scatterplot of chloride concentration (mg/liter) vs. roadway area in watershed (%), with the x-axis extended to 0]
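The two t-tests for the chloride example (slope and intercept) can be reproduced in plain Python. The sketch below is an illustration, not the method used by the original software: the helper names are assumptions, and because the standard library has no t-distribution, the two-sided p-value is approximated by numerically integrating the t density with Simpson's rule over a truncated tail.

```python
import math

percent_roadways = [0.15, 0.19, 0.47, 0.57, 0.60, 0.63, 0.67, 0.69, 0.70,
                    0.70, 0.78, 0.78, 0.81, 1.05, 1.06, 1.30, 1.62, 1.74]
chloride = [6.6, 4.4, 11.8, 9.7, 14.3, 10.9, 10.8, 19.2, 10.6,
            12.1, 14.7, 17.3, 15.0, 27.4, 27.7, 23.1, 39.5, 31.8]


def t_pdf(u, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(u * u / df))


def two_sided_p_value(t_stat, df, width=200.0, steps=20000):
    """Approximate 2 * P(T > |t_stat|) by Simpson's rule on [|t|, |t| + width].

    The tail beyond |t| + width is ignored (negligible for the df used here);
    `steps` must be even.
    """
    a = abs(t_stat)
    h = width / steps
    total = t_pdf(a, df) + t_pdf(a + width, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(a + k * h, df)
    return 2 * total * h / 3


n = len(percent_roadways)
x_bar = sum(percent_roadways) / n
y_bar = sum(chloride) / n
s_xx = sum((x - x_bar) ** 2 for x in percent_roadways)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(percent_roadways, chloride))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(percent_roadways, chloride)) / (n - 2)

# Estimated standard errors, t statistics and two-sided p-values on n - 2 df.
se_b1 = math.sqrt(mse / s_xx)
se_b0 = math.sqrt(mse * (1 / n + x_bar ** 2 / s_xx))
t_b1, t_b0 = b1 / se_b1, b0 / se_b0
p_b1 = two_sided_p_value(t_b1, n - 2)
p_b0 = two_sided_p_value(t_b0, n - 2)

print("Term        Coef   SE Coef  T-Value    P-Value")
print(f"Intercept {b0:7.3f} {se_b0:8.3f} {t_b0:8.2f} {p_b0:10.3g}")
print(f"Slope     {b1:7.3f} {se_b1:8.3f} {t_b1:8.2f} {p_b1:10.3g}")
```

The printed rows can be compared with the Coefficients table in the MINITAB output; a statistics package with a built-in t-distribution would normally replace the hand-rolled integration.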

  • Adequacy of the regression model and checking assumptions

    • Is a linear model the correct model? (Is simple linear regression complex enough to capture the relationship between X and Y?)

    • Are the assumptions we're making for our model reasonable, or are they violated?

    • To answer these questions, we will use the residuals of the model.

    The residual for observation i:

    $$e_i = y_i - \hat y_i$$

  • Residuals are informative

    Consider the Price vs. Age of clock data:

    [Figure: scatterplot of price sold at auction vs. age of clock (yrs), with points shaded by number of bidders]

    Use a plot of residuals vs. fitted values $\hat y$ to check the adequacy of the model AND the constant variance assumption.

  • • If this plot is a random scatter of points above and below the horizontal reference line, then the linear model is reasonable and adequate.

    • If not (i.e. if there is a non-random pattern in the residual plot), then there may be issues with our linearity assumption or perhaps other assumptions in our model, and the model may not be adequate.

  • • Example showing inadequacy: Kentucky Derby data set on year of race and speed of horse.

    The form of the scatterplot looks a bit nonlinear, but we'll go ahead and fit a straight line model first to get the following residual plot...

  • Residual Plot of 'residuals vs. fitted values'

    • Residuals have a bit of a pattern (e.g. below the line, above the line, below the line), not randomly scattered above and below the horizontal line.

    • Linear form may not be reasonable or adequate.

    ⇒ Quadratic may fit better.
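The "below, above, below" residual pattern is what a straight-line fit produces when the true relationship is curved. The sketch below uses synthetic, noise-free data (not the Kentucky Derby data, which is not reproduced in these slides); the curve y = 20x − x² and the helper name `fit_line` are assumptions for illustration.

```python
def fit_line(x, y):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


# Synthetic curved relationship (concave), observed without noise.
x = [float(i) for i in range(11)]
y = [20.0 * xi - xi ** 2 for xi in x]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Sign of each residual in order of x: runs of the same sign indicate a pattern.
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(f"fitted line: y_hat = {b0:.1f} + {b1:.1f} x")
print("residual signs:", signs)
```

Long runs of same-signed residuals along the fitted values, rather than a random mix of signs, are the numerical counterpart of the pattern seen in the residual plot.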

  • Beyond Adequacy

    • Besides checking that our model fits the general (linear) relationship between X and Y, we also need to consider the assumptions we made in our model.

    • The basic model

    $$Y_i = \underbrace{\beta_0 + \beta_1 x_i}_{\text{linear relationship}} + \underbrace{\varepsilon_i}_{\text{random error term}}$$

    with $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$

    – Constant variance of errors (only one $\sigma^2$ for all errors)

    – Normality of errors

    – Independence of errors

  • Constant Variance Assumption

    • We'll check this assumption by plotting the residuals vs. the fitted values (or vs. the explanatory variable in SLR).

    • Look for a constant 'spread' above and below the horizontal reference line.

    • NOTE: This same residual plot was also used to check linearity.

  • • Constant Variance and Adequacy are both checked with the same residual plot in SLR

    • Plot residuals vs. $\hat y$ (or in SLR, against x).

  • Normality Assumption

    • Use a normal probability plot of residuals to check normality of errors (see section 6-6 for examples of non-normal patterns).
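The coordinates of a normal probability plot can be computed without plotting software. The sketch below pairs the sorted residuals from the chloride example with standard normal quantiles; the plotting positions (i − 0.5)/n are one common convention and are an assumption here, since the slides do not say which convention their software uses.

```python
from statistics import NormalDist

percent_roadways = [0.15, 0.19, 0.47, 0.57, 0.60, 0.63, 0.67, 0.69, 0.70,
                    0.70, 0.78, 0.78, 0.81, 1.05, 1.06, 1.30, 1.62, 1.74]
chloride = [6.6, 4.4, 11.8, 9.7, 14.3, 10.9, 10.8, 19.2, 10.6,
            12.1, 14.7, 17.3, 15.0, 27.4, 27.7, 23.1, 39.5, 31.8]

n = len(percent_roadways)
x_bar = sum(percent_roadways) / n
y_bar = sum(chloride) / n
s_xx = sum((x - x_bar) ** 2 for x in percent_roadways)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(percent_roadways, chloride))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
residuals = [y - (b0 + b1 * x) for x, y in zip(percent_roadways, chloride)]

# Sorted residuals against standard normal quantiles at positions (i - 0.5) / n.
ordered_residuals = sorted(residuals)
standard_normal = NormalDist()  # mean 0, standard deviation 1
normal_scores = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

print("normal score   ordered residual")
for z, e in zip(normal_scores, ordered_residuals):
    print(f"{z:12.3f}   {e:16.3f}")
```

Plotting the ordered residuals against the normal scores should give points close to a straight line when the normality assumption is reasonable; strong curvature or isolated extreme points suggest otherwise.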

  • Independence Assumption

    • Verify that the observations are independent.

    • Check how the data was collected (talk to the researcher or client).

    • If data was collected over time, plot residuals against time to make sure there isn't a dependence (or trend) across time.

  • • Predictions and Extrapolation

    – We can use our fitted model to make predictions.

    – e.g. What is the expected longevity in days of a fruitfly with a thorax of length 0.80 mm?

    [Figure: scatterplot of Longevity vs. Thorax (ff.data) with the fitted line]

    $$\hat Y = -61.05 + 144.33\,x$$

  • Prediction:

    $$\hat Y_{x=0.80} = -61.05 + 144.33(0.80) = 54.414 \text{ days}$$

    – If we try to predict Y outside of the range of observed x-values, we are using the model to extrapolate (predict outside the range of the observed data).

    – You should be very careful when using extrapolation. In general it should be avoided, as we don't have a feel for what is going on outside the observed range.

    – Predicting $\hat Y$ for x = 1.50 mm (which is not a value near the observed x-values) would be an extrapolation in this fruitfly example.
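A prediction helper can guard against accidental extrapolation by checking the requested x against the observed range. In the sketch below, the coefficients are those of the fitted fruitfly model, while the range 0.65 to 0.95 mm is an assumption read roughly from the plot axis (the slides do not state the exact observed minimum and maximum); the function name `predict_longevity` is also hypothetical.

```python
INTERCEPT = -61.05   # fitted intercept from the fruitfly model
SLOPE = 144.33       # fitted slope (days per mm of thorax length)

# Assumed observed thorax range in mm, read approximately from the plot axis.
X_MIN, X_MAX = 0.65, 0.95


def predict_longevity(thorax_mm):
    """Return (predicted longevity in days, True if x is outside the observed range)."""
    y_hat = INTERCEPT + SLOPE * thorax_mm
    is_extrapolation = not (X_MIN <= thorax_mm <= X_MAX)
    return y_hat, is_extrapolation


for thorax in (0.80, 1.50):
    y_hat, extrapolated = predict_longevity(thorax)
    note = " (extrapolation: outside the observed x-range)" if extrapolated else ""
    print(f"thorax = {thorax:.2f} mm -> predicted longevity = {y_hat:.3f} days{note}")
```

Returning the flag alongside the prediction leaves the decision to the caller; a stricter design could refuse to predict outside the observed range altogether.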