
  • Chapter 11: SIMPLE LINEAR REGRESSION AND CORRELATION

    Part 1: Simple Linear Regression (SLR)

    Introduction

    Sections 11-1 and 11-2

    Abrasion Loss vs. Hardness

    Price of clock vs. Age of clock

    [Scatterplot: Price Sold at Auction vs. Age of Clock (yrs), with number of Bidders]

  • Regression is a method for studying the relationship between two or more quantitative variables

    Simple linear regression (SLR):
    One quantitative dependent variable

    - response variable
    - dependent variable
    - Y

    One quantitative independent variable

    - explanatory variable
    - predictor variable
    - X

    Multiple linear regression:
    One quantitative dependent variable
    Many quantitative independent variables

    You'll see this in STAT:3200/IE:3760 Applied Linear Regression, if you take it.

  • SLR Examples: predict salary from years of experience

    estimate effect of lead exposure on school testing performance

    predict force at which a metal alloy rod bends based on iron content

  • Example: Health data

    Variables:
    Percent of Obese Individuals
    Percent of Active Individuals

    Data from CDC. Units are regions of U.S. in 2014.

        Obs   PercentObesity   PercentActive
        1     29.7             55.3
        2     28.9             51.9
        3     35.9             41.2
        4     24.7             56.3
        5     21.3             60.4
        6     26.3             50.9
        ...

    [Scatterplot: Percent Obese vs. Percent Active]

  • A scatterplot or scatter diagram can give us a general idea of the relationship between obesity and activity...

    [Scatterplot: Percent Obese vs. Percent Active]

    The points are plotted as the pairs (xi, yi) for i = 1, ..., 25

    Inspection suggests a linear relationship between obesity and activity (i.e. a straight line would go through the bulk of the points, and the points would look randomly scattered around this line).
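    A scatterplot like this takes only a few lines of code. The sketch below is a minimal example using matplotlib and only the six CDC rows listed above (the full data set of 25 regions is not reproduced here).

```python
import matplotlib.pyplot as plt

# First six rows of the CDC health data shown above
percent_obese  = [29.7, 28.9, 35.9, 24.7, 21.3, 26.3]
percent_active = [55.3, 51.9, 41.2, 56.3, 60.4, 50.9]

# Plot the (x_i, y_i) pairs: activity on the x-axis, obesity on the y-axis
plt.scatter(percent_active, percent_obese)
plt.xlabel("Percent Active")
plt.ylabel("Percent Obese")
plt.title("Obesity vs. Activity")
plt.show()
```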

  • Simple Linear Regression: The model

    The basic model

    Yi = β0 + β1 xi + εi

    Yi is the observed response or dependent variable for observation i

    xi is the observed predictor, regressor, explanatory variable, independent variable, covariate

    εi is the error term

    εi are iid N(0, σ²)

    (iid means independently and identically distributed)
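    To make the model concrete, the sketch below simulates data from Yi = β0 + β1 xi + εi with εi iid N(0, σ²). The parameter values and x grid are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) parameter values
beta0, beta1, sigma = 2.0, 0.5, 1.0

# Fixed x values and iid N(0, sigma^2) errors
x = np.linspace(0, 10, 30)
eps = rng.normal(loc=0.0, scale=sigma, size=x.size)

# Each Y_i is its conditional mean beta0 + beta1*x_i plus a random error
y = beta0 + beta1 * x + eps
```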

  • So, E[Yi|xi] = β0 + β1 xi + 0 = β0 + β1 xi

    The conditional mean (i.e. the expected value of Yi given xi, or after conditioning on xi) is β0 + β1 xi (a point on the estimated line).

    Or, as another notation, E[Y|x] = μY|x

    The random scatter around the mean (i.e. around the line) follows a N(0, σ²) distribution.

  • Example: Consider the model that regresses Oxygen purity on Hydrocarbon level in a distillation process with...

    β0 = 75 and β1 = 15

    For each xi there is a different Oxygen purity mean (which is the center of a normal distribution of Oxygen purity values).

    Plugging in xi to (75 + 15xi) gives you the conditional mean at xi.

  • The conditional mean for x = 1:

    E[Y|x] = 75 + 15(1) = 90

    The conditional mean for x = 1.25:

    E[Y|x] = 75 + 15(1.25) = 93.75
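    These conditional means are quick to verify in code; a tiny sketch using the stated parameters β0 = 75 and β1 = 15:

```python
beta0, beta1 = 75, 15

def conditional_mean(x):
    # E[Y | x] = beta0 + beta1 * x
    return beta0 + beta1 * x

print(conditional_mean(1.0))   # 90.0
print(conditional_mean(1.25))  # 93.75
```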

  • These values that randomly scatter around a conditional mean are called errors.

    The random error of observation i is denoted as εi. The errors around a conditional mean are normally distributed, centered at 0, and have a variance of σ², or εi ~ N(0, σ²).

    Here, we assume all the conditional distributions of the errors are the same, so we're using a constant variance model.

    V[Yi|xi] = V(β0 + β1 xi + εi) = V(εi) = σ²

  • The model can also be written as:

    Yi|xi ~ N(β0 + β1 xi , σ²)

    The mean of Y given x is β0 + β1 x (known as the conditional mean)

    β0 + β1 xi is the mean value of all the Y's for the given value of xi

    The regression line itself represents all the conditional means.

    Not all the observed points will fall on the line; there is some random noise around the mean (we model this part with an error term).

    Usually, we will not know β0, β1, or σ², so we will estimate them from the data.

  • Some interpretation of parameters:

    β0 is the conditional mean when x = 0

    β1 is the slope, also stated as the change in the mean of Y per 1-unit change in x

    σ² is the variability of responses about the conditional mean

  • Simple Linear Regression: Assumptions

    Key assumptions

    a linear relationship exists between Y and x
    *we say the relationship between Y and x is linear if the means of the conditional distributions of Y|x lie on a straight line

    independent errors (this essentially equates to independent observations in the case of SLR)

    constant variance of errors

    normally distributed errors

  • Simple Linear Regression: Estimation

    We wish to use the sample data to estimate the population parameters: the slope β1 and the intercept β0

    Least squares estimation

    To choose the best-fitting line using least squares estimation, we minimize the sum of the squared vertical distances of each point to the fitted line.

  • We let hats denote predicted values or estimates of parameters, so we have:

    ŷi = β̂0 + β̂1 xi

    where ŷi is the estimated conditional mean for xi,
    β̂0 is the estimator for β0,
    and β̂1 is the estimator for β1

    We wish to choose β̂0 and β̂1 such that we minimize the sum of the squared vertical distances of each point to the fitted line, i.e. minimize

    Σ_{i=1}^n (yi − ŷi)²

    Or minimize the function g:

    g(β̂0, β̂1) = Σ_{i=1}^n (yi − ŷi)²
              = Σ_{i=1}^n (yi − (β̂0 + β̂1 xi))²
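    The least squares criterion can also be handed to a general-purpose numerical optimizer. The sketch below (with made-up x and y values) minimizes g directly using scipy.optimize.minimize; it is only a sanity check, not the usual way the estimates are computed.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up paired data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

def g(beta):
    # Sum of squared vertical distances from each point to the line beta0 + beta1*x
    beta0, beta1 = beta
    return np.sum((y - (beta0 + beta1 * x)) ** 2)

# Minimize g over (beta0, beta1), starting from an arbitrary guess
result = minimize(g, x0=[0.0, 0.0])
print(result.x)  # numerical least squares estimates (beta0_hat, beta1_hat)
```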

  • This vertical distance of a point from the fitted line is called a residual. The residual for observation i is denoted ei and

    ei = yi − ŷi

    So, in least squares estimation, we wish to minimize the sum of the squared residuals (or error sum of squares, SSE).

    To minimize

    g(β̂0, β̂1) = Σ_{i=1}^n (yi − (β̂0 + β̂1 xi))²

    we take the derivative of g with respect to β̂0 and β̂1, set equal to zero, and solve.

    ∂g/∂β̂0 = −2 Σ_{i=1}^n (yi − (β̂0 + β̂1 xi)) = 0

    ∂g/∂β̂1 = −2 Σ_{i=1}^n (yi − (β̂0 + β̂1 xi)) xi = 0

  • Simplifying the above gives:

    n β̂0 + β̂1 Σ_{i=1}^n xi = Σ_{i=1}^n yi

    β̂0 Σ_{i=1}^n xi + β̂1 Σ_{i=1}^n xi² = Σ_{i=1}^n yi xi

    And these two equations are known as the least squares normal equations.

    Solving the normal equations gets us our estimators β̂0 and β̂1...
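    The normal equations are just a 2×2 linear system in β̂0 and β̂1, so they can also be solved directly. A minimal sketch with made-up data:

```python
import numpy as np

# Made-up paired data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

# Normal equations in matrix form:
#   n*b0       + (sum x)*b1   = sum y
#   (sum x)*b0 + (sum x^2)*b1 = sum x*y
A = np.array([[n,       x.sum()],
              [x.sum(), (x ** 2).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])

b0_hat, b1_hat = np.linalg.solve(A, rhs)
print(b0_hat, b1_hat)
```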

  • Simple Linear Regression: Estimation

    Estimate of the slope:

    β̂1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)² = Sxy / Sxx

    Estimate of the Y-intercept:

    β̂0 = ȳ − β̂1 x̄

    the point (x̄, ȳ) will always be on the least squares line

    Alternative formulas for β̂0 and β̂1 are also given in the book.
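    These formulas translate directly into code. The helper below is a small sketch that returns β̂0 and β̂1 for any pair of equal-length numeric arrays.

```python
import numpy as np

def least_squares_fit(x, y):
    """Return (beta0_hat, beta1_hat) for simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))  # Sxy
    sxx = np.sum((x - x.mean()) ** 2)              # Sxx
    b1 = sxy / sxx                                 # slope estimate
    b0 = y.mean() - b1 * x.mean()                  # intercept estimate
    return b0, b1
```

    For example, least_squares_fit([1, 2, 3], [2.0, 2.9, 4.1]) returns the estimated intercept and slope of the fitted line for those three points.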

  • Example: Cigarette data (Nicotine vs. Tar content)

    [Scatterplot: Nicotine vs. Tar content]

    n = 25

    Least squares estimates from software:

    β̂0 = 0.1309 and β̂1 = 0.0610

    Summary statistics:
    Σ_{i=1}^n xi = 305.4    x̄ = 12.216
    Σ_{i=1}^n yi = 21.91    ȳ = 0.8764

  • Σ_{i=1}^n (yi − ȳ)(xi − x̄) = 47.01844
    Σ_{i=1}^n (xi − x̄)² = 770.4336
    Σ_{i=1}^n xi² = 4501.2
    Σ_{i=1}^n yi² = 22.2105

    Using the previous formulas and the summary statistics...

    β̂1 = Sxy / Sxx = 47.01844 / 770.4336 = 0.061029

    and

    β̂0 = ȳ − β̂1 x̄
        = 0.8764 − 0.061029(12.216)
        = 0.130870

    (Same estimates as software)
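    The hand computation above can be reproduced from the reported summary statistics alone; a short sketch:

```python
# Summary statistics reported for the cigarette data
sxy  = 47.01844   # sum of (xi - xbar)(yi - ybar)
sxx  = 770.4336   # sum of (xi - xbar)^2
xbar = 12.216
ybar = 0.8764

b1_hat = sxy / sxx             # 0.061029...
b0_hat = ybar - b1_hat * xbar  # 0.13087...
print(b1_hat, b0_hat)
```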

  • Simple Linear Regression: Estimating σ²

    One of the assumptions of simple linear regression is that the variance for each of the conditional distributions of Y|x is the same at all x-values (i.e. constant variance).

    In this case, it makes sense to pool all the observed error information (in the residuals) to come up with a common estimate for σ²

  • Recall the model:

    Yi = β0 + β1 xi + εi with εi iid N(0, σ²)

    We use the error sum of squares (SSE) to estimate σ²...

    σ̂² = SSE / (n − 2) = Σ_{i=1}^n (yi − ŷi)² / (n − 2) = MSE

    SSE = error sum of squares = Σ_{i=1}^n (yi − ŷi)²

    MSE is the mean squared error

    E[MSE] = E[σ̂²] = σ² (unbiased estimator)

    ⇒ σ̂ = √MSE
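    Given fitted values, σ̂² is just the residual sum of squares divided by n − 2. A sketch, assuming β̂0 and β̂1 have already been computed (for example with the least_squares_fit helper above):

```python
import numpy as np

def estimate_sigma2(x, y, b0_hat, b1_hat):
    """Return MSE = SSE / (n - 2), the estimate of sigma^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    y_hat = b0_hat + b1_hat * x        # fitted conditional means
    sse = np.sum((y - y_hat) ** 2)     # error (residual) sum of squares
    return sse / (len(y) - 2)          # n - 2: two mean parameters were estimated
```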

  • 2 is subtracted from n in the denominator because we've used 2 degrees of freedom for estimating the slope and intercept (i.e. there were 2 parameters estimated when modeling the conditional mean)

    When we estimated σ² in a single normal population, we divided Σ_{i=1}^n (yi − ȳ)² by (n − 1) because we only estimated 1 mean structure parameter, which was μ; now we're estimating two parameters for our mean structure, β0 and β1.