simple linear regression[1]

Author: rangothri-sreenivasa-subramanyam

Post on 14-Apr-2018


TRANSCRIPT

  • 7/30/2019 Simple Linear Regression[1]

    1/50

    1

    Linear Regression and Correlation Analysis

  • 7/30/2019 Simple Linear Regression[1]

    2/50

    2

    Chapter Goals

    To understand the methods for displaying and describing the relationship between two variables

    Methods for Studying Relationships

    Graphical
      Scatter plots
      Line plots
      3-D plots

    Models
      Linear regression
      Correlations
      Frequency tables

  • 7/30/2019 Simple Linear Regression[1]

    3/50

    3

    Two Quantitative Variables

    The response variable, also called the dependent variable, is the variable we want to predict, and is usually denoted by y.

    The explanatory variable, also called the independent variable, is the variable that attempts to explain the response, and is denoted by x.

  • 7/30/2019 Simple Linear Regression[1]

    4/50

    4

    Scatter Plots and Correlation

    A scatter plot (or scatter diagram) is used to show the relationship between two variables

    Correlation analysis is used to measure the strength of the association (linear relationship) between two variables

    Only concerned with the strength of the relationship

    No causal effect is implied

  • 7/30/2019 Simple Linear Regression[1]

    5/50

    5

    Example

    The following graph shows the scatter plot of Exam 1 score (x) and Exam 2 score (y) for 354 students in a class.

    Is there a relationship between x and y?

  • 7/30/2019 Simple Linear Regression[1]

    6/50

    6

    Scatter Plot Examples

    [Figure: scatter plots of y against x illustrating linear relationships and curvilinear relationships]

  • 7/30/2019 Simple Linear Regression[1]

    7/50

    7

    Scatter Plot Examples

    [Figure: scatter plots of y against x showing no relationship]

    (continued)

  • 7/30/2019 Simple Linear Regression[1]

    8/50

    8

    Correlation Coefficient

    The population correlation coefficient $\rho$ (rho) measures the strength of the association between the variables

    The sample correlation coefficient r is an estimate of $\rho$ and is used to measure the strength of the linear relationship in the sample observations

    (continued)

  • 7/30/2019 Simple Linear Regression[1]

    9/50

    9

    Features of $\rho$ and r

    Unit free

    Range between -1 and 1

    The closer to -1, the stronger the negative linear relationship

    The closer to 1, the stronger the positive linear relationship

    The closer to 0, the weaker the linear relationship
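    The features listed above can be checked numerically. The following is a minimal sketch, assuming the usual definition of the sample correlation coefficient; the function name `pearson_r` and the small data set are hypothetical and do not come from the slides.

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient r for paired observations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustrative data (not from the slides)
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

r = pearson_r(hours, score)

# Unit free: converting hours to minutes leaves r unchanged
r_minutes = pearson_r([60 * h for h in hours], score)

# Points lying exactly on a decreasing line give r = -1
r_neg = pearson_r(hours, [10 - 2 * h for h in hours])

print(round(r, 4), round(r_minutes, 4), round(r_neg, 4))
```

    Rescaling x changes neither the sign nor the size of r, which is what "unit free" means in practice.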

  • 7/30/2019 Simple Linear Regression[1]

    10/50

    10

    Examples of Approximate r Values

    [Figure: five scatter plots of y against x]

    Tag with appropriate value: -1, -.6, 0, +.3, 1

  • 7/30/2019 Simple Linear Regression[1]

    11/50

    11

    Earlier Example

    Correlations

                                    Exam 1    Exam 2
    Exam1   Pearson Correlation     1         .400**
            Sig. (2-tailed)                   .000
            N                       366       351
    Exam2   Pearson Correlation     .400**    1
            Sig. (2-tailed)         .000
            N                       351       356

    **. Correlation is significant at the 0.01 level.

  • 7/30/2019 Simple Linear Regression[1]

    12/50

    12

    Questions?

    What kind of relationship would you expect in the following situations:

    Age (in years) of a car, and its price.

    Number of calories consumed per day and weight.

    Height and IQ of a person.

  • 7/30/2019 Simple Linear Regression[1]

    13/50

    13

    Exercise

    Identify the two variables that vary and decide which should be the independent variable and which should be the dependent variable. Sketch a graph that you think best represents the relationship between the two variables.

    1. The size of a person's vocabulary over his or her lifetime.

    2. The distance from the ceiling to the tip of the minute hand of a clock hung on the wall.

  • 7/30/2019 Simple Linear Regression[1]

    14/50

    14

    Introduction to Regression Analysis

    Regression analysis is used to:

    Predict the value of a dependent variable based on the value of at least one independent variable.

    Explain the impact of changes in an independent variable on the dependent variable.

    Dependent variable: the variable we wish to explain.

    Independent variable: the variable used to explain the dependent variable.

  • 7/30/2019 Simple Linear Regression[1]

    15/50

    15

    Simple Linear Regression Model

    Only one independent variable, x.

    Relationship between x and y is described by a linear function.

    Changes in y are assumed to be caused by changes in x.

  • 7/30/2019 Simple Linear Regression[1]

    16/50

    16

    Types of Regression Models

    Positive Linear Relationship

    Negative Linear Relationship

    Relationship NOT Linear

    No Relationship

  • 7/30/2019 Simple Linear Regression[1]

    17/50

    17

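    Population Linear Regression

    The population regression model:

    $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$

    where
      $Y_i$ = dependent variable
      $\beta_0$ = population y-intercept
      $\beta_1$ = population slope coefficient
      $X_i$ = independent variable
      $\varepsilon_i$ = random error term, or residual

    Linear component: $\beta_0 + \beta_1 X_i$

    Random error component: $\varepsilon_i$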

  • 7/30/2019 Simple Linear Regression[1]

    18/50

    18

    Linear Regression Assumptions

    The error terms $\varepsilon_i$, $i = 1, 2, \ldots, n$, are independent and $\varepsilon_i \sim \text{Normal}(0, \sigma^2)$.

    The error terms $\varepsilon_i$, $i = 1, 2, \ldots, n$, have constant variance $\sigma^2$.

    The underlying relationship between the X variable and the Y variable is linear.

  • 7/30/2019 Simple Linear Regression[1]

    19/50

    19

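    Population Linear Regression (continued)

    [Figure: plot of Y against X showing the line $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ with slope $\beta_1$ and Y-intercept $\beta_0$. At a given $x_i$ the plot marks the observed value of $y_i$, the predicted value of $y_i$, and the random error $\varepsilon_i$ for this $x_i$ value.]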

  • 7/30/2019 Simple Linear Regression[1]

    20/50

    20

    Estimated Regression Model

    The sample regression line provides an estimate of the population regression function.

    $\hat{Y}_i = b_0 + b_1 X_i$

    where
      $\hat{Y}_i$ = estimated (or predicted) y value
      $b_0$ = estimate of the regression y-intercept
      $b_1$ = estimate of the regression slope
      $X_i$ = independent variable

    The individual random error terms $e_i$ have a mean of zero

    The Regression Function: $E(Y_i) = E(\beta_0 + \beta_1 X_i + \varepsilon_i) = \beta_0 + \beta_1 X_i$

  • 7/30/2019 Simple Linear Regression[1]

    21/50

    21

    Earlier Example

  • 7/30/2019 Simple Linear Regression[1]

    22/50

    22

    Residual

    A residual is the difference between the observed response $y_i$ and the predicted response $\hat{y}_i$. Thus, for each pair of observations $(x_i, y_i)$, the $i$th residual is $e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)$

    Least Squares Criterion

    b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals.

    $\sum e^2 = \sum (y - \hat{y})^2 = \sum (y - (b_0 + b_1 x))^2$

  • 7/30/2019 Simple Linear Regression[1]

    23/50

    23

    Interpretation of the Slope and the Intercept

    b0 is the estimated average value of y when the value of x is zero.

    b1 is the estimated change in the average value of y as a result of a one-unit change in x.

  • 7/30/2019 Simple Linear Regression[1]

    24/50

    24

    The Least Squares Equation

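    The formulas for b1 and b0 are:

    $b_1 = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$

    algebraic equivalent:

    $b_1 = \dfrac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$

    and

    $b_0 = \bar{y} - b_1 \bar{x}$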
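    The two forms of the slope formula can be compared directly in code. This is a minimal sketch on a small hypothetical data set (not from the slides); the function names are illustrative choices.

```python
def slope_deviation_form(x, y):
    """b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sum((xi - x_bar) ** 2 for xi in x)
    return num / den

def slope_computational_form(x, y):
    """b1 = (sum(xy) - sum(x)sum(y)/n) / (sum(x^2) - (sum(x))^2/n)."""
    n = len(x)
    num = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    den = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    return num / den

def intercept(x, y, b1):
    """b0 = y_bar - b1 * x_bar."""
    return sum(y) / len(y) - b1 * sum(x) / len(x)

# Hypothetical illustrative data (not from the slides)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b1_dev = slope_deviation_form(x, y)
b1_comp = slope_computational_form(x, y)
b0 = intercept(x, y, b1_dev)
print(b1_dev, b1_comp, b0)
```

    Both forms give the same slope; the second avoids computing the means first, which is convenient for hand calculation from column totals.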

  • 7/30/2019 Simple Linear Regression[1]

    25/50

    25

    Finding the Least Squares Equation

    The coefficients b0 and b1 will usually be found using computer software, such as Excel, Minitab, or SPSS.

    Other regression measures will also be computed as part of computer-based regression analysis.

  • 7/30/2019 Simple Linear Regression[1]

    26/50

    26

    Simple Linear Regression Example

    A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)

    A random sample of 10 houses is selected

    Dependent variable (y) = house price in $1000s

    Independent variable (x) = square feet

  • 7/30/2019 Simple Linear Regression[1]

    27/50

    27

    Sample Data for House Price Model

    House Price in $1000s (y)    Square Feet (x)
    245                          1400
    312                          1600
    279                          1700
    308                          1875
    199                          1100
    219                          1550
    405                          2350
    324                          2450
    319                          1425
    255                          1700

  • 7/30/2019 Simple Linear Regression[1]

    28/50

    28

    SPSS Output

    Model Summary

    Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
    1       .762(a)   .581       .528                41.33032

    a. Predictors: (Constant), Square Feet

    Coefficients(a)

                        Unstandardized Coefficients   Standardized Coefficients
    Model               B         Std. Error          Beta                        t       Sig.
    1   (Constant)      98.248    58.033                                          1.693   .129
        Square Feet     .110      .033                .762                        3.329   .010

    a. Dependent Variable: House Price

    The regression equation is:

    house price = 98.248 + 0.110 (square feet)
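    The coefficients in the SPSS output can be reproduced from the ten sample houses with the least squares formulas. This is a minimal sketch in plain Python; the helper name `fit_line` is an illustrative choice, not something from the slides.

```python
def fit_line(x, y):
    """Least squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Sample data from the house price slide
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # in $1000s

b0, b1 = fit_line(square_feet, price)
print(f"house price = {b0:.3f} + {b1:.3f} (square feet)")
```

    The fitted values round to the 98.248 and 0.110 reported by SPSS.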

  • 7/30/2019 Simple Linear Regression[1]

    29/50

    29

    Graphical Presentation

    House price model: scatter plot and regression line

    [Figure: scatter plot of house price ($1000s) against square feet with the fitted regression line]

    house price = 98.248 + 0.110 (square feet)

    Slope = 0.110

    Intercept = 98.248

  • 7/30/2019 Simple Linear Regression[1]

    30/50

    30

    Interpretation of the Intercept, b0

    house price = 98.248 + 0.110 (square feet)

    b0 is the estimated average value of Y when the value of X is zero (if x = 0 is in the range of observed x values)

    Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet

  • 7/30/2019 Simple Linear Regression[1]

    31/50

    31

    Interpretation of the Slope Coefficient, b1

    house price = 98.24833 + 0.10977 (square feet)

    b1 measures the estimated change in the average value of Y as a result of a one-unit change in X

    Here, b1 = .10977 tells us that the average value of a house increases by .10977($1000) = $109.77, on average, for each additional one square foot of size

  • 7/30/2019 Simple Linear Regression[1]

    32/50

    32

    Least Squares Regression Properties

    The sum of the residuals from the least squares regression line is 0 ($\sum (y - \hat{y}) = 0$).

    The sum of the squared residuals is a minimum (minimized $\sum (y - \hat{y})^2$).

    The simple regression line always passes through the mean of the y variable and the mean of the x variable.

    The least squares coefficients b0 and b1 are unbiased estimates of $\beta_0$ and $\beta_1$.
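    The first three properties can be checked numerically on the house price data. This is a minimal sketch; the helper `sse` and the perturbation sizes are illustrative assumptions, and unbiasedness is a sampling property that a single data set cannot demonstrate.

```python
# Sample data from the house price slide
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # in $1000s

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(price) / n
sxx = sum((x - x_bar) ** 2 for x in square_feet)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, price))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

def sse(a, b):
    """Sum of squared residuals for the candidate line y = a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(square_feet, price))

# Property 1: the residuals sum to zero (up to rounding error)
residual_sum = sum(y - (b0 + b1 * x) for x, y in zip(square_feet, price))

# Property 2: moving either coefficient away from the fit increases the SSE
sse_fit = sse(b0, b1)
sse_shifted = sse(b0 + 1.0, b1)
sse_tilted = sse(b0, b1 + 0.01)

# Property 3: the fitted line passes through (x_bar, y_bar)
y_at_x_bar = b0 + b1 * x_bar

print(residual_sum, sse_fit, sse_shifted, sse_tilted, y_at_x_bar, y_bar)
```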

  • 7/30/2019 Simple Linear Regression[1]

    33/50

    33

    Exercise

    The growth of children from early childhood through adolescence generally follows a linear pattern. Data on the heights of female Americans during childhood, from four to nine years old, were compiled and the least squares regression line was obtained as $\hat{y} = 32 + 2.4x$, where $\hat{y}$ is the predicted height in inches, and x is age in years.

    Interpret the value of the estimated slope b1 = 2.4.

    Would interpretation of the value of the estimated y-intercept, b0 = 32, make sense here?

    What would you predict the height to be for a female American at 8 years old?

    What would you predict the height to be for a female American at 25 years old? How does the quality of this answer compare to the previous question?

  • 7/30/2019 Simple Linear Regression[1]

    34/50

    34

    Coefficient of Determination, $R^2$

    The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable.

    The coefficient of determination is also called R-squared and is denoted as $R^2$

    $0 \le R^2 \le 1$

  • 7/30/2019 Simple Linear Regression[1]

    35/50

    35

    Coefficient of Determination, $R^2$ (continued)

    Note: In the single independent variable case, the coefficient of determination is

    $R^2 = r^2$

    where:
      $R^2$ = Coefficient of determination
      r = Simple correlation coefficient
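    The identity between the coefficient of determination and the squared correlation can be verified on the house price data. This is a minimal sketch, assuming the usual decomposition in which the coefficient of determination is the regression sum of squares divided by the total sum of squares; the variable names are illustrative.

```python
import math

# Sample data from the house price slide
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # in $1000s

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(price) / n
sxx = sum((x - x_bar) ** 2 for x in square_feet)
syy = sum((y - y_bar) ** 2 for y in price)  # total sum of squares
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, price))

b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Variation explained by the regression line
ssr = sum((b0 + b1 * x - y_bar) ** 2 for x in square_feet)
r_squared = ssr / syy

# Simple correlation coefficient
r = sxy / math.sqrt(sxx * syy)

print(round(r, 3), round(r_squared, 3), round(r ** 2, 3))
```

    For this sample about 58% of the variation in house price is explained by variation in square feet.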


  • 7/30/2019 Simple Linear Regression[1]

    36/50

    36

    Examples of Approximate $R^2$ Values

    [Figure: four scatter plots of y against x]

  • 7/30/2019 Simple Linear Regression[1]

    37/50

    37

    Examples of Approximate $R^2$ Values

    $R^2 = 0$

    No linear relationship between x and y:

    The value of y does not depend on x. (None of the variation in y is explained by variation in x)

    [Figure: scatter plot of y against x with $R^2 = 0$]

  • 7/30/2019 Simple Linear Regression[1]

    38/50

    38

    SPSS Output

    Model Summary

    Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
    1       .762(a)   .581       .528                41.33032

    a. Predictors: (Constant), Square Feet

    ANOVA(b)

    Model             Sum of Squares   df   Mean Square   F        Sig.
    1   Regression    18934.935        1    18934.935     11.085   .010(a)
        Residual      13665.565        8    1708.196
        Total         32600.500        9

    a. Predictors: (Constant), Square Feet
    b. Dependent Variable: House Price

    Coefficients(a)

                        Unstandardized Coefficients   Standardized Coefficients
    Model               B         Std. Error          Beta                        t       Sig.
    1   (Constant)      98.248    58.033                                          1.693   .129
        Square Feet     .110      .033                .762                        3.329   .010

    a. Dependent Variable: House Price

  • 7/30/2019 Simple Linear Regression[1]

    39/50

    39

    Standard Error of Estimate

    The standard deviation of the variation of observations around the regression line is called the standard error of estimate, $s$

    The standard error of the regression slope coefficient (b1) is given by $s_{b_1}$

  • 7/30/2019 Simple Linear Regression[1]

    40/50

    40

    SPSS Output

    Model Summary

    Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
    1       .762(a)   .581       .528                41.33032

    a. Predictors: (Constant), Square Feet

    Coefficients(a)

                        Unstandardized Coefficients   Standardized Coefficients
    Model               B         Std. Error          Beta                        t       Sig.
    1   (Constant)      98.248    58.033                                          1.693   .129
        Square Feet     .110      .033                .762                        3.329   .010

    a. Dependent Variable: House Price

    $s = 41.33032$

    $s_{b_1} = 0.03297$
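    Both standard errors can be recomputed from the sample data. The slides do not show the formulas, so this sketch assumes the standard textbook definitions: the standard error of estimate is the square root of SSE / (n - 2), and the standard error of the slope is that value divided by the square root of the sum of squared deviations of x.

```python
import math

# Sample data from the house price slide
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # in $1000s

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(price) / n
sxx = sum((x - x_bar) ** 2 for x in square_feet)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, price))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Sum of squared residuals around the fitted line
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(square_feet, price))

# Assumed textbook formulas (not printed on the slides)
s = math.sqrt(sse / (n - 2))   # standard error of estimate
s_b1 = s / math.sqrt(sxx)      # standard error of the slope

print(round(s, 5), round(s_b1, 5))
```

    The results agree with the "Std. Error of the Estimate" and the slope's "Std. Error" columns in the output above.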

  • 7/30/2019 Simple Linear Regression[1]

    41/50

    41

    Comparing Standard Errors

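    [Figure: four scatter plots of y against x. One pair contrasts the variation of observed y values from the regression line (small $s$ versus large $s$). The other pair contrasts the variation in the slope of regression lines from different possible samples (small $s_{b_1}$ versus large $s_{b_1}$).]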

  • 7/30/2019 Simple Linear Regression[1]

    42/50

    42

    Inference about the Slope: t-Test

    t-test for a population slope: Is there a linear relationship between X and Y?

    Null and alternate hypotheses

    H0: $\beta_1 = 0$ (no linear relationship)

    H1: $\beta_1 \neq 0$ (linear relationship does exist)

    Test statistic:

    $t = \dfrac{b_1 - \beta_1}{s_{b_1}}$

    Degrees of freedom: d.f. = n - 2

    where:
      $b_1$ = Sample regression slope coefficient
      $\beta_1$ = Hypothesized slope
      $s_{b_1}$ = Estimator of the standard error of the slope

  • 7/30/2019 Simple Linear Regression[1]

    43/50

    43

    Example: Inference about the Slope: t Test (continued)

    House Price in $1000s (y)    Square Feet (x)
    245                          1400
    312                          1600
    279                          1700
    308                          1875
    199                          1100
    219                          1550
    405                          2350
    324                          2450
    319                          1425
    255                          1700

    Estimated Regression Equation:

    house price = 98.25 + 0.1098 (sq.ft.)

    The slope of this model is 0.1098

    Does square footage of the house affect its sales price?

  • 7/30/2019 Simple Linear Regression[1]

    44/50

    44

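    Inferences about the Slope: t Test Example (continued)

    H0: $\beta_1 = 0$

    HA: $\beta_1 \neq 0$

    From Excel output:

                   Coefficients   Standard Error   t Stat    P-value
    Intercept      98.24833       58.03348         1.69296   0.12892
    Square Feet    0.10977        0.03297          3.32938   0.01039

    In the Square Feet row, the Coefficients column is $b_1$, the Standard Error column is $s_{b_1}$, and the t Stat column is t*.

    Test Statistic: t = 3.329

    d.f. = 10 - 2 = 8

    [Figure: t distribution with rejection regions of area $\alpha/2 = .025$ in each tail. Reject H0 below -2.3060 or above 2.3060, do not reject H0 in between. The test statistic 3.329 falls in the upper rejection region.]

    Decision: Reject H0

    Conclusion: There is sufficient evidence that square footage affects house price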
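    The test statistic and the decision can be reproduced from the raw data. This is a minimal sketch: the critical value 2.3060 is taken from the slide rather than computed, because the Python standard library has no t distribution, and the standard error formulas are the usual textbook ones, assumed here.

```python
import math

# Sample data from the house price slide
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # in $1000s

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(price) / n
sxx = sum((x - x_bar) ** 2 for x in square_feet)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, price))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(square_feet, price))
s_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)

beta1_null = 0.0                      # hypothesized slope under H0
t_stat = (b1 - beta1_null) / s_b1     # test statistic
df = n - 2                            # degrees of freedom
t_crit = 2.3060                       # two-tailed critical value from the slide

reject_h0 = abs(t_stat) > t_crit
print(round(t_stat, 3), df, reject_h0)
```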

  • 7/30/2019 Simple Linear Regression[1]

    45/50

    45

    Confidence Interval for the Slope

    Confidence Interval Estimate of the Slope:

    $b_1 \pm t_{\alpha/2}\, s_{b_1}$, with d.f. = n - 2

    Excel Printout for House Prices:

                   Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
    Intercept      98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
    Square Feet    0.10977        0.03297          3.32938   0.01039   0.03374     0.18580

    At the 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858)
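    The interval in the Excel printout follows from the slope estimate, its standard error, and the critical t value. This is a minimal sketch using the rounded values printed in the Excel output; the critical value 2.3060 for 8 degrees of freedom is the one quoted on the t-test slide.

```python
# Values taken from the Excel printout (Square Feet row)
b1 = 0.10977       # slope estimate
s_b1 = 0.03297     # standard error of the slope
t_crit = 2.3060    # t value for alpha/2 = .025 and d.f. = 8, from the slides

margin = t_crit * s_b1
lower = b1 - margin
upper = b1 + margin

# House price is in $1000s, so multiply by 1000 for dollars per square foot
lower_dollars = 1000 * lower
upper_dollars = 1000 * upper

print(round(lower, 4), round(upper, 4))
```

    Because the interval lies entirely above zero, it leads to the same conclusion as the two-tailed t test at the .05 level.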


  • 7/30/2019 Simple Linear Regression[1]

    46/50

    46

    Confidence Interval for the Slope

                   Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
    Intercept      98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
    Square Feet    0.10977        0.03297          3.32938   0.01039   0.03374     0.18580

    Since the units of the house price variable are $1000s, we are 95% confident that the average impact on sales price is between $33.70 and $185.80 per square foot of house size

    This 95% confidence interval does not include 0.

    Conclusion: There is a significant relationship between house price and square feet at the .05 level of significance

  • 7/30/2019 Simple Linear Regression[1]

    47/50

    47

    Residual Analysis

    Purposes

    Examine for linearity assumption

    Examine for constant variance for alllevels of x

    Evaluate normal distribution assumption

    Graphical Analysis of Residuals

    Can plot residuals vs. x

    Can create histogram of residuals to check for normality

  • 7/30/2019 Simple Linear Regression[1]

    48/50

    48

    Residual Analysis for Linearity

    [Figure: two cases, labelled "Not Linear" and "Linear". Each case shows a plot of y against x and the corresponding plot of residuals against x.]

  • 7/30/2019 Simple Linear Regression[1]

    49/50

    49

    Residual Analysis for Constant Variance

    [Figure: two cases, labelled "Non-constant variance" and "Constant variance". Each case shows a plot of y against x and the corresponding plot of residuals against x.]

  • 7/30/2019 Simple Linear Regression[1]

    50/50

    50

    House Price Model Residual Plot

    [Figure: residuals plotted against square feet]

    Example: Residual Output

    RESIDUAL OUTPUT

          Predicted House Price   Residuals
    1     251.92316               -6.923162
    2     273.87671               38.12329
    3     284.85348               -5.853484
    4     304.06284               3.937162
    5     218.99284               -19.99284
    6     268.38832               -49.38832
    7     356.20251               48.79749
    8     367.17929               -43.17929
    9     254.6674                64.33264
    10    284.85348               -29.85348
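    The residual output can be regenerated from the sample data and the fitted line. This is a minimal sketch in plain Python; it only prints the table, and plotting the residuals against square feet is left to whatever charting tool is at hand.

```python
# Sample data from the house price slide
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # in $1000s

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(price) / n
sxx = sum((x - x_bar) ** 2 for x in square_feet)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, price))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Predicted price and residual (observed minus predicted) for each house
predicted = [b0 + b1 * x for x in square_feet]
residuals = [y - y_hat for y, y_hat in zip(price, predicted)]

for i, (y_hat, e) in enumerate(zip(predicted, residuals), start=1):
    print(f"{i:2d}  {y_hat:10.5f}  {e:10.5f}")
```

    A residual plot with no visible curve or funnel shape is consistent with the linearity and constant variance assumptions, though ten points give only a rough check.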