Statistics: Linear Regression and Correlation Analysis


TRANSCRIPT

  • Slide 1

    Linear Regression and Correlation Analysis

    Prof Rushen Chahal

  • Slide 2

    Chapter Goals

    To understand methods for displaying and describing the relationship between two variables

    Methods for Studying Relationships:

    Graphical: scatter plots, line plots, 3-D plots

    Models: linear regression, correlation, frequency tables

  • Slide 3

    Two Quantitative Variables

    The response variable, also called the dependent variable, is the variable we want to predict, and is usually denoted by y.

    The explanatory variable, also called the independent variable, is the variable that attempts to explain the response, and is denoted by x.

  • Slide 4

    Scatter Plots and Correlation

    A scatter plot (or scatter diagram) is used to show the relationship between two variables.

    Correlation analysis is used to measure the strength of the association (linear relationship) between two variables.

    Only concerned with the strength of the relationship.

    No causal effect is implied

  • Slide 5

    Example

    The following graph shows the scatter plot of Exam 1 score (x) and Exam 2 score (y) for 354 students in a class.

    Is there a relationship between x and y?

  • Slide 6

    Scatter Plot Examples

    [Figure: four scatter plots of y versus x illustrating linear relationships and curvilinear relationships]

  • Slide 7

    Scatter Plot Examples

    (continued)

    [Figure: two scatter plots of y versus x illustrating no relationship]

  • Slide 8

    Correlation Coefficient

    The population correlation coefficient ρ (rho) measures the strength of the association between the variables.

    The sample correlation coefficient r is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations.

    (continued)

  • Slide 9

    Features of ρ and r

    Unit free

    Range between -1 and 1

    The closer to -1, the stronger the negative linear relationship

    The closer to 1, the stronger the positive linear relationship

    The closer to 0, the weaker the linear relationship
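
    The following is a minimal Python sketch (not part of the original slides) of how r is computed directly from its definition; the exam1/exam2 lists are made-up illustrative values, not the 354-student data set referenced earlier.

```python
# Minimal sketch: sample correlation coefficient r from its definition.
# The exam score lists below are made-up illustrative values.
from math import sqrt

def sample_correlation(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

exam1 = [55, 62, 70, 78, 85, 91]
exam2 = [58, 60, 72, 75, 88, 90]
print(round(sample_correlation(exam1, exam2), 3))  # close to +1: strong positive linear relationship
```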

  • Slide 10

    Examples of Approximate r Values

    [Figure: five scatter plots of y versus x with different strengths and directions of linear relationship]

    Tag with appropriate value: -1, -.6, 0, +.3, 1

  • Slide 11

    Earlier Example

    Correlations

    [SPSS output: Pearson correlations and Sig. (2-tailed) for Exam1 and Exam2; the correlation is flagged as significant]

  • Slide 12

    Questions?

    What kind of relationship would you expect in the following situations:

    Age (in years) of a car, and its price.

    Number of calories consumed per day and weight.

    Height and IQ of a person.

  • Slide 13

    Exercise

    Identify the two variables that vary and decide which should be the independent variable and which should be the dependent variable.

    Sketch a graph that you think best represents the relationship between the two variables.

    1. The size of a person's vocabulary over his or her lifetime.

    2. The distance from the ceiling to the tip of the minute hand of a clock hung on the wall.

  • Slide 14

    Introduction to Regression Analysis

    Regression analysis is used to:

    Predict the value of a dependent variable based on the value of at least one independent variable.

    Explain the impact of changes in an independent variable on the dependent variable.

    Dependent variable: the variable we wish to explain.

    Independent variable: the variable used to explain the dependent variable.

  • Slide 15

    Simple Linear Regression

    Model: only one independent variable, x.

    The relationship between x and y is described by a linear function.

    Changes in y are assumed to be caused by changes in x.

  • Slide 16

    Types of Regression Models

    Positive Linear Relationship

    Negative Linear Relationship

    Relationship NOT Linear

    No Relationship

  • Slide 17

    Population Linear Regression

    The population regression model:

    Yi = β0 + β1Xi + εi,   i = 1, 2, ..., n

    where Yi is the dependent variable, Xi is the independent variable, β0 is the population y-intercept, β1 is the population slope coefficient, and εi is the random error term (residual). β0 + β1Xi is the linear component and εi is the random error component.

  • Slide 18

    Linear Regression Assumptions

    The error terms εi, i = 1, 2, ..., n, are independent and εi ~ Normal(0, σ²).

    The error terms εi, i = 1, 2, ..., n, have constant variance σ².

    The underlying relationship between the X variable and the Y variable is linear.

  • Slide 19

    Population Linear Regression (continued)

    [Figure: scatter of Y versus X with the population regression line Yi = β0 + β1Xi + εi; the line has slope β1 and y-intercept β0, and for a given xi the random error εi is the gap between the observed value of yi and the predicted value of yi]

  • Slide 20

    Estimated Regression Model

    The sample regression line provides an estimate of the population regression function:

    Ŷi = b0 + b1Xi

    where Ŷi is the estimated (or predicted) y value, b0 is the estimate of the regression y-intercept, b1 is the estimate of the regression slope, and Xi is the independent variable.

    The individual random error terms ei have a mean of zero.

    The regression function: E(Yi) = E(β0 + β1Xi + εi) = β0 + β1Xi

  • Slide 21

    Earlier Example

  • Slide 22

    Residual

    A residual is the difference between the observed response yi and the predicted response ŷi. Thus, for each pair of observations (xi, yi), the i-th residual is

    ei = yi - ŷi = yi - (b0 + b1xi)

    Least Squares Criterion

    b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals:

    Σe² = Σ(y - ŷ)² = Σ(y - (b0 + b1x))²
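
    A short Python sketch (an illustration, not from the slides) of the residual definition and the quantity the least squares criterion minimizes; the lists x, y and the coefficients b0, b1 are assumed to be supplied by the caller.

```python
# Sketch: residuals e_i = y_i - (b0 + b1*x_i) and the sum of squared
# residuals that the least squares criterion minimizes over (b0, b1).
def residuals(x, y, b0, b1):
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def sum_squared_residuals(x, y, b0, b1):
    return sum(e ** 2 for e in residuals(x, y, b0, b1))
```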

  • Slide 23

    Interpretation of the Slope and the Intercept

    b0 is the estimated average value of y when the value of x is zero.

    b1 is the estimated change in the average value of y as a result of a one-unit change in x.

  • Slide 24

    The Least Squares Equation

    The formulas for b1 and b0 are:

    b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²

    algebraic equivalent:

    b1 = (Σxy - (Σx)(Σy)/n) / (Σx² - (Σx)²/n)

    and

    b0 = ȳ - b1x̄
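
    A small Python sketch (illustrative, not from the slides) of the deviation-from-the-mean formulas above:

```python
# Sketch: least squares estimates from the formulas above.
# b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2), b0 = y_bar - b1*x_bar
def least_squares(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    return b0, b1
```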

  • Slide 25

    Finding the Least Squares Equation

    The coefficients b0 and b1 will usually be found using computer software, such as Excel, Minitab, or SPSS.

    Other regression measures will also be computed as part of computer-based regression analysis.

  • Slide 26

    Simple Linear Regression Example

    A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet).

    A random sample of 10 houses is selected.

    Dependent variable (y) = house price in $1000s

    Independent variable (x) = square feet

  • Slide 27

    Sample Data for House Price Model

    House Price in $1000s (y)    Square Feet (x)
    245                          1400
    312                          1600
    279                          1700
    308                          1875
    199                          1100
    219                          1550
    405                          2350
    324                          2450
    319                          1425
    255                          1700
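
    As an illustration (not part of the original slides), fitting this sample with numpy's built-in least squares polynomial fit should reproduce, to rounding, the coefficients shown in the SPSS output on the next slide.

```python
# Sketch: fit the house price model to the sample above.
# numpy.polyfit with degree 1 returns [slope, intercept].
import numpy as np

square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price_1000s = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

b1, b0 = np.polyfit(square_feet, price_1000s, 1)
print(round(b0, 3), round(b1, 5))  # expected: roughly 98.248 and 0.10977
```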

  • Slide 28

    SPSS Output

    [SPSS output: Model Summary (R, R Square, Adjusted R Square, Std. Error of the Estimate) and Coefficients table (unstandardized and standardized coefficients for the Constant and Square Feet); predictors: (Constant), Square Feet; dependent variable: House Price]

    The regression equation is:

    house price = 98.248 + 0.110 (square feet)

  • Slide 29

    Graphical Presentation

    House price model: scatter plot and regression line

    [Figure: scatter plot of House Price ($1000s) versus Square Feet with the fitted regression line; slope = 0.110, intercept = 98.248]

    house price = 98.248 + 0.110 (square feet)

  • Slide 30

    Interpretation of the Intercept, b0

    b0 is the estimated average value of Y when the value of X is zero (if x = 0 is in the range of observed x values).

    Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet.

    house price = 98.248 + 0.110 (square feet)

  • Slide 31

    Interpretation of the Slope Coefficient, b1

    b1 measures the estimated change in the average value of Y as a result of a one-unit change in X.

    Here, b1 = 0.10977 tells us that the average value of a house increases by 0.10977 ($1000) = $109.77, on average, for each additional one square foot of size.

    house price = 98.24833 + 0.10977 (square feet)
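
    As a worked illustration (the 2,000 square foot figure is hypothetical, not from the slides), the fitted equation predicts

    $$\widehat{\text{house price}} = 98.24833 + 0.10977(2000) \approx 317.79 \text{ (\$1000s)},$$

    i.e. a predicted selling price of roughly $317,800 for a house of that size.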

  • Slide 32

    Least Squares Regression Properties

    The sum of the residuals from the least squares regression line is 0: Σ(y - ŷ) = 0.

    The sum of the squared residuals is a minimum: Σ(y - ŷ)² is minimized.

    The simple regression line always passes through the mean of the y variable and the mean of the x variable.

    The least squares coefficients b0 and b1 are unbiased estimates of β0 and β1.

  • Slide 33

    Exercise

    The growth of children from early childhood through adolescence generally follows a linear pattern. Data on the heights of female Americans during childhood, from four to nine years old, were compiled and the least squares regression line was obtained as ŷ = 32 + 2.4x, where ŷ is the predicted height in inches and x is age in years.

    Interpret the value of the estimated slope b1 = 2.4.

    Would interpretation of the value of the estimated y-intercept, b0 = 32, make sense here?

    What would you predict the height to be for a female American at 8 years old?

    What would you predict the height to be for a female American at 25 years old? How does the quality of this answer compare to the previous question?

  • Slide 34

    Coefficient of Determination, R²

    The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable.

    The coefficient of determination is also called R-squared and is denoted as R², where 0 ≤ R² ≤ 1.

  • Slide 35

    Coefficient of Determination, R² (continued)

    Note: In the single independent variable case, the coefficient of determination is

    R² = r²

    where R² = coefficient of determination and r = simple correlation coefficient.
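
    For reference, a sketch of the standard sums-of-squares form of this definition (standard textbook notation, not shown explicitly on the slides):

    $$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}, \qquad SST = \sum (y_i - \bar{y})^2, \quad SSE = \sum (y_i - \hat{y}_i)^2, \quad SSR = SST - SSE$$

    In the single-predictor case this reduces to R² = r², as stated above.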

  • Slide 36

    Examples of Approximate R² Values

    [Figure: four scatter plots of y versus x illustrating different approximate R² values]

  • Slide 37

    Examples of Approximate R² Values

    R² = 0

    No linear relationship between x and y: the value of Y does not depend on x. (None of the variation in y is explained by variation in x.)

    [Figure: scatter plot of y versus x with a horizontal regression line, R² = 0]

  • Slide 38

    SPSS Output

    [SPSS output: Model Summary (R, R Square, Adjusted R Square, Std. Error of the Estimate), ANOVA table (Regression, Residual, Total sums of squares), and Coefficients table for (Constant) and Square Feet; predictors: (Constant), Square Feet; dependent variable: House Price]

  • Slide 39

    Standard Error of Estimate

    The standard deviation of the variation of observations around the regression line is called the standard error of estimate, sε.

    The standard error of the regression slope coefficient (b1) is given by sb1.
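
    The slides name these quantities without giving formulas; the standard expressions, consistent with the SPSS values on the next slide, are

    $$s_{\varepsilon} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}, \qquad s_{b_1} = \frac{s_{\varepsilon}}{\sqrt{\sum (x_i - \bar{x})^2}}$$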

  • Slide 40

    SPSS Output

    [SPSS output: Model Summary and Coefficients table for (Constant) and Square Feet; dependent variable: House Price]

    sε = 41.33032

    sb1 = 0.03297

  • Slide 41

    Comparing Standard Errors

    [Figure: four scatter plots contrasting small versus large sε (variation of observed y values from the regression line) and small versus large sb1 (variation in the slope of regression lines from different possible samples)]

  • Slide 42

    Inference about the Slope: t-Test

    t-test for a population slope: is there a linear relationship between X and Y?

    Null and alternative hypotheses:

    H0: β1 = 0 (no linear relationship)

    H1: β1 ≠ 0 (linear relationship does exist)

    Test statistic:

    t = (b1 - β1) / sb1

    Degrees of freedom: d.f. = n - 2

    where b1 = sample regression slope coefficient, β1 = hypothesized slope, and sb1 = estimator of the standard error of the slope.
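
    A minimal Python sketch (not from the slides) of this test, plugging in the slope and standard error reported in the Excel output on the following slides; scipy's Student-t survival function supplies the two-sided p-value.

```python
# Sketch: t test for H0: beta1 = 0 in the house price model.
from scipy import stats

b1, s_b1, n = 0.10977, 0.03297, 10   # values from the Excel output that follows
t_stat = (b1 - 0) / s_b1             # hypothesized slope under H0 is 0
df = n - 2
p_value = 2 * stats.t.sf(abs(t_stat), df)
print(round(t_stat, 3), round(p_value, 5))  # roughly 3.329 and 0.0104
```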

  • Slide 43

    Example: Inference about the Slope: t-Test (continued)

    House Price in $1000s (y)    Square Feet (x)
    245                          1400
    312                          1600
    279                          1700
    308                          1875
    199                          1100
    219                          1550
    405                          2350
    324                          2450
    319                          1425
    255                          1700

    Estimated regression equation:

    house price = 98.25 + 0.1098 (sq. ft.)

    The slope of this model is 0.1098.

    Does square footage of the house affect its sales price?

  • Slide 44

    Inferences about the Slope: t-Test Example (continued)

    H0: β1 = 0
    HA: β1 ≠ 0

    From Excel output:

                  Coefficients   Standard Error   t Stat    P-value
    Intercept     98.24833       58.03348         1.69296   0.12892
    Square Feet   0.10977        0.03297          3.32938   0.01039

    Test statistic: t = 3.329, with d.f. = 10 - 2 = 8.

    With α/2 = .025 in each tail, the rejection regions lie below -2.3060 and above 2.3060; t = 3.329 falls in the upper rejection region.

    Decision: Reject H0.

    Conclusion: There is sufficient evidence that square footage affects house price.

  • Slide 45

    Confidence Interval for the Slope

    Confidence interval estimate of the slope:

    b1 ± t(α/2) sb1,   d.f. = n - 2

    Excel printout for house prices:

                  Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
    Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
    Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580

    At the 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858).
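
    Worked out with the tabled values above (and t = 2.3060 with 8 d.f. from the previous slide):

    $$b_1 \pm t_{\alpha/2}\, s_{b_1} = 0.10977 \pm 2.3060(0.03297) = 0.10977 \pm 0.07603 \;\Rightarrow\; (0.03374,\ 0.18580)$$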

  • Slide 46

    Confidence Interval for the Slope

    Since the units of the house price variable are $1000s, we are 95% confident that the average impact on sales price is between $33.70 and $185.80 per square foot of house size.

                  Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
    Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
    Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580

    This 95% confidence interval does not include 0.

    Conclusion: There is a significant relationship between house price and square feet at the .05 level of significance.

  • Slide 47

    Residual Analysis

    Purposes:

    Examine for linearity assumption

    Examine for constant variance for all levels of x

    Evaluate normal distribution assumption

    Graphical Analysis of Residuals:

    Can plot residuals vs. x

    Can create histogram of residuals to check for normality (see the sketch below)
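
    A brief Python sketch (illustrative, not part of the slides) of both graphical checks, assuming the data x, y and fitted coefficients b0, b1 are available:

```python
# Sketch: residuals vs. x and a histogram of residuals.
import matplotlib.pyplot as plt

def residual_plots(x, y, b0, b1):
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(x, resid)            # curvature or fanning suggests a violated assumption
    ax1.axhline(0, linestyle="--")
    ax1.set_xlabel("x")
    ax1.set_ylabel("residuals")

    ax2.hist(resid, bins=10)         # rough check of the normality assumption
    ax2.set_xlabel("residuals")
    plt.show()
```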

  • Slide 48

    Residual Analysis for Linearity

    [Figure: paired plots of y vs. x and residuals vs. x; a curved pattern in the residuals indicates a relationship that is not linear, while residuals scattered randomly around zero indicate a linear relationship]

  • Slide 49

    Residual Analysis for Constant Variance

    [Figure: paired plots of y vs. x and residuals vs. x illustrating non-constant variance versus constant variance]

  • Slide 50

    House Price Model Residual Plot

    [Figure: residuals versus Square Feet for the house price model]

    Example: Residual Output

    RESIDUAL OUTPUT

    Observation   Predicted House Price   Residuals
    1             251.92316               -6.923162
    2             273.87671               38.12329
    3             284.85348               -5.853484
    4             304.06284               3.937162
    5             218.99284               -19.99284
    6             268.38832               -49.38832
    7             356.20251               48.79749
    8             367.17929               -43.17929
    9             254.6674                64.33264
    10            284.85348               -29.85348