l10-Edu5950 Simple Regression Analysis



EDU5950
SEM2 2010-11

CORRELATION & SIMPLE REGRESSION

Correlation - Test of association

• A correlation measures the “degree of association” between two variables (interval or ordinal)
• Associations can be positive (an increase in one variable is associated with an increase in the other) or negative (an increase in one variable is associated with a decrease in the other)
• Correlation is measured in “r” (parametric, Pearson’s) or “ρ” (non-parametric, Spearman’s)
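
As a quick illustration of the two coefficients (not part of the original slides), the sketch below computes Pearson’s r and Spearman’s ρ in Python with SciPy; the variable names and data values are made up for the example.

```python
# Minimal sketch: Pearson's r vs Spearman's rho on made-up paired scores.
# Assumes SciPy is available; the data values are hypothetical.
from scipy.stats import pearsonr, spearmanr

attitude = [2, 4, 5, 7, 8, 10, 11, 13]            # interval-scale variable
behaviour_freq = [1, 3, 4, 4, 6, 9, 8, 12]        # tends to increase with attitude

r, r_p = pearsonr(attitude, behaviour_freq)        # parametric: Pearson's r
rho, rho_p = spearmanr(attitude, behaviour_freq)   # non-parametric: Spearman's rho

print(f"Pearson r    = {r:.3f} (p = {r_p:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
```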


Test of association - Correlation

• Compare two continuous variables in terms of degree of association
  - e.g. attitude scale vs behavioural frequency

[Figure: two scatterplots illustrating a positive association and a negative association]

Test of association - Correlation

• Test statistic is “r” (parametric) or “ρ” (non-parametric)
  - 0 (random distribution, zero correlation)
  - 1 (perfect correlation)

[Figure: two scatterplots illustrating high and low correlation]


Test of association - Correlation

• Test statistic is “r” (parametric) or “ρ” (non-parametric)
  - 0 (random distribution, zero correlation)
  - 1 (perfect correlation)

[Figure: two scatterplots illustrating high and zero correlation]

Regression & Correlation

• A correlation measures the “degree of association” between two variables (interval (50, 100, 150, …) or ordinal (1, 2, 3, …))
• Associations can be positive (an increase in one variable is associated with an increase in the other) or negative (an increase in one variable is associated with a decrease in the other)


Example: Symptom Index vs Drug A

• “Best fit line”
• Allows us to describe the relationship between variables more accurately.
• We can now predict specific values of one variable from knowledge of the other
• All points are close to the line

[Graph Three: Relationship between Symptom Index and Drug A (with best-fit line); x-axis: Drug A (dose in mg), y-axis: Symptom Index]

Example: Symptom Index vs Drug B

• We can still predict specific values of one variable from knowledge of the other
• Will predictions be as accurate?
• Why not?
• “Residuals”

[Graph Four: Relationship between Symptom Index and Drug B (with best-fit line); x-axis: Drug B (dose in mg), y-axis: Symptom Index]


Correlation examples

Regression

• Regression analysis procedures have as their primary purpose the development of an equation that can be used for predicting values on some DV for all members of a population.
• A secondary purpose is to use regression analysis as a means of explaining causal relationships among variables.


• The most basic application of regression analysis is the bivariate situation, referred to as simple linear regression, or just simple regression.
• Simple regression involves a single IV and a single DV.
• Goal: to obtain a linear equation so that we can predict the value of the DV if we have the value of the IV.
• Simple regression capitalizes on the correlation between the DV and IV in order to make specific predictions about the DV.
• The correlation tells us how much information about the DV is contained in the IV.
• If the correlation is perfect (i.e. r = ±1.00), the IV contains everything we need to know about the DV, and we will be able to perfectly predict one from the other.
• Regression analysis is the means by which we determine the best-fitting line, called the regression line.
• The regression line is the straight line that lies closest to all points in a given scatterplot.
• This line sometimes passes through the centroid of the scatterplot.


• 3 important facts about the regression line must be known:
  - the extent to which points are scattered around the line
  - the slope of the regression line
  - the point at which the line crosses the Y-axis
• The extent to which the points are scattered around the line is typically indicated by the degree of relationship between the IV (X) and DV (Y). This relationship is measured by a correlation coefficient: the stronger the relationship, the higher the degree of predictability between X and Y.
• The degree of slope is determined by the amount of change in Y that accompanies a unit change in X.
• It is the slope that largely determines the predicted values of Y from known values for X.
• It is important to determine exactly where the regression line crosses the Y-axis (this value is known as the Y-intercept).


• The regression line is essentially an equation that expresses Y as a function of X.
• The basic equation for simple regression is:
  Ŷ = a + bX
  where Ŷ is the predicted value for the DV,
  X is the known raw-score value on the IV,
  b is the slope of the regression line,
  a is the Y-intercept.

Simple Linear Regression

• Purpose
  - determine the relationship between two metric variables
  - predict the value of the dependent variable (Y) based on the value of the independent variable (X)
• Requirements:
  - DV: interval / ratio
  - IV: interval / ratio
  - the independent and dependent variables are normally distributed in the population
  - the cases represent a random sample from the population

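
A minimal sketch of this setup in Python, using NumPy’s least-squares polynomial fit; the dose/symptom numbers are invented and are not the data behind the slides’ graphs.

```python
# Minimal sketch: fit Y-hat = a + bX by least squares and predict a new value.
# The dose/symptom values below are invented for illustration.
import numpy as np

dose = np.array([50, 75, 100, 125, 150, 175, 200, 225])   # IV (X), interval/ratio
symptom = np.array([42, 55, 70, 78, 95, 104, 121, 133])   # DV (Y), interval/ratio

b, a = np.polyfit(dose, symptom, deg=1)   # slope b and intercept a of the best-fit line
print(f"Y-hat = {a:.2f} + {b:.3f} X")

new_dose = 160                            # predict the DV for a new IV value
print(f"Predicted symptom index at {new_dose} mg: {a + b * new_dose:.1f}")
```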

Simple Regression: How best to summarise the data?

[Figure: scatterplots of Symptom Index against Drug A (dose in mg), without and with a best-fit line]

• Adding a best-fit line allows us to describe data simply
• Establish the equation for the best-fit line:
  Y = a + bX
  where:
    a = Y-intercept (constant)
    b = slope of best-fit line
    Y = dependent variable
    X = independent variable

General Linear Model (GLM): How best to summarise the data?

[Figure: scatterplot with best-fit line]


Simple Regression: R² - “Goodness of fit”

• For simple regression, R² is the square of the correlation coefficient
• Reflects the variance accounted for in the data by the best-fit line
• Takes values between 0 (0%) and 1 (100%)
• Frequently expressed as a percentage, rather than a decimal
• High values show good fit, low values show poor fit

Simple Regression: Low values of R²

• R² = 0 (0%: randomly scattered points, no apparent relationship between X and Y)
• Implies that a best-fit line will be a very poor description of the data

[Figure: scatterplot of the DV against the IV (regressor, predictor) with no apparent relationship]
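
Because R² in simple regression is just r squared, a short sketch (hypothetical data) can show that the squared correlation and the “variance accounted for” by the best-fit line coincide:

```python
# Minimal sketch: R-squared from the fitted line equals the squared Pearson r.
# Hypothetical data; NumPy only.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

r = np.corrcoef(x, y)[0, 1]            # Pearson correlation coefficient
b, a = np.polyfit(x, y, deg=1)         # best-fit line
y_hat = a + b * x

ss_res = np.sum((y - y_hat) ** 2)      # variation left around the line
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in Y
r_squared = 1 - ss_res / ss_tot        # proportion of variance accounted for

print(f"r = {r:.4f}, r^2 = {r**2:.4f}, R^2 from fit = {r_squared:.4f}")
```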


Simple Regression: High values of R²

• R² = 1 (100%: points lie directly on the line, a perfect relationship between X and Y)
• Implies that a best-fit line will be a very good description of the data

[Figure: two scatterplots of the DV against the IV with points lying on or very close to the best-fit line]

Simple Regression: R² - “Goodness of fit”

[Figure: Symptom Index vs Drug A (dose in mg) and Symptom Index vs Drug B (dose in mg), each with a best-fit line]

• Good fit: R² high, high variance explained
• Moderate fit: R² lower, less variance explained


Problem: to draw a straight line through the points that best explains the variance

[Figure: scatterplot with a candidate straight line drawn through the points]

• The line can then be used to predict Y from X

Example: Symptom Index vs Drug A

• “Best fit line”
• Allows us to describe the relationship between variables more accurately.
• We can now predict specific values of one variable from knowledge of the other
• All points are close to the line

[Graph Three: Relationship between Symptom Index and Drug A (with best-fit line); x-axis: Drug A (dose in mg), y-axis: Symptom Index]


Regression

• Establish the equation for the best-fit line:
  Y = a + bX
• The best-fit line is the same as the regression line
• b is the regression coefficient for x
• x is the predictor or regressor variable for y

Regression - Types


Linear Regression - Model

Population model:
  Yᵢ = β₀ + β₁Xᵢ + εᵢ
  (β₀ and β₁ are the regression coefficients; β₀ is the constant)

Sample (estimated) model:
  Ŷ = a + bX

Parameters

• The population parameters β₀ and β₁ are simply the least-squares estimates computed on all the members of the population, not just the sample
• Population parameters: β₀ and β₁
• Sample statistics: a and b
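
To make the population/sample distinction concrete, this sketch simulates data from a known model Yᵢ = β₀ + β₁Xᵢ + εᵢ and recovers the sample statistics a and b; the chosen β₀, β₁ and noise level are arbitrary, not values from the course.

```python
# Minimal sketch: generate data from Y_i = beta0 + beta1*X_i + eps_i,
# then estimate the sample statistics a and b. Parameter values are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1 = 18.0, 8.0                  # population parameters (illustrative)

x = rng.uniform(5, 10, size=200)          # sampled IV values
eps = rng.normal(0.0, 4.0, size=200)      # random error term
y = beta0 + beta1 * x + eps               # population model

b, a = np.polyfit(x, y, deg=1)            # sample statistics b, a
print(f"a = {a:.2f} (estimate of beta0), b = {b:.2f} (estimate of beta1)")
```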


Inference About the Population Slope and Intercept

• If β₁ ≠ 0, then we have a graph like this:

[Figure: plot of the regression line Y = β₀ + β₁X against X]

• β₀ + β₁X is the mean of Y for those whose independent variable is X



Inference About the Population Slope and Intercept

• If β₁ = 0, then we have a graph like this:

[Figure: plot of Y = β₀ + β₁X against X with a flat (zero-slope) line; note how the mean of Y does not depend on X: Y and X are independent]

Linear Regression and Correlation

• If β₁ = 0, then Y and X are independent
• So, we can test the null hypothesis that Y and X are independent by testing H₀: β₁ = 0
• The p-value in regression tables tests this hypothesis
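
A hedged sketch of this slope test using scipy.stats.linregress, which returns the two-sided p-value for H₀: β₁ = 0 (the paired data here are invented):

```python
# Minimal sketch: test H0: beta1 = 0 with scipy.stats.linregress.
# The paired data are invented for illustration.
from scipy.stats import linregress

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.3, 2.9, 4.1, 4.8, 6.2, 6.9, 8.1, 8.8]

result = linregress(x, y)
print(f"slope b = {result.slope:.3f}, intercept a = {result.intercept:.3f}")
print(f"p-value for H0: slope = 0 -> {result.pvalue:.4g}")   # small p: reject independence
```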

     


Ice Cream Example

X (Temperature)   Y (Sales)
63                1.52
70                1.68
73                1.8
75                2.05
80                2.36
82                2.25
85                2.68
88                2.9
90                3.14
91                3.06
92                3.24
75                1.92
98                3.4
100               3.28
92                3.17
87                2.83
84                2.58
88                2.86
80                2.26
82                2.14
76                1.98

[Figure: scatterplot of Ice Cream Sales against Temperature, with the simple regression line Ŷ = a + bX]

TWO STEPS TO SIMPLE LINEAR REGRESSION

Descriptive:
• Regression equation: Ŷ = a + bX
• Correlation coefficient (r)
• Coefficient of Determination (r²)

Inferential - Hypothesis Tests:
1. Regression Model
2. Slope
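
As a sketch of the descriptive step, the temperature/sales pairs from the table above can be run through a least-squares fit in Python (one possible tool, not the procedure used in the course):

```python
# Minimal sketch: regression equation, r and r^2 for the Ice Cream Example above.
import numpy as np

temperature = np.array([63, 70, 73, 75, 80, 82, 85, 88, 90, 91, 92,
                        75, 98, 100, 92, 87, 84, 88, 80, 82, 76])
sales = np.array([1.52, 1.68, 1.8, 2.05, 2.36, 2.25, 2.68, 2.9, 3.14, 3.06, 3.24,
                  1.92, 3.4, 3.28, 3.17, 2.83, 2.58, 2.86, 2.26, 2.14, 1.98])

b, a = np.polyfit(temperature, sales, deg=1)   # regression equation Y-hat = a + bX
r = np.corrcoef(temperature, sales)[0, 1]      # correlation coefficient
print(f"Y-hat = {a:.3f} + {b:.4f} X,  r = {r:.3f},  r^2 = {r**2:.3f}")
```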


First Step: Descriptive

Derive the Regression / Prediction equation

• Calculate a and b
  a = ȳ − b x̄
  Ŷ = a + bX

Example 1:

Data were collected from a randomly selected sample to determine the relationship between average assignment scores and test scores in statistics. The distribution of the data is presented in the table below.

1. Calculate the coefficient of determination and the correlation coefficient.
2. Determine the prediction equation.
3. Test the hypothesis for the slope at the 0.05 level of significance.

Data set:

ID   Assign   Test
1    8.5      88
2    6        66
3    9        94
4    10       98
5    8        87
6    7        72
7    5        45
8    6        63
9    7.5      85
10   5        77


1. Derive the Regression / Prediction equation

ID   X     Y
1    8.5   88
2    6     66
3    9     94
4    10    98
5    8     87
6    7     72
7    5     45
8    6     63
9    7.5   85
10   5     77

Summary statistics:
  n = 10
  ΣX = 72
  ΣY = 775
  ΣX² = 544.5
  ΣY² = 62,441
  ΣXY = 5,795.5

b = (ΣXY − ΣXΣY/n) / (ΣX² − (ΣX)²/n) = (5,795.5 − (72)(775)/10) / (544.5 − (72)²/10) = 215.5 / 26.1 = 8.257

a = ȳ − b x̄ = 77.5 − 8.257(7.2) = 18.050

Prediction equation:
  Ŷ = 18.05 + 8.257X

Interpretation of the regression equation

Ŷ = 18.05 + 8.257X

For every 1 unit change in X, Y will change by 8.257 units.

[Figure: regression line with slope triangle ΔX, ΔY illustrating the slope 8.257 and the Y-intercept 18.05]
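
The same numbers can be reproduced from the raw scores with the computational formulas above (a sketch for checking the arithmetic):

```python
# Minimal sketch: reproduce b = 215.5 / 26.1 = 8.257 and a = 18.05 for Example 1.
x = [8.5, 6, 9, 10, 8, 7, 5, 6, 7.5, 5]       # assignment scores (X)
y = [88, 66, 94, 98, 87, 72, 45, 63, 85, 77]  # test scores (Y)
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

sp = sum_xy - sum_x * sum_y / n     # 5,795.5 - 72*775/10 = 215.5
ss_x = sum_x2 - sum_x ** 2 / n      # 544.5 - 72^2/10    = 26.1
b = sp / ss_x                       # 8.257
a = sum_y / n - b * sum_x / n       # 77.5 - 8.257*7.2   = 18.05
print(f"b = {b:.3f}, a = {a:.3f}, prediction equation: Y-hat = {a:.2f} + {b:.3f} X")
```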


Example 2: MARITAL SATISFACTION

Parents: X   Children: Y
1            3
3            2
7            6
9            7
8            8
4            6
5            3

Summary statistics needed: mean of X, mean of Y, number of pairs, ΣX, ΣY, ΣX², ΣY², standard deviation of X, standard deviation of Y, ΣXY

1. Derive the Regression / Prediction equation

a = ȳ − b x̄
  = 5.00 − .65(5.29)
  = 1.56

Prediction equation:
  Ŷ = 1.56 + .65X


Interpretation of the regression equation

Ŷ = 1.56 + .65X

For every 1 unit change in X, Y will change by .65 units.

[Figure: regression line with slope triangle ΔX, ΔY illustrating the slope 0.65 and the Y-intercept]
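
As a check on Example 2 (a sketch only), applying the same least-squares formulas to the parent/child scores gives a slope of about .65 and an intercept of about 1.6; small differences from the figures above come from rounding:

```python
# Minimal sketch: slope and intercept for Example 2 from the raw scores.
x = [1, 3, 7, 9, 8, 4, 5]   # parents' marital satisfaction (X)
y = [3, 2, 6, 7, 8, 6, 3]   # children's marital satisfaction (Y)
n = len(x)

x_bar = sum(x) / n          # 5.29
y_bar = sum(y) / n          # 5.00
sp = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_x = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n

b = sp / ss_x               # about 0.65
a = y_bar - b * x_bar       # about 1.6
print(f"b = {b:.3f}, a = {a:.3f}")
```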

Descriptive Statistics

                    Mean     Std. Deviation   N
Grade - PMR MATH    2.53     1.468            62
TEACHER_FACTOR      3.9643   .91443           62

Correlations

                                        Grade - PMR MATH   TEACHER_FACTOR
Pearson Correlation   Grade - PMR MATH  1.000              .571
                      TEACHER_FACTOR    .571               1.000
Sig. (1-tailed)       Grade - PMR MATH  .                  .000
                      TEACHER_FACTOR    .000               .
N                     Grade - PMR MATH  62                 62
                      TEACHER_FACTOR    62                 62

Model Summary(b)

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .571(a)   .326       .315                1.215

a. Predictors: (Constant), TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH


ANOVA(b)

Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  42.848           1    42.848        29.021   .000(a)
   Residual    88.588           60   1.476
   Total       131.435          61

a. Predictors: (Constant), TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH

Coefficients(a)

                    Unstandardized Coefficients   Standardized Coefficients
Model               B         Std. Error          Beta                        t        Sig.
1  (Constant)       -1.101    .692                                            -1.591   .117
   TEACHER_FACTOR   .917      .170                .571                        5.387    .000

a. Dependent Variable: Grade - PMR MATH
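
For a single-predictor model, the standardized coefficient (Beta) can be recovered from the unstandardized slope and the two standard deviations reported in the descriptives above; a quick arithmetic sketch using those reported values:

```python
# Minimal sketch: Beta = b * SD(X) / SD(Y), using the values reported in the output above.
b = 0.917        # unstandardized coefficient for TEACHER_FACTOR
sd_x = 0.91443   # Std. Deviation of TEACHER_FACTOR
sd_y = 1.468     # Std. Deviation of Grade - PMR MATH

beta = b * sd_x / sd_y
print(f"Beta = {beta:.3f}")   # about .571, matching the Coefficients table
```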

Descriptive Statistics

                    Mean     Std. Deviation   N
Grade - PMR MATH    2.53     1.468            62
TEACHER_FACTOR      3.9643   .91443           62
Race                1.90     .593             62

Correlations

                                        Grade - PMR MATH   TEACHER_FACTOR   Race
Pearson Correlation   Grade - PMR MATH  1.000              .571             -.015
                      TEACHER_FACTOR    .571               1.000            .019
                      Race              -.015              .019             1.000
Sig. (1-tailed)       Grade - PMR MATH  .                  .000             .453
                      TEACHER_FACTOR    .000               .                .440
                      Race              .453               .440             .
N                     Grade - PMR MATH  62                 62               62
                      TEACHER_FACTOR    62                 62               62
                      Race              62                 62               62

Model Summary(b)

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .572(a)   .327       .304                1.225

a. Predictors: (Constant), Race, TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH


ANOVA(b)

Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  42.939           2    21.469        14.313   .000(a)
   Residual    88.497           59   1.500
   Total       131.435          61

a. Predictors: (Constant), Race, TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH

Coefficients(a)

                    Unstandardized Coefficients   Standardized Coefficients
Model               B         Std. Error          Beta                        t        Sig.
1  (Constant)       -.980     .853                                            -1.150   .255
   TEACHER_FACTOR   .917      .172                .571                        5.349    .000
   Race             -.065     .265                -.026                       -.246    .806

a. Dependent Variable: Grade - PMR MATH