Lecture 7: Regression and Correlation


TRANSCRIPT

  • Slide 1/51: Correlation and Regression

  • Slide 2/51: Correlation and Regression

    The test you choose depends on level of measurement:

    Independent                      Dependent                 Test
    Dichotomous                      Interval-Ratio            Independent Samples t-test
    Nominal or Dichotomous           Interval-Ratio            ANOVA
    Nominal or Dichotomous           Nominal or Dichotomous    Cross Tabs
    Interval-Ratio or Dichotomous    Interval-Ratio            Bivariate Regression/Correlation

  • Slide 3/51: Correlation and Regression

    Bivariate regression is a technique that fits a straight line as close as
    possible to all the coordinates of two continuous variables plotted on a
    two-dimensional graph, in order to summarize the relationship between the
    variables.

    Correlation is a statistic that assesses the strength and direction of
    association of two continuous variables . . . It is created through a
    technique called regression.

  • Slide 4/51: Bivariate Regression

    For example: a criminologist may be interested in the relationship between
    income and number of children in a family, or between self-esteem and
    criminal behavior.

    Independent Variables: Family Income, Self-esteem
    Dependent Variables:   Number of Children, Criminal Behavior

  • Slide 5/51: Bivariate Regression

    For example:

    Research Hypotheses:

    As family income increases, the number of children in families declines
    (negative relationship).

    As self-esteem increases, reports of criminal behavior increase (positive
    relationship).

    Independent Variables: Family Income, Self-esteem
    Dependent Variables:   Number of Children, Criminal Behavior

  • Slide 6/51: Bivariate Regression

    For example:

    Null Hypotheses:

    There is no relationship between family income and the number of children
    in families. The relationship statistic b = 0.

    There is no relationship between self-esteem and criminal behavior. The
    relationship statistic b = 0.

    Independent Variables: Family Income, Self-esteem
    Dependent Variables:   Number of Children, Criminal Behavior

  • Slide 7/51: Bivariate Regression

    Let's look at the relationship between self-esteem and criminal behavior.

    Regression starts with plots of the coordinates of the variables in a
    hypothesis (although you will hardly ever plot your data in reality).

    The data: each respondent has filled out a self-esteem assessment and
    reported the number of crimes committed.

  • Slide 8/51: Bivariate Regression

    [Scatter plot: X = self-esteem (10 to 40), Y = number of crimes (0 to 10).]

    What do you think the relationship is?

  • Slide 9/51: Bivariate Regression

    [Same scatter plot of crimes against self-esteem.]

    Is it positive? Negative? No change?

  • Slide 10/51: Bivariate Regression

    [Same scatter plot of crimes against self-esteem.]

    Regression is a procedure that fits a line to the data. The slope of that
    line acts as a model for the relationship between the plotted variables.

  • Slide 11/51: Bivariate Regression

    [Scatter plot with three candidate lines of different slopes drawn through
    the data.]

    The slope of a line is the change in the corresponding Y value for each
    unit increase in X (rise over run).

    Slope = 0: no relationship!
    Slope = 0.2: positive relationship!
    Slope = -0.2: negative relationship!

  • Slide 12/51: Bivariate Regression

    The mathematical equation for a line:

    Y = mX + b

    Where: Y = the line's position on the vertical axis at any point
           X = the line's position on the horizontal axis at any point
           m = the slope of the line
           b = the intercept with the Y axis, where X equals zero

  • Slide 13/51: Bivariate Regression

    The statistics equation for a line:

    Ŷ = a + bX

    Where: Ŷ = the line's position on the vertical axis at any point (the
               predicted value of the dependent variable)
           X = the line's position on the horizontal axis at any point (the
               value of the independent variable)
           b = the slope of the line (called the coefficient)
           a = the intercept with the Y axis, where X equals zero

  • Slide 14/51: Bivariate Regression

    The next question: how do we draw the line?

    Our goal for the line: fit the line as close as possible to all the data
    points for all values of X.

  • Slide 15/51: Bivariate Regression

    [Scatter plot: X = self-esteem, Y = crimes.]

    How do we minimize the distance between a line and all the data points?

  • Slide 16/51: Bivariate Regression

    How do we minimize the distance between a line and all the data points?

    You already know of a statistic that minimizes the distance between itself
    and all data values for a variable: the mean!

    The mean minimizes the sum of squared deviations: it is where deviations
    sum to zero and where the squared deviations are at their lowest value.

    Σ(Y - Ȳ)²
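
    A quick numerical check of this claim in Python, with a small invented
    sample of Y values (the data and the grid search are purely illustrative,
    not from the lecture):

    ```python
    import numpy as np

    # Invented sample of Y values, just to check the claim numerically.
    y = np.array([0.0, 2.0, 3.0, 5.0, 10.0])

    def sum_sq_dev(c):
        """Sum of squared deviations of y around a candidate center c."""
        return np.sum((y - c) ** 2)

    # Search a fine grid of candidate centers; the minimizer lands at the mean (4.0).
    grid = np.linspace(y.min(), y.max(), 1001)
    best = grid[np.argmin([sum_sq_dev(c) for c in grid])]

    print(best, y.mean())        # 4.0  4.0
    print(np.sum(y - y.mean()))  # deviations around the mean sum to 0
    ```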

  • Slide 17/51: Bivariate Regression

    The mean minimizes the sum of squared deviations: it is where deviations
    sum to zero and where the squared deviations are at their lowest value.

    Take this principle and fit the line to the place where the squared
    deviations (on Y) from the line are at their lowest value (across all X's):

    Σ(Y - Ŷ)², where Ŷ = the line

  • Slide 18/51: Bivariate Regression

    There are several lines that you could draw where the deviations would sum
    to zero. Minimizing the sum of squared errors gives you the unique,
    best-fitting line for all the data points. It is the line that is closest
    to all points.

    Ŷ (Y-hat) = Y value for the line at any X
    Y = case value on variable Y
    Y - Ŷ = residual

    Σ(Y - Ŷ) = 0; therefore, we use Σ(Y - Ŷ)² and minimize that!

  • Slide 19/51: Bivariate Regression

    [Scatter plot with the fitted line: X = self-esteem, Y = crimes.]

    Illustration of Y - Ŷ:
    Yi = actual Y value corresponding with an actual X
    Ŷi = the line's level on Y corresponding with that same X

    Y = 10, Ŷ = 5 . . . residual = 5
    Y = 0,  Ŷ = 4 . . . residual = -4

  • Slide 20/51: Bivariate Regression

    [Scatter plot with the fitted line: X = self-esteem, Y = crimes.]

    Illustration of (Y - Ŷ)²:
    Yi = actual Y value corresponding with an actual X
    Ŷi = the line's level on Y corresponding with that same X
    (Yi - Ŷi)² = squared deviation

    Y = 10, Ŷ = 5 . . . 25
    Y = 0,  Ŷ = 4 . . . 16

  • Slide 21/51: [figure only; no transcript text]

  • Slide 22/51: Bivariate Regression

    [Scatter plot with the fitted line drawn through the data: X = self-esteem,
    Y = crimes.]

    The fitted line for our example has the equation:

    Ŷ = 6 - .2X

    Y = a + bX + e, where e = the distance from the line to the data points,
    or error.

    If you were to draw any other line, it would not minimize Σ(Y - Ŷ)².

  • Slide 23/51: Bivariate Regression

    We use Σ(Y - Ŷ)² and minimize that!

    There is a simple, elegant formula for discovering the line that minimizes
    the sum of squared errors:

    b = Σ( (X - X̄)(Y - Ȳ) ) / Σ(X - X̄)²
    a = Ȳ - bX̄
    Ŷ = a + bX

    This is the method of least squares. It gives our least squares estimate
    and indicates why we call this technique "ordinary least squares," or OLS,
    regression.
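
    As a concrete sketch of these formulas, here is a minimal Python example
    (the self-esteem and crime values are invented; only the formulas for b
    and a come from the slide):

    ```python
    import numpy as np

    # Invented example data: self-esteem scores (X) and crime counts (Y).
    x = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0])
    y = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.0, 0.0])

    x_bar, y_bar = x.mean(), y.mean()

    # Least squares slope and intercept, exactly as on the slide.
    b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    a = y_bar - b * x_bar

    y_hat = a + b * x  # predicted values on the fitted line
    print(f"b = {b:.3f}, a = {a:.3f}")
    ```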

  • Slide 24/51: Bivariate Regression

    [Plot: Y (0 to 10) against a dichotomous X coded 0 and 1.]

    Considering that a regression line minimizes Σ(Y - Ŷ)², where would the
    regression line cross for an interval-ratio variable regressed on a
    dichotomous independent variable?

    For example:
    0 = Men:   Mean = 6
    1 = Women: Mean = 4

  • Slide 25/51: Bivariate Regression

    [Plot: the fitted line passes through the two group means at X = 0 and
    X = 1.]

    The difference of means will be the slope. This is the same number that is
    tested for significance in an independent samples t-test.

    0 = Men:   Mean = 6
    1 = Women: Mean = 4

    Slope = -2;  Ŷ = 6 - 2X
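
    A quick check of this point in Python (the individual values are invented;
    only the 0/1 coding and the group means of 6 and 4 follow the slide):

    ```python
    import numpy as np

    # Invented data: X is the dummy (0 = men, 1 = women); Y has group means 6 and 4.
    x = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
    y = np.array([5.0, 6.0, 6.0, 7.0, 3.0, 4.0, 4.0, 5.0])

    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()

    # The slope equals the difference of group means (4 - 6 = -2),
    # and the intercept equals the mean of the group coded 0.
    print(a, b)                                  # 6.0 -2.0
    print(y[x == 1].mean() - y[x == 0].mean())   # -2.0
    ```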

  • Slide 26/51: Correlation

    This lecture has covered how to model the relationship between two
    variables with regression. Another concept is strength of association.
    Correlation provides that.

  • Slide 27/51: Correlation

    [Scatter plot with the fitted line: X = self-esteem, Y = crimes.]

    So our equation is: Ŷ = 6 - .2X

    The slope tells us the direction of association. How strong is it?

  • Slide 28/51: Correlation

    [Scatter plot: points widely spread around a downward-sloping line.]

    Example of Low Negative Correlation: when there is a lot of difference on
    the dependent variable across subjects at particular values of X, there is
    NOT as much association (weaker).

  • Slide 29/51: Correlation

    [Scatter plot: points clustered tightly around a downward-sloping line.]

    Example of High Negative Correlation: when there is little difference on
    the dependent variable across subjects at particular values of X, there is
    MORE association (stronger).

  • Slide 30/51: Correlation

    To find the strength of the relationship between two variables, we need
    correlation. The correlation is the standardized slope: it refers to the
    standard deviation change in Y when you go up a standard deviation in X.

  • Slide 31/51: Correlation

    The correlation is the standardized slope: it refers to the standard
    deviation change in Y when you go up a standard deviation in X.

    Recall that the s.d. of X is  Sx = sqrt( Σ(X - X̄)² / (n - 1) )
    and the s.d. of Y is          Sy = sqrt( Σ(Y - Ȳ)² / (n - 1) )

    Pearson correlation:  r = (Sx / Sy) × b
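
    A small sketch of this standardization in Python, reusing the invented data
    from the earlier sketch (the comparison with numpy's built-in correlation
    is just a sanity check):

    ```python
    import numpy as np

    # Invented example data.
    x = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0])
    y = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.0, 0.0])

    # Least squares slope b, then standardize it by the ratio of standard deviations.
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    sx = x.std(ddof=1)  # s.d. with n - 1 in the denominator, as on the slide
    sy = y.std(ddof=1)

    r = (sx / sy) * b
    print(r, np.corrcoef(x, y)[0, 1])  # the two values agree
    ```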

  • Slide 32/51: Correlation

    The Pearson Correlation, r:

    tells the direction and strength of the relationship between continuous
    variables

    ranges from -1 to +1

    is + when the relationship is positive and - when the relationship is
    negative

    the higher the absolute value of r, the stronger the association

    a standard deviation change in X corresponds with an r standard deviation
    change in Y

  • Slide 33/51: Correlation

    The Pearson Correlation, r:

    The Pearson correlation is an inferential statistic too:

    t(n-2) = (r - 0) / sqrt( (1 - r²) / (n - 2) )

    When it is significant, there is a relationship in the population that is
    not equal to zero!
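
    A sketch of that test in Python (r and n are invented values; scipy is used
    only to look up the two-tailed p-value):

    ```python
    import numpy as np
    from scipy import stats

    r, n = 0.45, 30  # invented sample correlation and sample size

    # t statistic for testing H0: correlation = 0, with n - 2 degrees of freedom.
    t = (r - 0) / np.sqrt((1 - r**2) / (n - 2))
    p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-tailed p-value

    print(f"t = {t:.3f}, p = {p:.4f}")
    ```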

  • Slide 34/51: Error Analysis

    Ŷ = a + bX. This equation gives the conditional mean of Y at any given
    value of X.

    So in reality, our line gives us the expected mean of Y given each value
    of X. The line's equation tells you how the mean on your dependent
    variable changes as your independent variable goes up.

    [Plot: the line Ŷ traced through the conditional means of Y at each X.]

  • Slide 35/51: Error Analysis

    As you know, every mean has a distribution around it, so there is a
    standard deviation. This is true for conditional means as well. So, you
    also have a conditional standard deviation.

    The Conditional Standard Deviation, or Root Mean Square Error, equals the
    approximate average deviation from the line:

    sqrt( SSE / (n - 2) ) = sqrt( Σ(Y - Ŷ)² / (n - 2) )

    [Plot: the fitted line Ŷ with the spread of Y around it at each value of X.]
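
    Continuing the invented example from the earlier sketches, a short
    illustration of this quantity (SPSS reports a closely related figure as the
    standard error of the estimate; that label is an assumption here, not from
    the slide):

    ```python
    import numpy as np

    # Invented example data and the least squares fit from the earlier sketches.
    x = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0])
    y = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.0, 0.0])
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    y_hat = a + b * x

    # Conditional standard deviation / root mean square error: sqrt(SSE / (n - 2)).
    sse = np.sum((y - y_hat) ** 2)
    rmse = np.sqrt(sse / (len(y) - 2))
    print(f"SSE = {sse:.3f}, RMSE = {rmse:.3f}")
    ```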

  • Slide 36/51: Error Analysis

    The Assumption of Homoskedasticity: the variation around the line is the
    same no matter the value of X. The conditional standard deviation is for
    any given value of X.

    If there is a relationship between X and Y, the conditional standard
    deviation is going to be less than the standard deviation of Y. If this is
    so, you have improved prediction of the mean value of Y by looking at each
    level of X.

    If there were no relationship, the conditional standard deviation would be
    the same as the original, and the regression line would be flat at the
    mean of Y.

    [Plot: the conditional standard deviation around the line compared with
    the original standard deviation of Y.]

  • Slide 37/51: Error Analysis

    So guess what? We have a way to determine how much our understanding of Y
    is improved when taking X into account. It is based on the fact that
    conditional standard deviations should be smaller than Y's original
    standard deviation.

  • Slide 38/51: Error Analysis

    Proportional Reduction in Error:

    Let's call the variation around the mean in Y "Error 1."
    Let's call the variation around the line when X is considered "Error 2."

    But rather than going all the way to standard deviations to determine
    error, let's just stop at the basic measure, the sum of squared deviations.

    Error 1 (E1) = Σ(Y - Ȳ)²   also called the Sum of Squares
    Error 2 (E2) = Σ(Y - Ŷ)²   also called the Sum of Squared Errors

    [Plot: Error 1 measured from the mean of Y, Error 2 measured from the
    regression line.]

  • Slide 39/51: R-Squared

    Proportional Reduction in Error: to determine how much taking X into
    consideration reduces the variation in Y (at each level of X), we can use
    a simple formula:

    (E1 - E2) / E1

    which tells us the proportion or percentage of the original error that is
    explained by X.

    Error 1 (E1) = Σ(Y - Ȳ)²
    Error 2 (E2) = Σ(Y - Ŷ)²

    [Plot: Error 1 measured from the mean of Y, Error 2 measured from the
    regression line.]

  • Slide 40/51: R-Squared

    r² = (E1 - E2) / E1
       = (TSS - SSE) / TSS
       = ( Σ(Y - Ȳ)² - Σ(Y - Ŷ)² ) / Σ(Y - Ȳ)²

    r² is called the coefficient of determination. It is also the square of
    the Pearson correlation.

    [Plot: Error 1 measured from the mean of Y, Error 2 measured from the
    regression line.]
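
    Putting the pieces together on the invented example data, a short sketch of
    the r² computation (the comparison with the squared Pearson correlation is
    a sanity check of the slide's last point):

    ```python
    import numpy as np

    # Invented example data and the least squares fit.
    x = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0])
    y = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.0, 0.0])
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    y_hat = a + b * x

    tss = np.sum((y - y.mean()) ** 2)  # Error 1: variation around the mean of Y
    sse = np.sum((y - y_hat) ** 2)     # Error 2: variation around the fitted line

    r_squared = (tss - sse) / tss
    print(r_squared, np.corrcoef(x, y)[0, 1] ** 2)  # equal, as the slide says
    ```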

  • Slide 41/51: R-Squared

    R²:

    Is the improvement obtained by using X (and drawing a line through the
    conditional means) in getting as near as possible to everybody's value for
    Y, over just using the mean for Y alone.

    Falls between 0 and 1.

    An R² of 1 means an exact fit (and there is no variation of scores around
    the regression line).

    An R² of 0 means no relationship (and as much scatter as in the original Y
    variable, and a flat regression line through the mean of Y).

    Would be the same for X regressed on Y as for Y regressed on X.

    Can be interpreted as the percentage of variability in Y that is explained
    by X.

    Some people get hung up on maximizing R², but this is unfortunate because
    any effect is still a finding. A small R² only indicates that you haven't
    told the whole (or much of the) story with your variable.

  • Slide 42/51: Error Analysis, SPSS

    Some SPSS output (Anti-Gay Marriage regressed on Age):

    r² = ( Σ(Y - Ȳ)² - Σ(Y - Ŷ)² ) / Σ(Y - Ȳ)²  =  196.886 / 2853.286  =  .069

    [SPSS ANOVA table annotations: Regression SS = line to the mean; Residual
    SS = data points to the line; Total SS = data points to the mean, the
    original SS for Anti-Gay Marriage.]

  • Slide 43/51: Error Analysis

    Some SPSS output (Anti-Gay Marriage regressed on Age):

    r² = ( Σ(Y - Ȳ)² - Σ(Y - Ŷ)² ) / Σ(Y - Ȳ)²  =  196.886 / 2853.286  =  .069

    [Scatter plot: X = Age (0 to 89), Y = Anti-Gay Marriage attitude on a 1-5
    scale (1 = Strong Support, 2 = Support, 3 = Neutral, 4 = Oppose, 5 = Strong
    Oppose), with the regression line and the mean M = 2.98.]

    Colored lines are examples of:

    Distance from each person's data point to the line or model: new, still
    unexplained error.

    Distance from the line or model to the mean for each person: reduction in
    error.

    Distance from each person's data point to the mean: the original
    variable's error.

  • Slide 44/51: [figure only; no transcript text]

  • Slide 45/51: Dichotomous Variables

    [Plot: Y (0 to 10) against a dichotomous X coded 0 and 1, with the group
    means, the grand mean of 5, and the BSS, WSS, and TSS distances marked.]

    Using a dichotomous independent variable, the ANOVA table in bivariate
    regression will have the same numbers and ANOVA results as a one-way ANOVA
    table would (and compare this with an independent samples t-test).

    0 = Men:   Mean = 6
    1 = Women: Mean = 4
    Grand Mean = 5

    Slope = -2;  Ŷ = 6 - 2X

  • Slide 46/51: Regression, Inferential Statistics

    Recall that statistics are divided between descriptive and inferential
    statistics.

    Descriptive: the equation for your line is a descriptive statistic. It
    tells you the real, best-fitted line that minimizes squared errors.

    Inferential: but what about the population? What can we say about the
    relationship between your variables in the population? The inferential
    statistics are estimates based on the best-fitted line.

  • Slide 47/51: Regression, Inferential Statistics

    The significance of F you already understand. The ratio of the Regression
    (line to the mean of Y) to the Residual (line to data point) Sums of
    Squares forms an F ratio in repeated sampling.

    Null: r² = 0 in the population. If F exceeds the critical F, then your
    variables have a relationship in the population (X explains some of the
    variation in Y).

    F = Regression SS / Residual SS

    [F distribution sketch with the most extreme 5% of F values shaded.]

  • Slide 48/51: Regression, Inferential Statistics

    What about the Slope or Coefficient?

    From sample to sample, different slopes would be obtained. The slope has a
    sampling distribution that is normally distributed. So we can do a
    significance test.

    [Standard normal curve, z from -3 to 3.]

  • Slide 49/51: Regression, Inferential Statistics

    Conducting a test of significance for the slope of the regression line:

    By slapping the sampling distribution for the slope over a guess of the
    population's slope, H0, one determines whether a sample could have been
    drawn from a population where the slope equals the H0 value.

    1. Two-tailed significance test for α-level = .05
    2. Critical t = +/- 1.96
    3. To find if there is a significant slope in the population,
       H0: β = 0
       Ha: β ≠ 0
    4. Collect data
    5. Calculate t (z):  t = (b - β0) / s.e.,
       where s.e. = sqrt( Σ(Y - Ŷ)² / (n - 2) ) / sqrt( Σ(X - X̄)² )
    6. Make a decision about the null hypothesis
    7. Find the p-value
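
    The steps above can be traced in a short Python sketch (the data are the
    invented values from the earlier sketches; the standard error formula is
    the one given in step 5, and scipy is used only for the p-value lookup):

    ```python
    import numpy as np
    from scipy import stats

    # Invented example data.
    x = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0])
    y = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.0, 0.0])
    n = len(y)

    # Least squares fit.
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    y_hat = a + b * x

    # Standard error of the slope: sqrt(SSE / (n - 2)) / sqrt(sum of squared X deviations).
    se_b = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2)) / np.sqrt(np.sum((x - x.mean()) ** 2))

    # t statistic against the null hypothesis that the population slope is 0.
    t = (b - 0) / se_b
    p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-tailed p-value
    print(f"b = {b:.3f}, s.e. = {se_b:.3f}, t = {t:.2f}, p = {p:.4f}")
    ```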

  • Slide 50/51: Correlation and Regression

    Back to the SPSS output: the standard error and t appear on the SPSS
    output, and the p-value too!

  • Slide 51/51: Correlation and Regression

    Back to the SPSS output:

    Ŷ = 1.88 + .023X

    So in the GSS example, the slope is significant. There is evidence of a
    positive relationship in the population between Age and Anti-Gay Marriage
    sentiment. 6.9% of the variation in marriage attitude is explained by age.
    The older Americans get, the more likely they are to oppose gay marriage.

    A one-year increase in age elevates anti-gay-marriage attitudes by .023
    scale units. There is a weak relationship.
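
    As a quick worked example of reading this equation (the two ages are chosen
    only for illustration): at age 20 the predicted score is 1.88 + .023 × 20 =
    2.34, and at age 70 it is 1.88 + .023 × 70 = 3.49, a bit above the neutral
    point on the 1-to-5 scale. The same check in Python:

    ```python
    # Predicted Anti-Gay Marriage score at two illustrative ages, using the slide's equation.
    for age in (20, 70):
        print(age, 1.88 + 0.023 * age)
    ```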