Lecture 7: Regression and Correlation
TRANSCRIPT

8/11/2019 Lecture 7 Regression and Correlation

Correlation and Regression
Correlation and Regression
The test you choose depends on level of measurement:

Independent                   Dependent              Test
Dichotomous                   Interval-Ratio         Independent Samples t-test
Nominal/Dichotomous           Interval-Ratio         ANOVA
Nominal/Dichotomous           Nominal/Dichotomous    Cross Tabs
Interval-Ratio/Dichotomous    Interval-Ratio         Bivariate Regression/Correlation
Correlation and Regression
Bivariate regression is a technique that fits a straight line as close as possible to all the coordinates of two continuous variables plotted on a two-dimensional graph, in order to summarize the relationship between the variables.

Correlation is a statistic that assesses the strength and direction of association of two continuous variables. It is created through a technique called regression.
Bivariate Regression
For example:

A criminologist may be interested in the relationship between income and number of children in a family, or self-esteem and criminal behavior.

Independent Variables:    Dependent Variables:
Family Income             Number of Children
Self-esteem               Criminal Behavior
Bivariate Regression
For example:

Research Hypotheses:
As family income increases, the number of children in families declines (negative relationship).
As self-esteem increases, reports of criminal behavior increase (positive relationship).

Independent Variables:    Dependent Variables:
Family Income             Number of Children
Self-esteem               Criminal Behavior
Bivariate Regression
For example:

Null Hypotheses:
There is no relationship between family income and the number of children in families. The relationship statistic b = 0.
There is no relationship between self-esteem and criminal behavior. The relationship statistic b = 0.

Independent Variables:    Dependent Variables:
Family Income             Number of Children
Self-esteem               Criminal Behavior
Bivariate Regression

Let's look at the relationship between self-esteem and criminal behavior.

Regression starts with plots of coordinates of variables in a hypothesis (although you will hardly ever plot your data in reality).

The data:
Each respondent has filled out a self-esteem assessment and reported the number of crimes committed.
Bivariate Regression

[Scatterplot: Y = crimes (0-10) plotted against X = self-esteem (10-40).]

What do you think the relationship is?
Bivariate Regression

[Scatterplot: Y = crimes (0-10) plotted against X = self-esteem (10-40).]

Is it positive? Negative? No change?
Bivariate Regression

[Scatterplot: Y = crimes (0-10) plotted against X = self-esteem (10-40).]

Regression is a procedure that fits a line to the data. The slope of that line acts as a model for the relationship between the plotted variables.
Bivariate Regression

[Scatterplot with three example lines drawn through the data.]

The slope of a line is the change in the corresponding Y value for each unit increase in X (rise over run).

Slope = 0: No relationship!
Slope = 0.2: Positive relationship!
Slope = -0.2: Negative relationship!
Bivariate Regression
The mathematical equation for a line:

Y = mX + b

Where: Y = the line's position on the vertical axis at any point
X = the line's position on the horizontal axis at any point
m = the slope of the line
b = the intercept with the Y axis, where X equals zero
Bivariate Regression
The statistics equation for a line:

Ŷ = a + bX

Where: Ŷ = the line's position on the vertical axis at any point (predicted value of the dependent variable)
X = the line's position on the horizontal axis at any point (value of the independent variable)
b = the slope of the line (called the coefficient)
a = the intercept with the Y axis, where X equals zero
Bivariate Regression
The next question:
How do we draw the line?

Our goal for the line:
Fit the line as close as possible to all the data points for all values of X.
Bivariate Regression

[Scatterplot: Y = crimes (0-10) plotted against X = self-esteem (10-40).]

How do we minimize the distance between a line and all the data points?
Bivariate Regression
How do we minimize the distance between a line and all the data points?

You already know of a statistic that minimizes the distance between itself and all data values for a variable: the mean!

The mean minimizes the sum of squared deviations. It is where deviations sum to zero and where the squared deviations are at their lowest value: Σ(Y − Ȳ)²
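The claim above can be checked numerically. A minimal sketch (not from the lecture, using made-up numbers): compare the sum of squared deviations around the mean with the sum around every other candidate center on a fine grid.

```python
# Illustrative check: the mean minimizes the sum of squared deviations.
y = [0, 1, 1, 2, 3, 5, 8]          # hypothetical data
mean_y = sum(y) / len(y)

def ssd(center):
    """Sum of squared deviations of y around a candidate center."""
    return sum((yi - center) ** 2 for yi in y)

# Try centers on a fine grid from 0 to 10; none beats the mean.
candidates = [i / 100 for i in range(0, 1001)]
best = min(candidates, key=ssd)
print(round(mean_y, 2), round(best, 2))  # both ≈ 2.86
```

The grid's best center lands on the grid point nearest the mean, as the quadratic shape of the sum of squared deviations guarantees.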
Bivariate Regression
The mean minimizes the sum of squared deviations. It is where deviations sum to zero and where the squared deviations are at their lowest value.

Take this principle and fit the line to the place where squared deviations (on Y) from the line are at their lowest value (across all X's):

Σ(Y − Ŷ)², where Ŷ = the value on the line
Bivariate Regression
There are several lines that you could draw where the deviations would sum to zero...

Minimizing the sum of squared errors gives you the unique, best-fitting line for all the data points. It is the line that is closest to all points.

Ŷ (Y-hat) = Y value for the line at any X
Y = case value on variable Y
Y − Ŷ = residual
Σ(Y − Ŷ) = 0; therefore, we use Σ(Y − Ŷ)² and minimize that!
Bivariate Regression

[Scatterplot illustrating Y − Ŷ, the vertical distance between each data point (Yi, the actual Y value at an actual X) and the line (Ŷi, the line's level on Y at that X).]

Y = 10, Ŷ = 5: residual = 5
Y = 0, Ŷ = 4: residual = -4
Bivariate Regression

[Scatterplot illustrating (Y − Ŷ)², the squared vertical distance between each data point (Yi, the actual Y value at an actual X) and the line (Ŷi, the line's level on Y at that X).]

(Yi − Ŷ)² = squared deviation
Y = 10, Ŷ = 5: residual = 5, squared = 25
Y = 0, Ŷ = 4: residual = -4, squared = 16
Bivariate Regression

[Scatterplot with the fitted line drawn through the data.]

The fitted line for our example has the equation:

Ŷ = 6 − .2X

(In the form Ŷ = a + bX, where e = the distance from the line to the data points, or error.)

If you were to draw any other line, it would not minimize Σ(Y − Ŷ)².
Bivariate Regression
We use Σ(Y − Ŷ)² and minimize that!

There is a simple, elegant formula for discovering the line that minimizes the sum of squared errors:

b = Σ((X − X̄)(Y − Ȳ)) / Σ(X − X̄)²
a = Ȳ − bX̄
Ŷ = a + bX

This is the method of least squares. It gives our least squares estimate and indicates why we call this technique "ordinary least squares" or OLS regression.
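The least-squares formulas above can be applied directly. A minimal sketch (the data are hypothetical, chosen so the fit reproduces the lecture's example line Ŷ = 6 − .2X):

```python
# Ordinary least squares from the formulas above, on made-up data.
x = [10, 20, 30, 10, 20, 30]   # X, self-esteem (hypothetical)
y = [5, 2, 0, 3, 2, 0]         # Y, crimes (hypothetical)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# b = Σ((X − X̄)(Y − Ȳ)) / Σ(X − X̄)²
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)

# a = Ȳ − bX̄
a = y_bar - b * x_bar

y_hat = [a + b * xi for xi in x]   # predicted values on the line

print(b, a)  # → -0.2 6.0, the line Ŷ = 6 − .2X
```

As the earlier slide noted, the residuals around this fitted line sum to zero; any other line would produce a larger sum of squared errors.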
Bivariate Regression

[Plot: Y (0-10) against a dichotomous X (0, 1).]

Considering that a regression line minimizes Σ(Y − Ŷ)², where would the regression line cross for an interval-ratio variable regressed on a dichotomous independent variable?

For example:
0 = Men: Mean = 6
1 = Women: Mean = 4
Bivariate Regression

[Plot: Y (0-10) against a dichotomous X (0, 1), with the line running between the two group means.]

The difference of means will be the slope. This is the same number that is tested for significance in an independent samples t-test.

0 = Men: Mean = 6
1 = Women: Mean = 4
Slope = -2; Ŷ = 6 − 2X
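A quick sketch (not from the lecture, with hypothetical scores) showing that regressing Y on a 0/1 dummy reproduces the group means: the intercept is the mean of the group coded 0, and the slope is the difference of means.

```python
# Dummy-variable regression: intercept = mean(group 0), slope = difference of means.
men =   [7, 5, 6, 6]   # coded X = 0, hypothetical scores, mean = 6
women = [4, 3, 5, 4]   # coded X = 1, hypothetical scores, mean = 4

x = [0] * len(men) + [1] * len(women)
y = men + women

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

print(a, b)  # → 6.0 -2.0, the line Ŷ = 6 − 2X from the slide
```

The slope of -2 is exactly the women's mean minus the men's mean, which is the same quantity an independent samples t-test evaluates.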
Correlation
This lecture has covered how to model the relationship between two variables with regression.

Another concept is strength of association. Correlation provides that.
Correlation

[Scatterplot: Y = crimes (0-10) against X = self-esteem (10-40), with the fitted line.]

So our equation is:
Ŷ = 6 − .2X

The slope tells us the direction of association. How strong is that?
Correlation

[Scatterplot: points widely scattered around a downward-sloping line.]

Example of Low Negative Correlation

When there is a lot of difference on the dependent variable across subjects at particular values of X, there is NOT as much association (weaker).
Correlation

[Scatterplot: points tightly clustered around a downward-sloping line.]

Example of High Negative Correlation

When there is little difference on the dependent variable across subjects at particular values of X, there is MORE association (stronger).
Correlation
To find the strength of the relationship between two variables, we need correlation. The correlation is the standardized slope: it refers to the standard deviation change in Y when you go up a standard deviation in X.
Correlation
The correlation is the standardized slope: it refers to the standard deviation change in Y when you go up a standard deviation in X.

Recall the s.d. of X:  Sx = √( Σ(X − X̄)² / (n − 1) )
and the s.d. of Y:     Sy = √( Σ(Y − Ȳ)² / (n − 1) )

Pearson correlation:  r = b (Sx / Sy)
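A small sketch (hypothetical data, not from the slides) confirming that the "standardized slope" r = b(Sx/Sy) matches Pearson's r computed directly from the sums of squares.

```python
# The standardized slope equals the directly computed Pearson correlation.
import math

x = [10, 20, 30, 10, 20, 30]   # hypothetical
y = [5, 2, 0, 3, 2, 0]         # hypothetical

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sxx                       # OLS slope
s_x = math.sqrt(sxx / (n - 1))      # s.d. of X
s_y = math.sqrt(syy / (n - 1))      # s.d. of Y

r_from_slope = b * (s_x / s_y)          # standardized slope
r_direct = sxy / math.sqrt(sxx * syy)   # direct Pearson formula

print(round(r_from_slope, 4), round(r_direct, 4))  # the two agree
```

Algebraically, b(Sx/Sy) = (Sxy/Sxx)·√(Sxx/Syy) = Sxy/√(Sxx·Syy), which is the direct formula; the (n − 1) terms in the two standard deviations cancel.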
Correlation
The Pearson Correlation, r:
tells the direction and strength of the relationship between continuous variables
ranges from -1 to +1
is + when the relationship is positive and - when the relationship is negative
the higher the absolute value of r, the stronger the association
a standard deviation change in X corresponds with r standard deviation change in Y
Correlation
The Pearson Correlation, r:

The Pearson correlation is a descriptive statistic that serves as an inferential statistic too:

t(n−2) = (r − 0) / √( (1 − r²) / (n − 2) )

When it is significant, there is a relationship in the population that is not equal to zero!
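The t formula above is easy to apply by hand. A sketch (the r and n values are hypothetical, not from the slides):

```python
# t statistic (df = n − 2) for H0: population correlation = 0.
import math

def t_for_r(r, n):
    """t = (r − 0) / sqrt((1 − r²) / (n − 2))."""
    return (r - 0) / math.sqrt((1 - r ** 2) / (n - 2))

# Example: r = -0.50 from a sample of n = 30 (df = 28)
t = t_for_r(-0.50, 30)
print(round(t, 3))  # → -3.055, beyond the two-tailed .05 cutoff of about ±2.05
```

Since |t| exceeds the critical value, an r of -0.50 with n = 30 would be judged significantly different from zero at the .05 level.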
Error Analysis
Ŷ = a + bX. This equation gives the conditional mean of Y at any given value of X.

So in reality, our line gives us the expected mean of Y given each value of X.

The line's equation tells you how the mean on your dependent variable changes as your independent variable goes up.

[Plot: Y against X with the line Ŷ running through the conditional means.]
Error Analysis

As you know, every mean has a distribution around it, so there is a standard deviation. This is true for conditional means as well. So, you also have a conditional standard deviation.

The Conditional Standard Deviation, or Root Mean Square Error, equals the approximate average deviation from the line:

√( SSE / (n − 2) ) = √( Σ(Y − Ŷ)² / (n − 2) )

[Plot: Y against X with the line Ŷ and the spread of points around it.]
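A minimal sketch (hypothetical data, with the fitted line Ŷ = 6 − .2X from the earlier example) of the root mean square error formula above:

```python
# Conditional standard deviation / RMSE: sqrt(SSE / (n − 2)).
import math

x = [10, 20, 30, 10, 20, 30]   # hypothetical
y = [5, 2, 0, 3, 2, 0]         # hypothetical

# Fitted line for these data: Ŷ = 6 − 0.2X
a, b = 6.0, -0.2
y_hat = [a + b * xi for xi in x]

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # Σ(Y − Ŷ)²
rmse = math.sqrt(sse / (len(y) - 2))

print(sse, round(rmse, 3))  # → 2.0 0.707
```

So on average the observed Y values sit roughly 0.7 units from their conditional means on the line.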
Error Analysis

The Assumption of Homoskedasticity:
The variation around the line is the same no matter the X.
The conditional standard deviation is for any given value of X.

If there is a relationship between X and Y, the conditional standard deviation is going to be less than the standard deviation of Y. If this is so, you have improved prediction of the mean value of Y by looking at each level of X.

If there were no relationship, the conditional standard deviation would be the same as the original, and the regression line would be flat at the mean of Y.

[Plot: Y against X contrasting the conditional standard deviation around the line with the original standard deviation around the mean of Y.]
Error Analysis
So guess what?

We have a way to determine how much our understanding of Y is improved when taking X into account. It is based on the fact that conditional standard deviations should be smaller than Y's original standard deviation.
Error Analysis

Proportional Reduction in Error

Let's call the variation around the mean in Y "Error 1."
Let's call the variation around the line when X is considered "Error 2."
But rather than going all the way to standard deviation to determine error, let's just stop at the basic measure, the Sum of Squared Deviations.

Error 1 (E1) = Σ(Y − Ȳ)², also called the Sum of Squares
Error 2 (E2) = Σ(Y − Ŷ)², also called the Sum of Squared Errors

[Plot: Y against X showing Error 1 (distances to the mean of Y) and Error 2 (distances to the line).]
R-Squared

Proportional Reduction in Error

To determine how much taking X into consideration reduces the variation in Y (at each level of X), we can use a simple formula:

(E1 − E2) / E1

which tells us the proportion or percentage of original error that is explained by X.

Error 1 (E1) = Σ(Y − Ȳ)²
Error 2 (E2) = Σ(Y − Ŷ)²

[Plot: Y against X showing Error 1 and Error 2.]
R-squared

r² = (E1 − E2) / E1
   = (TSS − SSE) / TSS
   = ( Σ(Y − Ȳ)² − Σ(Y − Ŷ)² ) / Σ(Y − Ȳ)²

r² is called the coefficient of determination.
It is also the square of the Pearson correlation.

[Plot: Y against X showing Error 1 and Error 2.]
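A sketch (hypothetical data, not from the slides) of r² = (TSS − SSE) / TSS, checked against the square of the directly computed Pearson r:

```python
# r² as proportional reduction in error, and as the squared Pearson r.
import math

x = [10, 20, 30, 10, 20, 30]   # hypothetical
y = [5, 2, 0, 3, 2, 0]         # hypothetical

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar

tss = sum((yi - y_bar) ** 2 for yi in y)                      # Error 1: to the mean
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))   # Error 2: to the line

r_squared = (tss - sse) / tss
r = sxy / math.sqrt(sxx * tss)

print(round(r_squared, 4), round(r ** 2, 4))  # → 0.8889 0.8889
```

Here taking X into account removes about 89% of the original squared error in Y, and that proportion is exactly the square of the correlation.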
R-Squared

R²:
Is the improvement obtained by using X (and drawing a line through the conditional means) in getting as near as possible to everybody's value for Y, over just using the mean for Y alone.
Falls between 0 and 1.
An R² of 1 means an exact fit (and there is no variation of scores around the regression line).
An R² of 0 means no relationship (and as much scatter as in the original Y variable, with a flat regression line through the mean of Y).
Would be the same for X regressed on Y as for Y regressed on X.
Can be interpreted as the percentage of variability in Y that is explained by X.

Some people get hung up on maximizing R², but this is too bad, because any effect is still a finding. A small R² only indicates that you haven't told the whole (or much of the) story with your variable.
Error Analysis, SPSS

Some SPSS output (Anti-Gay Marriage regressed on Age):

r² = ( Σ(Y − Ȳ)² − Σ(Y − Ŷ)² ) / Σ(Y − Ȳ)²
   = 196.886 / 2853.286 = .069

Line to the mean: Regression Sum of Squares (196.886)
Data points to the line: Residual Sum of Squares
Data points to the mean: original Sum of Squares for Anti-Gay Marriage (2853.286)
Error Analysis

Some SPSS output (Anti-Gay Marriage regressed on Age):

r² = ( Σ(Y − Ȳ)² − Σ(Y − Ŷ)² ) / Σ(Y − Ȳ)²
   = 196.886 / 2853.286 = .069

[Plot: Anti-Gay Marriage (1 = Strong Support, 2 = Support, 3 = Neutral, 4 = Oppose, 5 = Strong Oppose) against Age (0-89), with the fitted line and the mean, M = 2.98.]

The colored lines are examples of:
Distance from each person's data point to the line or model: new, still unexplained error.
Distance from the line or model to the mean for each person: reduction in error.
Distance from each person's data point to the mean: the original variable's error.
Dichotomous Variables

[Plot: Y (0-10) against a dichotomous X (0, 1), showing BSS, WSS, and TSS around the overall mean of 5.]

Using a dichotomous independent variable, the ANOVA table in bivariate regression will have the same numbers and ANOVA results as a one-way ANOVA table would (and compare this with an independent samples t-test).

0 = Men: Mean = 6
1 = Women: Mean = 4
Overall Mean = 5
Slope = -2; Ŷ = 6 − 2X
Regression, Inferential Statistics

Recall that statistics are divided between descriptive and inferential statistics.

Descriptive:
The equation for your line is a descriptive statistic. It tells you the real, best-fitted line that minimizes squared errors.

Inferential:
But what about the population? What can we say about the relationship between your variables in the population? The inferential statistics are estimates based on the best-fitted line.
Regression, Inferential Statistics

The significance of F, you already understand.

The ratio of Regression (line to the mean of Y) to Residual (line to data point) Sums of Squares, each divided by its degrees of freedom, forms an F ratio in repeated sampling.

Null: r² = 0 in the population. If F exceeds critical F, then your variables have a relationship in the population (X explains some of the variation in Y).

F = (Regression SS / df regression) / (Residual SS / df residual)

[F distribution with the most extreme 5% of Fs shaded.]
Regression, Inferential Statistics

What about the Slope or Coefficient?

From sample to sample, different slopes would be obtained. The slope has a sampling distribution that is normally distributed. So we can do a significance test.

[Standard normal curve, z from -3 to 3.]
Regression, Inferential Statistics

Conducting a Test of Significance for the Slope of the Regression Line

By slapping the sampling distribution for the slope over a guess of the population's slope, H0, one determines whether a sample could have been drawn from a population where the slope is equal to H0.

1. Two-tailed significance test for α-level = .05
2. Critical t = +/- 1.96
3. To find if there is a significant slope in the population:
   H0: β = 0
   Ha: β ≠ 0
4. Collect Data
5. Calculate t (z): t = (b − β0) / s.e., where s.e. = √( Σ(Y − Ŷ)² / (n − 2) ) / √( Σ(X − X̄)² )
6. Make a decision about the null hypothesis
7. Find the P-value
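Steps 4 and 5 can be sketched numerically. This example uses hypothetical data (not the lecture's), applying the standard error and t formulas above:

```python
# Standard error of the slope and its t statistic, per the steps above.
import math

x = [10, 20, 30, 10, 20, 30, 15, 25]   # hypothetical
y = [5, 2, 0, 3, 2, 0, 4, 1]           # hypothetical

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # Σ(Y − Ŷ)²

# s.e. = sqrt(SSE / (n − 2)) / sqrt(Σ(X − X̄)²)
se = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)

t = (b - 0) / se   # test against H0: β = 0
print(round(b, 3), round(se, 3), round(t, 2))
```

Here |t| far exceeds the critical value of 1.96, so with these data the null hypothesis of a zero population slope would be rejected.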
Correlation and Regression

Back to the SPSS output:

The standard error and t appear on the SPSS output, and the p-value too!
Correlation and Regression

Back to the SPSS output:

Ŷ = 1.88 + .023X

So in the GSS example, the slope is significant. There is evidence of a positive relationship in the population between Age and Anti-Gay Marriage sentiment. 6.9% of the variation in Marriage attitude is explained by Age. The older Americans get, the more likely they are to oppose gay marriage.

A one-year increase in age elevates anti-gay-marriage attitudes by .023 scale units. There is a weak positive relationship.