chapter 9: interpretative aspects of correlation and regression

Chapter 9: Interpretative aspects of correlation and regression

Fun facts about the regression line

Equation of regression line:

If we convert our X and Y scores to zx and zy, the regression line through the z-scores is:

Because the means of the z-scores are zero and the standard deviations are 1.If we convert our scores to z-scores, the slope of the regression line is equal to the correlation.

XY rzz '

10 20 30 40

20

30

40

x

y

r=-0.83, Regression line: Y'=-0.5X+41.5

-3 -2 -1 0 1

-1

0

1

2

zx

zy

r=-0.83, Regression line: Y'=-0.83X+0.00

YXSSrX

SSrY

X

Y

X

Y

Regression to the mean: When |r|<1, the more extreme values of X will tend to be paired to less extreme values of Y.

The slope of the regression line is flatter for lower correlations. This means that the expected values of Y are closer to the mean of Y for lower correlations.

Remember, the slope of the regression line is:

-4 -2 0 2 4

-2

0

2

X

Y

r=0.96

-2 0 2

-2

-1

0

1

2

XY

r=0.26

X

Y

SSr

“I had the most satisfying Eureka experience of my career while attempting to teach flight instructors that praise is more effective than punishment for promoting skill-learning. When I had finished my enthusiastic speech, one of the most seasoned instructors in the audience raised his hand and made his own short speech, which began by conceding that positive reinforcement might be good for the birds, but went on to deny that it was optimal for flight cadets. He said, “On many occasions I have praised flight cadets for clean execution of some aerobatic maneuver, and in general when they try it again, they do worse. On the other hand, I have often screamed at cadets for bad execution, and in general they do better the next time. So please don’t tell us that reinforcement works and punishment does not, because the opposite is the case.”

This was a joyous moment, in which I understood an important truth about the world: because we tend to reward others when they do well and punish them when they do badly, and because there is regression to the mean, it is part of the human condition that we are statistically punished for rewarding others and rewarded for punishing them. I immediately arranged a demonstration in which each participant tossed two coins at a target behind his back, without any feedback. We measured the distances from the target and could see that those who had done best the first time had mostly deteriorated on their second try, and vice versa. But I knew that this demonstration would not undo the effects of lifelong exposure to a perverse contingency.”

-Daniel Kahneman

A classic example of regression to the mean: Correlations between husband and wives’ IQs

Regression to the meanExample: correlation of IQs of husband and wives. The IQ’s of husbands and wives have been found to correlate with r=0.5. Both wives and husbands have mean IQs of 100 and standard deviations of 15. Here’s a scatter plot of a typical sample of 200 couples.

55 70 85 100 115 130 14555

70

85

100

115

130

145

Wife IQ

Hus

band

IQ

n= 200, r= 0.50, Y' = 0.50 X + 50.0

Regression to the meanExample: correlation of IQs of husband and wives. According to the regression line, the expected IQ of a husband of wife with an IQ of 115 should be (0.5)(100)+50 = 107.5. This is above average, but closer to the mean of 100.

55 70 85 100 115 130 14555

70

85

100

115

130

145

Wife IQ

Hus

band

IQ

X = 115, Y' = 0.50 (100) + 50.0 = 107.5

“Homoscedasticity”

Today’s word:

“Homoscedasticity”

Variability around the regression line isconstant.

Variability around the regression line varies with x.

99.125.1151 22 rSS YYX

Interpretation of SYX, the standard error of the estimate

If the points are distributed with ‘homoscedasticity’, then the Y-values should be normally distributed above and below the regression line. SYX is a measure of the standard deviation of this normal distribution. This means that 68% of the scores should fall within +/- 1 standard deviation of the regression line, and 98% should fall within +/- 2 standard deviations of the regression line.

68% of Husband’s IQ fall within +/- 12.99 IQ points of the regression line.

55 70 85 100 115 130 14555

70

85

100

115

130

145

Wife IQ

Hus

band

IQ

Example: correlation of IQs of husband and wives. What percent of women with IQ’s of 115 are married to men with IQ’s of 115 or more?

99.125.1151 22 rSS YYX

To calculate the proportion above 115, we calculate z and use Table A:

z = (115-107.5)/12.99 = .5744

The area above z =.5744 is .2843. So 28.43% of women with IQ’s of 115 are married to men with IQ’s of 115 or more.

Answer: We just calculated that the mean IQ of a man married to a woman with an IQ of 115 is 107.5. If we assume normal distributions and ‘homoscedasticity’, then the standard deviation of the IQ’s of men married to women with IQ’s of 115 is SYX:

Example: The correlation between IQs of twins reared apart was found to be 0.76. Assume that IQs are distributed normally with a mean of 100 and standard deviation of 15 points, and also assume homoscedasticity.

1) Find the regression line that predicts the IQ of one twin based on another’s2) Find the standard error of the estimate SYX.3) What is the mean IQ of a twin that has an IQ of 130?4) What proportion of all twin subjects have an IQ over 130? 5) What proportion of twins that have a sibling with an IQ of 130 have an IQ over

130?

120 155 190 225 260 295 330120

155

190

225

260

295

330

Batting average this year

Bat

ting

aver

age

next

yea

r

Another example:

The year to year correlation for a typical baseball player’s batting averages is 0.41. Suppose that the batting averages for this year is distributed normally with a mean of 225 and a standard deviation of 35.

For players that batted 300 one year, what is the expected distribution of batting averages for next year?

http://www.beyondtheboxscore.com/2011/9/1/2393318/what-hitting-metrics-are-consistent-year-to-year

Answer: The batting averages will be distributed normally with mean determined by the regression line, and the standard deviation equal to the standard error of the mean.

For X = 300, Y’ = .41(300)+132.75 = 255.75

So the expected batting average next year should be distributed normally with a mean of 255.75 and a standard deviation of 31.92. Note that the mean is higher than the overall mean of 225, but lower than the previous year of 300.

Another example:

The year to year correlation for a typical baseball player’s batting averages is 0.41. Suppose that the batting averages for this year is distributed normally with a mean of 225 and a standard deviation of 35.

For players that batted 300 one year, what is the expected distribution of batting averages for next year?

92.3141.1351 22 rSS YYX

75.13241.225225352541.

XXYXX

SSrYX

Y

http://www.beyondtheboxscore.com/2011/9/1/2393318/what-hitting-metrics-are-consistent-year-to-year

Answer: We know that the expected batting average next year should be distributed normally with a mean of 255.75 and a standard deviation of 31.92.

This like our old z-transformation problems.

z = (300-255.75)/31.92 = 1.39

Pr(z>1.39) = .0823.

Only 8.23% of the players will bat 300 or higher.

On the other hand, only 1.61% of all batters will bat 300 or higher.

What percent of players that bat 300 this year will bat 300 or higher next year?

Proportion of variance in Y associated with variance in X.Remember this example? Here are two hypothetical samples showing scatter plots of ages of brides and grooms for a correlation of 1 and a correlation of 0.

Correlation of r=1.0

10 20 30 40

20

30

40

Age of BrideA

ge o

f Gro

om

r=0.00, Regression line: Y'=0.00X+26.8

Correlation of r=0.0

For r=1, all of the variance in Y can be explained (or predicted) by the variance in X.

For r=0, none of the variance in Y can be explained (or predicted) by the variance in X.

So r reflects the amount that variability in Y can be explained by variability in X.

0 20 400

10

20

30

40

50

Age of Bride

Age

of G

room

r=1.00, Regression line: Y'=1.20X-3.32

5 10 15 20 25 30 35

5

10

15

20

25

Stress

Eat

ing

Diff

icul

ties

Y

Y’'YY

YY ' YY

nYY

SYX

2

2 )(

total variance of Y

variance of Y not explained by X

variance of Y explained by X

It turns out that the total variance is the sum of the corresponding component variances.

The deviation between Y and the mean can be broken down into two components:

The total variance is the sum of the variances explained and not explained by x.

nYY

SY

2

2 )(

nYY

SYX

2

2 )(

2'

22YYXY SSS

)'()'( YYYYYY

nYY

SY

2

2'

)'(

The total variance is the sum of the variances explained and not explained by x.

How does this relate to the correlation, r?Let’s look at:

Which is the proportion of total variance explained by X. This is the same as:

2

22

Y

YXY

SSS Remember,

After a little algebra (page 151) we can show that

2'

22YYXY SSS

2

2'

Y

Y

SS

2

22

Y

YXY

SSS 21 rSS YYX

22

2' r

SS

Y

Y

r2 is the proportion of variance in Y explained by variance in X, and is called the coefficient of determination.

The remaining variance, k2 = 1-r2 , is called the coefficient of nondetermination

-20 0 20 400

10

20

30

40

50

x

y

r = 0.71

If r= .7071, then r2 = 0.5, which means that half the variance in Y can be explained by variance in X. The other half cannot be explained by variance in X

22

2' r

SS

Y

Y

Factor that influences r: (1) ‘Range of Talent’ (sometimes called ‘restricted range’)

Example: The IQ’s of husbands and wives.

55 70 85 100 115 130 145

70

85

100

115

130

IQ wife

IQ h

usba

ndr=0.5

Now suppose we were to only sample from women with IQ’s of 115 or higher. This is called ‘restricting the range’.

55 70 85 100 115 130 145

70

85

100

115

130

IQ wife

IQ h

usba

nd

100 115 130 145 160

100

115

130

IQ wife

IQ h

usba

ndr = 0.211

The correlation among these remaining couples’ IQs is much lower

55 70 85 100 115 130 145

70

85

100

115

130

IQ wife

IQ h

usba

nd

r = 0.50

Restricting the range to make a discontinuous distribution can often increase the correlation:

55 70 85 100 115 130 145

70

85

100

115

130

IQ wife

IQ h

usba

nd


r = 0.50

55 70 85 100 115 130 145

70

85

100

115

130

IQ wife

IQ h

usba

nd

r = 0.670


Factor that influences r: (2) ‘Homogeneity of Samples’

Correlation values can be both increased or decreased if we accidentally include two (or more) distinct sub-populations.

60 62 64 66 68 70 7250

55

60

65

70

75

80

Average of parent's height (in)

Hei

ght o

f all

stud

ents

n = 96, r = 0.43

FemaleMale

60 62 64 66 68 70 7250

55

60

65

70

75

80


Fem

ale

stud

ent's

hei

ght (

in)

n = 75, r = 0.56

60 62 64 66 68 70 7250

55

60

65

70

75

80


Mal

e st

uden

t's h

eigh

t (in

)

n = 21, r = 0.44

60 65 70 75

0

0.5

1

1.5

2

2.5

3

Height (in)

Hou

rs p

layi

ng v

ideo

gam

esn=95, r = 0.30

Factor that influences r: (2) ‘Homogeneity of Samples’

Correlation values can be both increased or decreased if we accidentally include two (or more) distinct sub-populations.

Factor that influences r: (2) ‘Homogeneity of Samples’Correlation values can be both increased or decreased if we accidentally include two (or more) distinct sub-populations.

60 65 70 75 800

1

2

3

4

Height(in)

Vid

eo g

ame

play

ing

(mal

e)

n = 21, r = 0.10

60 65 70 75 800

1

2

3

4

Height(in)

Vid

eo g

ame

play

ing

(fem

ale)

n = 73, r = -0.12

60 65 70 75 800

1

2

3

4

Height(in)

Vid

eo g

ame

play

ing

(all)

n = 94, r = 0.20

0 5 105

6

7

8

9

caffeine(cups/day)

slee

p (h

ours

/nig

ht) n = 77, r = -0.24

Factor that influences r: (2) Outliers

Just like the mean, extreme values have a large influence on the calculation of correlation.

0 5 105

6

7

8

9

caffeine(cups/day)

slee

p (h

ours

/nig

ht) n = 76, r = -0.07

chapter 9: interpretative aspects of correlation and regression

Documents

y scores

regression fun facts

standard deviations