
Lecture 10

Correlation and Regression

Introduction to Correlation

We've been studying the differences between groups… Now, let's describe the relationship between variables.

Two “interwoven” areas of statistics concern the relationship between variables: correlation and regression

Correlation

Correlations describe the relationship between two variables
– They refer to the extent to which one variable is related to another variable
– The variables that are observed typically occur naturally; usually there is no manipulation

Correlations exist when:
– a change in one variable is accompanied by a consistent change in the other
– Requires 2 scores for each individual (usually identified by X and Y)

For example, when you applied to college you probably had to submit your SAT scores. Admissions counselors often look at SAT scores because there is a relationship between SAT scores and collegiate achievement: people who did poorly on the SATs tend to get poor grades in college, and people who did well on the SATs tend to get good grades in college.

Notation

Correlations are symbolized as r when referring to correlations based on samples
– Correlations are almost always based on samples

ρ (rho) is the symbol when referring to a correlation that is presumed to exist (or not, if a null hypothesis) in a population

Characteristics of Correlation

Direction

Form

Degree

Direction of the Relationship

There are three directions for correlations:
– Positive correlation: an increase in one variable is accompanied by an increase in the other. The variables go in the same direction: a low score on one goes with a low score on the other, a high score on one with a high score on the other. (sign is +)
– Negative correlation: an increase in one variable is accompanied by a decrease in the other. The variables go in opposite directions: a low score on one goes with a high score on the other. (sign is -)
– Zero correlation: there is no relationship between the two variables

Scatterplots

Scatterplots are graphic representations that display correlations - they allow you to see general patterns or trends

In scatterplots the X values are placed on the horizontal axis and the Y values are placed on the vertical axis

Each individual is plotted as a single data point

Examples

Positive correlation: "+"; same direction
– The time you spend in the shower each morning is positively correlated with the amount of your hot water bill. If you take long showers, your hot water bill is higher.

Negative correlation: "-"; different directions
– The number of hours you exercise each week is negatively correlated with your resting heart rate. If you exercise 3 hours per week your heart rate is lower (20 beats per half minute) than if you exercise 2.25 hours per week (30 beats per half minute).

Zero correlation
– As age increases there is no effect on your cholesterol.

Form of the Relationship

The most common form of a relationship is a straight line.

* Note: the red line is only added to serve as a visual aid; the actual data are the points in the figures

[Scatterplots: Reaction Time vs. Age; Mood Score vs. Dosage (mg)]

– However, other forms do exist and there are special correlation equations used to measure them. We’ll focus primarily on linear.

Degree of the Relationship

A correlation measures how well the data fit the specific form being considered.
– Remember the most common form is going to be linear, so the degree will measure how well the data points fit a straight line.

A perfect correlation is identified by a 1.0 or -1.0 and indicates a perfect fit. A correlation of 0 indicates no fit at all. Intermediate values represent the degree to which there is a consistent, predictable relationship in the data. (Nothing greater than 1 or less than -1.)

Remember to look at degree and direction separately (you can have a perfect negative correlation).

Degree of Relationship

Rough idea:
– r < 0.29 (small correlation, weak relationship)
– r = 0.3 - 0.49 (medium correlation / relationship)
– r = 0.5 - 1.0 (large correlation / strong relationship)

* Note: drawing a line around the data points, called an envelope, can help you see an overall trend, both in terms of direction and degree (the skinnier the envelope the stronger the correlation).

Why are Correlations Used?

(1) Prediction: If 2 variables are known to be related in a systematic way, we can use one variable to predict the other.
– Admissions counselors want to see the SATs of applicants to their college because they use them to predict how well that student might do if they are admitted. High SATs predict high grades. Prediction is not perfectly accurate, but helpful.

(2) Validity: Remember validity is the likelihood that the test we are using measures what we want it to measure.
– Say we come up with a new measure to test for depression. If our measure truly measures depression, it should correlate with other measures of depression.

(3) Reliability: Remember reliability is the likelihood that our measurement is stable.
– If I measure your weight on Tuesday it should be very similar on Wednesday and Thursday. If not, we would suspect that something is wrong with our scale. When reliability is high, our weight on Tuesday should be highly correlated with our weight on Wednesday and Thursday.

(4) Theory Verification: Many scientific theories make specific predictions about the relationship between variables.
– If I have a theory that the oldest child in the family earns the highest salary, then I should expect a correlation between birth order and salary, such that younger children earn less and the oldest child earns more.

Putting it Together

If the world were fair, would you expect a positive or negative relationship between number of hours worked each week and salary?

Data suggest that on average children from large families have lower IQs than children from small families. Do these data suggest a positive or negative correlation between family size and average IQ?

The Pearson Correlation

Pearson correlation: measures the degree and direction of the linear relationship between two variables.
– Assumptions: normal distribution and interval or ratio data
– By far the most commonly used (other correlations typically relate to this one)
– r = (degree to which X and Y vary together) / (degree to which X and Y vary separately)

OR
– r = (covariability of X and Y) / (variability of X and Y separately)

Calculating the Pearson Correlation: SP

To calculate the Pearson we need the Sum of Products of Deviations, SP (similar to SS - read Box 16.1 pg. 529)

SP measures the amount of covariability between 2 variables

Definitional Formula:
SP = Σ(X - MX)(Y - MY)

Computational Formula:
SP = ΣXY - (ΣX)(ΣY) / n

Note: like SS, because the computational formula works with the original scores and not deviations, it is usually easier. However, both formulas will always produce the same value for SP.
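To see the two formulas in action, here is a minimal Python sketch (mine, not part of the lecture) that computes SP both ways for the four pairs of scores used in the worked example that follows:

```python
# Sketch: computing SP (sum of products of deviations) two ways.
# Data are the four pairs of scores from the worked example below.
X = [1, 2, 4, 5]
Y = [3, 6, 4, 7]
n = len(X)

MX = sum(X) / n   # mean of X
MY = sum(Y) / n   # mean of Y

# Definitional formula: SP = sum of (X - MX)(Y - MY)
sp_def = sum((x - MX) * (y - MY) for x, y in zip(X, Y))

# Computational formula: SP = sum of XY - (sum of X)(sum of Y) / n
sp_comp = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / n

print(sp_def, sp_comp)   # both print 6.0
```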

Let's calculate SP

       Scores        Deviations              Products
X   Y        X - MX   Y - MY      (X - MX)(Y - MY)      XY
1   3          -2       -2               4                3
2   6          -1        1              -1               12
4   4           1       -1              -1               16
5   7           2        2               4               35
ΣX = 12  ΣY = 20                     SP = 6        ΣXY = 66

SP can be + / - / or 0 depending on the direction of the relationship

Definitional Formula: SP = Σ(X - MX)(Y - MY) = 6

Computational Formula: SP = ΣXY - (ΣX)(ΣY) / n = 66 - (12)(20)/4 = 66 - 60 = 6

Putting it Together: Pearson Correlation

r = SP / √(SSX · SSY)
– The numerator measures how X and Y vary together (sharing common variance) and the denominator measures how they vary separately.
– r is also called the correlation coefficient

Finish our calculation:

r = 6 / √(10 × 10) = 6 / 10 = .6
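Continuing with the same four pairs, a short sketch of my own that finishes the Pearson calculation and cross-checks it against numpy's built-in correlation function:

```python
import numpy as np

X = np.array([1, 2, 4, 5])
Y = np.array([3, 6, 4, 7])

SP  = np.sum((X - X.mean()) * (Y - Y.mean()))   # 6
SSX = np.sum((X - X.mean()) ** 2)               # 10
SSY = np.sum((Y - Y.mean()) ** 2)               # 10

r = SP / np.sqrt(SSX * SSY)
print(r)                        # 0.6
print(np.corrcoef(X, Y)[0, 1])  # numpy gives the same value
```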


[Scatterplot of the example data: X values vs. Y values]

Pearson Correlation With Z-scores

Conceptually the Pearson correlation measures an individual's location in an X distribution and his/her location in a Y distribution.
– z-scores are a precise way of identifying the location of a score.

Pearson with z-scores:
– r = Σ(zXzY) / n, where n = # of pairs, not # of scores
– In this formula zX identifies each individual's position within the X distribution and zY identifies each person's position in the Y distribution. The products determine the strength and direction (+/-).
– Good if you already have z-scores, otherwise a pain! So it is rarely used to calculate a correlation.

Example

X     zX      Y     zY      zXzY
1    -1.5     4    -1.5     2.25
3    -1.0     7    -1.0     1
5     -.5    10     -.5      .25
7      0     13      0       0
9      .5    16      .5      .25
11    1.0    19     1.0      1
13    1.5    22     1.5     2.25
                  Σ(zXzY) =  7

r = Σ(zXzY) / n

r = 7 / 7 = 1.0, so a perfect correlation
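A quick sketch (mine) of the z-score version for the table above; note that the z-scores use the population standard deviation (dividing by n), which is what makes them come out to -1.5 through 1.5:

```python
import numpy as np

X = np.array([1, 3, 5, 7, 9, 11, 13])
Y = np.array([4, 7, 10, 13, 16, 19, 22])
n = len(X)

zx = (X - X.mean()) / X.std()   # np.std defaults to the population SD (divide by n)
zy = (Y - Y.mean()) / Y.std()

r = np.sum(zx * zy) / n
print(zx)   # [-1.5 -1.  -0.5  0.   0.5  1.   1.5]
print(r)    # 1.0 -- a perfect positive correlation
```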

You Try One

X     X²     zX      Y     Y²     zY      XY
2      4    -1.5     4     16    -1.6      8
4     16    -1.1     8     64     -.82    32
6     36     -.65    8     64     -.82    48
8     64     -.22   12    144      0      96
10   100      .22   12    144      0     120
12   144      .65   16    256      .82   192
14   196     1.1    16    256      .82   224
16   256     1.5    20    400     1.63   320

Do both the raw score Pearson and the z-score to compare.

r = Σ(zXzY) / n        r = SP / √(SSX · SSY)
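If you want to check your answer afterward, here is a small sketch of my own that runs both versions on the practice data; the two results should match:

```python
import numpy as np

X = np.array([2, 4, 6, 8, 10, 12, 14, 16])
Y = np.array([4, 8, 8, 12, 12, 16, 16, 20])
n = len(X)

# Raw-score Pearson: r = SP / sqrt(SSX * SSY)
SP  = np.sum((X - X.mean()) * (Y - Y.mean()))
SSX = np.sum((X - X.mean()) ** 2)
SSY = np.sum((Y - Y.mean()) ** 2)
r_raw = SP / np.sqrt(SSX * SSY)

# z-score Pearson: r = sum(zx * zy) / n, using population z-scores
zx = (X - X.mean()) / X.std()
zy = (Y - Y.mean()) / Y.std()
r_z = np.sum(zx * zy) / n

print(r_raw, r_z)   # the two values should agree
```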

Correlation and Causation

(1) Correlation only describes a relationship. It does not tell us why.

Correlation ≠ Causation

Relationship between Churches and Crime:

[Scatterplot: # of churches vs. # of serious crimes]

Correlation and Range

Correlation is affected greatly by the range of scores represented in the data
– Use caution in interpreting a correlation where the full range is not used.

[Two scatterplots of y values vs. x values: one spanning x = 0-10 and one spanning only x = 0-6]

Outliers

Outliers: an individual score with an X and/or Y value that is substantially greater or smaller than the values obtained for the other individuals in the data set.
– An outlier can significantly alter the correlation

[Scatterplot: X values vs. Y values, including an outlier]

Interpreting the Strength of the Relationship

We already know the degree of the relationship between 2 variables can be between -1.0 and 1.0

How do we measure the strength of a correlation?
– r² measures the proportion of the variability in the data that is explained by the relationship between X and Y.
– Called the coefficient of determination - because it measures the proportion of variability in one variable that can be determined from the relationship with the other.
• e.g. if r = .80 then r² = .64, or 64% of the variability in X can be explained by Y.
• OR if r = .60 then r² = .36, or 36% of the variability in X can be explained by Y.
• Finally, when r = 1.0 then r² = 1.0, or 100% of the variability in X can be explained by Y.

Coefficient of Determination, r²

[Figure: circles for X and Y with increasing overlap (X & Y), illustrating shared variance]

Magnitude:
0.01 < r² < 0.09   small effect
0.09 < r² < 0.25   medium effect
r² > 0.25          large effect

Regression Toward the Mean

Regression toward the mean is a simple observation about correlations.
– When there is a less than perfect correlation between two variables, extreme scores (high or low) on one variable tend to be paired with less extreme scores on the second variable
• E.g. suppose we correlate the intelligence of fathers and their children. It is likely that if the father has an extremely low IQ, the child's IQ will be fairly close to the population mean IQ. If the child has an extremely high IQ, it is likely that the father's will be lower, closer to the mean. (However, there may be exceptions to this.)

Hypothesis Tests with Pearson

(1) State the Hypotheses (Remember we want to know if the correlation exists in the population or is due to chance)

* H0: ρ = 0    or, one-tailed: H0: ρ ≤ 0 or ρ ≥ 0
* H1: ρ ≠ 0    or, one-tailed: H1: ρ > 0 or ρ < 0

(2) Set the critical region in Table B.6 (the sample correlation must meet or exceed the critical value). You need:
– the sample size (n)
– the magnitude of the correlation (independent of sign)
– df

df for the Pearson is n - 2, where n is the sample size.
• This is because we have 2 restrictions on variability: the mean and also the relationship between X and Y
• Or, conceptually, you have to have 2 points to make a line

* Is the correlation from our example problem significant?

Sample size is particularly important when calculating correlations, because small samples can show a non-zero correlation when the population correlation is 0.
• See Table B.6 to further illustrate this point
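The lecture's decision rule uses the critical-r values in Table B.6. As a sketch of my own (not the book's procedure), an equivalent test converts r to a t statistic with df = n - 2 and gets the p-value from scipy:

```python
import numpy as np
from scipy import stats

def pearson_r_test(r, n):
    """Two-tailed test of H0: rho = 0 via t = r*sqrt(n - 2)/sqrt(1 - r**2),
    df = n - 2. Gives the same decision as the critical-r table."""
    df = n - 2
    t = r * np.sqrt(df) / np.sqrt(1 - r ** 2)
    p = 2 * stats.t.sf(abs(t), df)
    return t, p

# Our example: r = .60 based on n = 4 pairs (df = 2)
t, p = pearson_r_test(0.60, 4)
print(t, p)   # p is large, so with only 4 pairs r = .60 is not significant
```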

In the Literature

A correlation for the data revealed that amount of education and annual income were significantly related, r = +.65, n = 30, p < .01, two-tailed.

With correlation it is useful to report:
– The sample size
– The calculated value for the correlation
– Whether it is significant
– The probability
– The type of test (1- or 2-tailed)

Correlation Matrix

A Correlation Matrix is often used to visualize many correlations.
– A study might look at several variables; correlations between all possible variable pairings are computed.

n = 37
** p < .01, 2-tailed
* p < .05, 1-tailed

            Salary    IQ      Age     Education
Salary
IQ          0.65**
Age         0.55**   0.09
Education   0.7**    0.33*   0.17

* Which 2 variables correlate the most?

* Which 2 correlate the least?

* What is the correlation between Education and IQ?

What is the correlation coefficient for Age and Salary?

The Spearman Correlation

Uses:
(1) Measures the relationship between variables measured on an ordinal scale of measurement. A Pearson with ordinal data.
• Remember ordinal scales involve data that are placed into rank order
– e.g. It might be easy for a coach to rank order players by their athletic ability, but very difficult to measure athletic ability on some other scale

(2) Instead of measuring a linear relationship, it can measure consistency in a relationship, independent of form

NOTE: We've been using mostly parametric tests in this class. Parametric tests use actual numbers. Non-parametric tests look at ranks only.

Consistency of Relationship

How does the Spearman measure consistency?
– When 2 variables are consistently related, their ranks will be linearly related.
• E.g. a consistently positive relationship means that every time the X variable increases, the Y variable also increases. So the smallest value of X is paired with the smallest value of Y.

[Two scatterplots illustrating a consistent (monotonic) relationship]

How to Use the Spearman: Wanting to Measure Consistency

If the original data are ordinal we can plug them into our formula (yes, the same formula we used for the Pearson).

If we want to measure consistency independent of form and our original data are not ordinal:
– Convert the scores to ranks

* Note: When there is a consistently one-directional relationship between two variables it is said to be monotonic. The Spearman measures the degree to which there is a monotonic relationship between 2 variables.

Calculating the Spearman

(1) Start with ordinal data or rank the data.

(2) Using the same formula as the Pearson (but with the ranks), calculate your correlation.
– Ranking: the smallest X gets assigned a rank of 1, the second smallest a 2, and so on. The smallest Y gets assigned a rank of 1, the second smallest a 2, and so on.

The Spearman correlation is written rs, to differentiate it from the Pearson.

rs = SP / √(SSX · SSY)

Example

rs = SP / √(SSX · SSY)

X      rank X     Y     rank Y     (rank X)(rank Y)
140      6        63      1               6
120      2        70      4               8
136      5        72      6              30
100      1        69      3               3
129      4        65      2               8
125      3        71      5              15
       Σ = 21           Σ = 21          Σ = 70

Note - because the ranks and sums of the ranks are identical for X and Y the SS will be the same for both

SS = 17.5 for each

SP = -3.5 (unlike SS, SP can be negative)

rs = -3.5 / √(17.5 × 17.5) = -3.5 / 17.5 = -.2
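As a cross-check (my own sketch, not part of the lecture), scipy's built-in Spearman function gives the same answer when fed the original scores - it does the ranking internally:

```python
from scipy import stats

X = [140, 120, 136, 100, 129, 125]
Y = [63, 70, 72, 69, 65, 71]

rs, p = stats.spearmanr(X, Y)   # ranks the scores internally
print(rs)                       # -0.2, matching the hand calculation
```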

Example: What if We Have Tied Scores?

(1) List the scores in order from smallest to largest, including tied values.
(2) Assign a rank to each position.
(3) When 2 or more scores are tied, compute the mean of their ranked positions and give each tied score the mean rank.

X     rank X   final rank X     Y     rank Y   final rank Y     (final rank X)(final rank Y)
120     1         1.5           63      1          1                    1.5
120     2         1.5           65      2          2                    3
130     3         3             69      3          3.5                 10.5
140     4         4.5           69      4          3.5                 15.75
140     5         4.5           70      5          5                   22.5
150     6         6             72      6          6                   36

Finish finding the Spearman Correlation with the new ranks.
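The tie-handling rule (rank the positions, then average the ranks of tied scores) is exactly what scipy.stats.rankdata does by default; here is a sketch of mine you can use to verify the final ranks (and, once you have finished, the correlation):

```python
from scipy import stats

X = [120, 120, 130, 140, 140, 150]
Y = [63, 65, 69, 69, 70, 72]

rank_x = stats.rankdata(X)   # tied scores share the mean of their ranks
rank_y = stats.rankdata(Y)
print(rank_x)                # [1.5 1.5 3.  4.5 4.5 6. ]
print(rank_y)                # [1.  2.  3.5 3.5 5.  6. ]

# These final ranks can now be run through the Pearson formula (or spearmanr):
rs, p = stats.spearmanr(X, Y)
print(rs)
```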

Alternative Formula for the Spearman: Try it out!

rs = 1 - (6ΣD²) / (n(n² - 1))

X      rank X     Y     rank Y     D      D²
140      6        63      1        5      25
120      2        70      4       -2       4
136      5        72      6       -1       1
100      1        69      3       -2       4
129      4        65      2        2       4
125      3        71      5       -2       4
                        Σ = 21   Σ = 0   ΣD² = 42

* NOTE: for more on why this new formula can be used see page 546-547 in your book.
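A tiny sketch of my own plugging the ΣD² column from the table into the shortcut formula; it reproduces the rs we got earlier with the Pearson-style formula:

```python
# rs = 1 - (6 * sum(D^2)) / (n * (n^2 - 1))
D2 = [25, 4, 1, 4, 4, 4]          # squared rank differences from the table
n = 6

rs = 1 - (6 * sum(D2)) / (n * (n ** 2 - 1))
print(rs)   # -0.2, same as before
```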

Testing the Significance of the Spearman

The hypotheses are the same as with the Pearson.

* H0: ρ = 0    or, one-tailed: H0: ρ ≤ 0 or ρ ≥ 0
* H1: ρ ≠ 0    or, one-tailed: H1: ρ > 0 or ρ < 0

Only the table used to look up significance is different. See Table B.7
– Note, the only difference is that the first column requires sample size (n) rather than degrees of freedom.

* Is the correlation from our example problem significant?

Regression

Regression: a statistical procedure that describes the linear relationship of a correlation using an equation that identifies and defines the straight line providing the best fit for a specific data set (the regression line).
– If r = ±1 it's easy to predict & draw the line
– If r ≠ ±1 you must draw a "best fit" line

Uses:
– Makes the relationship between the 2 variables easier to see.
– The line identifies the center, or "central tendency," of the relationship.
– Prediction - The line establishes a precise relationship between each X value and a corresponding Y value.

Linear Relationship

Remember from algebra: Y = bX + a, where b = slope and a = the Y-intercept (remember the Y-intercept is the value of Y when X is 0).
– e.g. Y = 0.75X + 3

[Scatterplot: X values vs. Y values]

Least Squares

To determine how well a line fits the data points we have to calculate how much distance there is between the line and each data point. Best fit = smallest error.
– Distance = Y - Ŷ (where Ŷ is the predicted value of Y)
– Like SS, we need to square these distances to obtain a uniformly positive measure of error; otherwise the sum of the deviations = 0.
– Total squared error = Σ(Y - Ŷ)²
– Best fit = smallest total squared error, OR least squared error

Calculating Regression

To find the best fit line:

X     Y     XY
1     5      5
2     4      8
3     3      9
4     2      8
5     1      5
15   15     35

SP = -10

r = -1.0

b = -10 / 10 = -1

a = 3 - (-1)(3) = 6

Ŷ = -1X + 6

Ŷ = bX + a        b = SP / SSX        a = MY - bMX

r = SP / √(SSX · SSY)
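A short sketch (mine) of the b = SP / SSX and a = MY - bMX recipe for the table above, with numpy's least-squares fit as a cross-check:

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([5, 4, 3, 2, 1])

SP  = np.sum((X - X.mean()) * (Y - Y.mean()))   # -10
SSX = np.sum((X - X.mean()) ** 2)               # 10

b = SP / SSX                 # slope: -1.0
a = Y.mean() - b * X.mean()  # intercept: 6.0
print(b, a)                  # so Y-hat = -1X + 6

print(np.polyfit(X, Y, 1))   # least-squares fit gives the same [slope, intercept]
```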

Prediction With Regression

The predicted value from the equation will not be perfect unless r = ±1.0.

The regression equation should not be used to make predictions that fall outside the range of the values covered in the original data.

[Scatterplot: X values vs. Y values]

Drawing a Regression Line

Draw a regression line for Ŷ = -1X + 6
– Pick at least 3 reasonable values for X
– Plug them into the equation and solve for Y
– Plot the X, Y points
– Connect the dots with a line

[Graph for plotting the regression line Ŷ = -1X + 6]

Standard Error of the Estimate

Accuracy depends on how well the line fits the points (it's possible to have the same regression line for 2 different data sets where one has very little error and the other a lot; see page 550 in the book).

The standard error of estimate gives a measure of the standard distance between a regression line and the actual data points.
– SSerror = Σ(Y - Ŷ)²
– Variance = SSerror / df
– Standard error of estimate = √variance
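Here is a helper of my own (not from the book) that wraps the whole recipe: fit the line, then take √(SSerror / df) using df = n - 2 (two degrees of freedom are lost to estimating the slope and the intercept). You can use it to check the practice problem below.

```python
import numpy as np

def regression_with_se(X, Y):
    """Fit Y-hat = bX + a and return (b, a, standard error of estimate),
    where the standard error of estimate = sqrt(SSerror / (n - 2))."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    n = len(X)
    SP  = np.sum((X - X.mean()) * (Y - Y.mean()))
    SSX = np.sum((X - X.mean()) ** 2)
    b = SP / SSX
    a = Y.mean() - b * X.mean()
    Y_hat = b * X + a
    ss_error = np.sum((Y - Y_hat) ** 2)
    se = np.sqrt(ss_error / (n - 2))
    return b, a, se

# On the earlier example (r = -1.0) the line fits perfectly, so the error is 0:
print(regression_with_se([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))   # (-1.0, 6.0, 0.0)
```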

Let's Do One

Calculate a regression line and the standard error of the estimate for the following data:

X:   2   4   6   8   10   12
Y:   4   6   6  10   10   12

SSerror = Σ(Y - Ŷ)²
Variance = SSerror / df
Standard error of estimate = √variance

Ŷ = bX + a        b = SP / SSX        a = MY - bMX

r = SP / √(SSX · SSY)

Homework: Chapter 16

1, 2, 5, 6, 8, 11, 12, 13, 15, 18, 25, 26, 27