Lecture 10: Correlation and Regression
Introduction to Correlation We've been studying the differences between groups. Now, let's describe the relationship between variables.
Two "interwoven" areas of statistics concern the relationship between variables: correlation and regression.
Correlation Correlations describe the relationship between two variables.
– They refer to the extent to which one variable is related to another variable.
– The variables are typically observed as they occur naturally; usually there is no manipulation.
Correlations exist when:
– a change in one variable is accompanied by a consistent change in the other.
– Requires 2 scores for each individual (usually identified as X and Y).
For example, when you applied to college you probably had to submit your SAT scores. Admissions counselors often look at SAT scores because there is a relationship between SAT scores and collegiate achievement: people who did poorly on the SATs tend to get poor grades in college, and people who did well on the SATs tend to get good grades in college.
Notation Correlations are symbolized as r when referring to correlations based on samples.
– Correlations are almost always based on samples.
ρ (rho) is the symbol when referring to a correlation that is presumed to exist (or not, under a null hypothesis) in a population.
Direction of the Relationship There are three directions for correlations:
– Positive correlation: an increase in one variable is accompanied by an increase in the other. The variables go in the same direction: a low score on one pairs with a low score on the other, a high score on one with a high score on the other. (Sign is +.)
– Negative correlation: an increase in one variable is accompanied by a decrease in the other. The variables go in opposite directions: a low score on one pairs with a high score on the other. (Sign is −.)
– Zero correlation: there is no relationship between the two variables.
Scatterplots Scatterplots are graphic representations that display correlations and let you see general patterns or trends.
In a scatterplot the X values are placed on the horizontal axis and the Y values on the vertical axis.
Each individual is plotted as a single data point.
Positive correlation ("+"; same direction): the time you spend in the shower each morning is positively correlated with the amount of your hot water bill. If you take long showers, your hot water bill is higher.
Negative correlation ("−"; different directions): the number of hours you exercise each week is negatively correlated with your resting heart rate. If you exercise 3 hours per week your heart rate is lower (say 20 beats per half minute) than if you exercise 2.25 hours per week (say 30 beats per half minute).
Form of the Relationship The most common form of a relationship is a
straight line.
* Note: the red line is only added to serve as a visual aid; the actual data are the points in the figures.
[Scatterplots: Reaction Time (100–300) vs. Age (10–70), and Mood Score (0–30) vs. Dosage in mg (10–60).]
– However, other forms do exist, and there are special correlation equations used to measure them. We'll focus primarily on linear relationships.
Degree of the Relationship A correlation measures how well the data fit the specific form being considered.
– Remember, the most common form is linear, so the degree measures how well the data points fit a straight line.
A perfect correlation is identified by 1.0 or −1.0 and indicates a perfect fit. A correlation of 0 indicates no fit at all. Intermediate values represent the degree to which there is a consistent, predictable relationship in the data. (Nothing greater than 1 or less than −1.)
Remember to look at degree and direction separately (you can have a perfect negative correlation).
Degree of Relationship Rough idea:
– |r| < 0.3: small correlation, weak relationship
– |r| = 0.3–0.49: medium correlation/relationship
– |r| = 0.5–1.0: large correlation, strong relationship
* Note: drawing a line around the data points, called an envelope, can help you see the overall trend, both in direction and degree (the skinnier the envelope, the stronger the correlation).
Why are Correlations Used? (1) Prediction: if 2 variables are known to be related in a systematic way, we can use one variable to predict the other.
– Admissions counselors want to see the SATs of applicants to their college because they use them to predict how well a student might do if admitted. High SATs predict high grades. Prediction is not perfectly accurate, but it is helpful.
(2) Validity: remember, validity is the likelihood that the test we are using measures what we want it to measure.
– Say we come up with a new measure to test for depression. If our measure truly measures depression, it should correlate with other measures of depression.
Why are Correlations Used? (3) Reliability: remember, reliability is the likelihood that our measurement is stable.
– If I measure your weight on Tuesday, it should be very similar on Wednesday and Thursday. If not, we would suspect that something is wrong with our scale. When reliability is high, your weight on Tuesday should be highly correlated with your weight on Wednesday and Thursday.
(4) Theory verification: many scientific theories make specific predictions about the relationship between variables.
– If I have a theory that the oldest child in the family earns the highest salary, then I should expect a correlation between birth order and salary, such that later-born children earn less and the oldest children earn more.
Putting it Together If the world were fair, would you expect a
positive or negative relationship between number of hours worked each week and salary?
Data suggest that on average children from large families have lower IQs than children from small families. Do these data suggest a positive or negative correlation between family size and average IQ?
The Pearson Correlation The Pearson correlation measures the degree and direction of a linear relationship between two variables.
– Assumptions: normal distributions and interval or ratio data.
– By far the most commonly used (other correlations typically relate back to this one).
– r = (degree to which X and Y vary together) / (degree to which X and Y vary separately)
OR
– r = (covariability of X and Y) / (variability of X and Y separately)
Calculating the Pearson Correlation: SP
To calculate the Pearson we need the Sum of Products of deviations, SP (similar to SS; read Box 16.1, pg. 529).
SP measures the amount of covariability between the 2 variables.
Definitional formula: SP = Σ(X − Mx)(Y − My)
Computational formula: SP = ΣXY − (ΣX ΣY / n)
Note: as with SS, the computational formula is usually easier because it works with the original scores and not deviations. However, both formulas will always produce the same value for SP.
Let's calculate SP (Mx = 12/4 = 3, My = 20/4 = 5)

X       Y       X − Mx  Y − My  (X − Mx)(Y − My)  XY
1       3       −2      −2      4                 3
2       6       −1      1       −1                12
4       4       1       −1      −1                16
5       7       2       2       4                 35
Σ = 12  Σ = 20                  Σ = 6             Σ = 66

SP can be +, −, or 0 depending on the direction of the relationship.
Definitional formula: SP = Σ(X − Mx)(Y − My) = 6
Computational formula: SP = ΣXY − (ΣX ΣY / n) = 66 − (12 × 20 / 4) = 6
Putting it Together: Pearson Correlation r = SP / √(SSx SSy)
– The numerator measures how X and Y vary together (shared, common variance) and the denominator measures how they vary separately.
– r is also called the correlation coefficient.
Finish our calculation:
r = 6 / √(10 × 10) = 6 / 10 = .6
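The arithmetic above can be checked in plain Python (a minimal sketch, not from the book, using both the definitional and computational formulas for SP):

```python
from math import sqrt

X = [1, 2, 4, 5]
Y = [3, 6, 4, 7]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

# Definitional formula: SP = sum of (X - Mx)(Y - My)
sp_def = sum((x - mx) * (y - my) for x, y in zip(X, Y))

# Computational formula: SP = sum(XY) - (sum(X) * sum(Y) / n)
sp_comp = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / n

# Pearson r = SP / sqrt(SSx * SSy)
ss_x = sum((x - mx) ** 2 for x in X)
ss_y = sum((y - my) ** 2 for y in Y)
r = sp_def / sqrt(ss_x * ss_y)

print(sp_def, sp_comp, ss_x, ss_y, r)  # 6.0 6.0 10.0 10.0 0.6
```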
Pearson Correlation With z-Scores Conceptually, the Pearson correlation compares an individual's location in the X distribution with his/her location in the Y distribution.
– z-scores are a precise way of identifying the location of a score.
Pearson with z-scores:
– r = Σ(zx zy) / n, where n = number of pairs, not scores
– In this formula zx identifies each individual's position within the X distribution and zy identifies each person's position in the Y distribution. The products determine the strength and direction (+/−).
– Good if you already have z-scores, otherwise a pain! So it is rarely used to calculate a correlation.
Example

X   zx    Y   zy    zx zy
1   −1.5  4   −1.5  2.25
3   −1.0  7   −1.0  1.00
5   −0.5  10  −0.5  0.25
7   0     13  0     0
9   0.5   16  0.5   0.25
11  1.0   19  1.0   1.00
13  1.5   22  1.5   2.25
                    Σ = 7

r = Σ(zx zy) / n
r = 7 / 7 = 1.0, so a perfect correlation
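The z-score version of the formula, r = Σ(zx zy) / n, can be sketched the same way (using population z-scores, i.e. σ = √(SS/n), as in the slide):

```python
from math import sqrt

def z_scores(values):
    # Population z-scores: z = (x - mean) / sigma, with sigma = sqrt(SS / n)
    n = len(values)
    m = sum(values) / n
    sd = sqrt(sum((v - m) ** 2 for v in values) / n)
    return [(v - m) / sd for v in values]

X = [1, 3, 5, 7, 9, 11, 13]
Y = [4, 7, 10, 13, 16, 19, 22]

zx, zy = z_scores(X), z_scores(Y)
r = sum(a * b for a, b in zip(zx, zy)) / len(X)   # r = sum(zx * zy) / n
print(round(r, 4))  # 1.0
```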
You Try One

X   X²   zx     Y   Y²   zy     XY
2   4    −1.53  4   16   −1.63  8
4   16   −1.09  8   64   −0.82  32
6   36   −0.65  8   64   −0.82  48
8   64   −0.22  12  144  0      96
10  100  0.22   12  144  0      120
12  144  0.65   16  256  0.82   192
14  196  1.09   16  256  0.82   224
16  256  1.53   20  400  1.63   320
Do both the raw-score Pearson and the z-score Pearson to compare.
r = Σ(zx zy) / n    r = SP / √(SSx SSy)
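If you want to check your work on the exercise, here is a sketch that computes the correlation both ways and confirms they agree (they should, since the z-score formula is algebraically the same as the raw-score one):

```python
from math import sqrt

X = [2, 4, 6, 8, 10, 12, 14, 16]
Y = [4, 8, 8, 12, 12, 16, 16, 20]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

# Raw-score Pearson: r = SP / sqrt(SSx * SSy)
sp = sum((x - mx) * (y - my) for x, y in zip(X, Y))
ssx = sum((x - mx) ** 2 for x in X)
ssy = sum((y - my) ** 2 for y in Y)
r_raw = sp / sqrt(ssx * ssy)

# z-score Pearson: r = sum(zx * zy) / n, with population sigmas
sx, sy = sqrt(ssx / n), sqrt(ssy / n)
r_z = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(X, Y)) / n

print(round(r_raw, 3), round(r_z, 3))  # 0.98 0.98
```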
Correlation and Causation (1) Correlation only describes a relationship. It does not tell us why.
Correlation ≠ Causation
[Scatterplot: number of serious crimes (0–70) vs. number of churches (0–70).]
Correlation and Range
Correlation is affected greatly by the range of scores represented in the data.
– Use caution in interpreting a correlation where the full range is not used.
[Scatterplots: y values (0–7) vs. x values over the full range (0–10) and over a restricted range (0–6).]
Outliers An outlier is an individual score with an X and/or Y value that is substantially greater or smaller than the values obtained for the other individuals in the data set.
– An outlier can significantly alter the correlation.
[Scatterplot: Y values (0–14) vs. X values (0–16), illustrating an outlier.]
Interpreting the Strength of the Relationship We already know the degree of the relationship between 2 variables falls between −1.0 and 1.0.
How do we measure the strength of a correlation?
– r² measures the proportion of the variability in the data that is explained by the relationship between X and Y.
– It is called the coefficient of determination because it measures the proportion of variability in one variable that can be determined from its relationship with the other.
• e.g., if r = .80 then r² = .64, so 64% of the variability in X can be explained by Y.
• Or if r = .60 then r² = .36, so 36% of the variability in X can be explained by Y.
• Finally, when r = 1.0 then r² = 1.0, so 100% of the variability in X can be explained by Y.
Coefficient of Determination, r²
[Venn diagrams: the overlap between the X and Y circles represents the variability shared by X and Y.]
Magnitude: 0.01 < r² < 0.09, small effect; 0.09 < r² < 0.25, medium effect; r² > 0.25, large effect.
Regression Toward the Mean Regression toward the mean is a simple observation about correlations.
– When there is a less-than-perfect correlation between two variables, extreme scores (high or low) on one variable tend to be paired with less extreme scores on the second variable.
• e.g., suppose we correlate the intelligence of fathers and their children. If a father has an extremely low IQ, the child's IQ will likely be fairly close to the population mean IQ. If a child has an extremely high IQ, the father's is likely to be lower, closer to the mean. (However, there may be exceptions to this.)
Hypothesis Tests with Pearson (1) State the hypotheses (remember, we want to know whether the correlation exists in the population or is due to chance):
* H₀: ρ = 0 (two-tailed), or H₀: ρ ≤ 0 or ρ ≥ 0 (one-tailed)
* H₁: ρ ≠ 0 (two-tailed), or H₁: ρ > 0 or ρ < 0 (one-tailed)
(2) Set the critical region using Table B.6 (the sample correlation must meet or exceed the critical value). You need the sample size (n), the magnitude of the correlation (independent of sign), and df.
– df for the Pearson is n − 2, where n is sample size.
• This is because we have 2 restrictions on variability: the mean and also the relationship between X and Y.
• Or, conceptually, you have to have 2 points to make a line.
* Is the correlation from our example problem significant?
Sample size is particularly important when calculating correlations, because small samples can show a sizable non-zero correlation even when the population correlation is 0.
• See Table B.6 to further illustrate this point.
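Instead of the critical-value table, an equivalent check (not the book's method, but standard) converts r to a t statistic with df = n − 2. A minimal sketch, using the r = .6, n = 4 example from earlier:

```python
from math import sqrt

# r = .6 from the worked example above, with only n = 4 pairs
r, n = 0.6, 4
df = n - 2

# t statistic for testing H0: rho = 0
t = r * sqrt(df / (1 - r ** 2))
print(round(t, 3))  # 1.061

# The critical t for df = 2, alpha = .05, two-tailed is 4.303, so
# r = .6 with 4 pairs is NOT significant -- exactly the small-sample
# caution noted above.
```

The same conclusion would come from Table B.6: with so few pairs, even a fairly large sample correlation is consistent with ρ = 0.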
In the Literature
A correlation for the data revealed that amount of education and annual income were significantly related, r = +.65, n = 30, p < .01, two-tailed.
With a correlation it is useful to report:
– the sample size
– the calculated value of the correlation
– whether it is significant
– the probability
– the type of test (1- or 2-tailed)
Correlation Matrix A correlation matrix is often used to visualize many correlations.
– A study might look at several variables, and correlations between all possible variable pairings are computed.
n = 37
** p < .01, 2-tailed
* p < .05, 1-tailed

           Salary  IQ     Age   Education
Salary
IQ         0.65**
Age        0.55**  0.09
Education  0.70**  0.33*  0.17

* Which 2 variables correlate the most?
* Which 2 correlate the least?
* What is the correlation between Education and IQ?
What is the correlation coefficient for Age and Salary?
The Spearman Correlation Uses: (1) It measures the relationship between variables measured on an ordinal scale of measurement (a Pearson with ordinal data).
• Remember, ordinal scales involve data that are placed into rank order.
– e.g., it might be easy for a coach to rank-order players by their athletic ability, but very difficult to measure athletic ability on some other scale.
(2) Instead of measuring a linear relationship, it can measure consistency in a relationship, independent of form.
NOTE: We've been using mostly parametric tests in this class. Parametric tests use the actual numbers. Non-parametric tests look at ranks only.
Consistency of Relationship How does the Spearman measure consistency?
– When 2 variables are consistently related, their ranks will be linearly related.
• e.g., a consistently positive relationship means that every time the X variable increases, the Y variable also increases. So the smallest value of X is paired with the smallest value of Y.
[Scatterplots: a curved but consistently increasing relationship between X and Y, and the corresponding ranks, which fall on a straight line.]
How to Use the Spearman: Wanting to Measure Consistency If the original data are ordinal, we can plug them into our formula (yes, the same formula we used for the Pearson).
If we want to measure consistency independent of form and our original data are not ordinal:
– convert the scores to ranks.
* Note: when there is a consistently one-directional relationship between two variables, it is said to be monotonic. The Spearman measures the degree to which there is a monotonic relationship between 2 variables.
Calculating the Spearman (1) Start with ordinal data, or rank the data.
(2) Using the same formula as the Pearson (on the ranks), calculate your correlation.
– Ranking: the smallest X gets assigned a rank of 1, the second smallest a 2, and so on. The smallest Y gets assigned a rank of 1, the second smallest a 2, and so on.
The Spearman correlation is written rs, to differentiate it from the Pearson.
r = SP / √(SSx SSy)
Example: rs = SP / √(SSx SSy)

X    rank X  Y   rank Y  (rank X)(rank Y)
140  6       63  1       6
120  2       70  4       8
136  5       72  6       30
100  1       69  3       3
129  4       65  2       8
125  3       71  5       15
     Σ = 21      Σ = 21  Σ = 70

Note: because the ranks and sums of the ranks are identical for X and Y, SS will be the same for both.
SS = 17.5 for each
SP = 70 − (21 × 21 / 6) = −3.5 (unlike SS, SP can be negative)
rs = −3.5 / √(17.5 × 17.5) = −3.5 / 17.5 = −.2
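The rank-then-Pearson recipe above can be sketched in plain Python (a minimal version that assumes no tied scores, as in this example):

```python
from math import sqrt

X = [140, 120, 136, 100, 129, 125]
Y = [63, 70, 72, 69, 65, 71]

def to_ranks(values):
    # Smallest score gets rank 1, next smallest rank 2, and so on.
    # (Assumes no ties; tied scores are handled on the next slide.)
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

rx, ry = to_ranks(X), to_ranks(Y)

# Apply the ordinary Pearson formula r = SP / sqrt(SSx * SSy) to the ranks
n = len(rx)
mx, my = sum(rx) / n, sum(ry) / n
sp = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
ssx = sum((a - mx) ** 2 for a in rx)
ssy = sum((b - my) ** 2 for b in ry)
rs = sp / sqrt(ssx * ssy)
print(sp, ssx, ssy, round(rs, 2))  # -3.5 17.5 17.5 -0.2
```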
Example: What if We Have Tied Scores (1) List the scores in order from smallest to largest, including tied values. (2) Assign a rank to each position. (3) When 2 or more scores are tied, compute the mean of their positions and give each tied score the mean rank.

X    position  final rank X  Y   position  final rank Y  (rank X)(rank Y)
120  1         1.5           63  1         1             1.5
120  2         1.5           65  2         2             3
130  3         3             69  3         3.5           10.5
140  4         4.5           69  4         3.5           15.75
140  5         4.5           70  5         5             22.5
150  6         6             72  6         6             36

Finish finding the Spearman correlation with the new ranks.
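The tied-rank rule (steps 1–3 above) can be sketched as a small helper; `ranks_with_ties` is my name for it, not the book's:

```python
def ranks_with_ties(values):
    # Sort the scores; each score occupies one or more consecutive
    # positions. Tied scores all receive the mean of their positions.
    order = sorted(values)
    ranks = []
    for v in values:
        first = order.index(v) + 1           # first position occupied (1-based)
        last = first + order.count(v) - 1    # last position occupied
        ranks.append((first + last) / 2)     # mean of the tied positions
    return ranks

X = [120, 120, 130, 140, 140, 150]
Y = [63, 65, 69, 69, 70, 72]
print(ranks_with_ties(X))  # [1.5, 1.5, 3.0, 4.5, 4.5, 6.0]
print(ranks_with_ties(Y))  # [1.0, 2.0, 3.5, 3.5, 5.0, 6.0]
```

These match the "final rank" columns in the table, and can be fed straight into the rank-based Pearson formula.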
Alternative Formula for the Spearman: Try it out!
rs = 1 − (6ΣD²) / (n(n² − 1)), where D = rank X − rank Y

X    rank X  Y   rank Y  D       D²
140  6       63  1       5       25
120  2       70  4       −2      4
136  5       72  6       −1      1
100  1       69  3       −2      4
129  4       65  2       2       4
125  3       71  5       −2      4
             Σ = 21      ΣD = 0  ΣD² = 42

* NOTE: for more on why this new formula can be used, see pages 546–547 in your book.
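A quick sketch of the difference formula on the same ranks, to confirm it agrees with the rank-based Pearson result (rs = −.2):

```python
rank_X = [6, 2, 5, 1, 4, 3]
rank_Y = [1, 4, 6, 3, 2, 5]
n = len(rank_X)

# D = rank X - rank Y for each pair; the formula uses the sum of D squared
d_sq = sum((a - b) ** 2 for a, b in zip(rank_X, rank_Y))

rs = 1 - (6 * d_sq) / (n * (n ** 2 - 1))
print(d_sq, round(rs, 2))  # 42 -0.2
```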
Testing the Significance of the Spearman
The hypotheses are the same as with the Pearson:
* H₀: ρ = 0 (two-tailed), or H₀: ρ ≤ 0 or ρ ≥ 0 (one-tailed)
* H₁: ρ ≠ 0 (two-tailed), or H₁: ρ > 0 or ρ < 0 (one-tailed)
Only the table used to look up significance is different: see Table B.7.
– Note, the only difference is that the first column gives the sample size (n) instead of degrees of freedom.
* Is the correlation from our example problem significant?
Regression Regression is a statistical procedure that describes the linear relationship of a correlation using an equation that identifies and defines the straight line providing the best fit for a specific data set (the regression line).
– If r = ±1, it's easy to predict and draw the line.
– If r ≠ ±1, you must draw a "best fit" line.
Uses:
– Makes the relationship between the 2 variables easier to see.
– The line identifies the center, or "central tendency", of the relationship.
– Prediction: the line establishes a precise relationship between each X value and a corresponding Y value.
Linear Relationship Remember from algebra: Y = bX + a, where b = slope and a = the Y-intercept (remember, the Y-intercept is the value of Y when X is 0).
– Y = 0.75X + 3
[Scatterplot: Y values (0–8) vs. X values (0–6) with the line Y = 0.75X + 3.]
Least Squares To determine how well a line fits the data points, we calculate how much distance there is between the line and each data point. Best fit = smallest error.
– Distance = Y − Ŷ (where Ŷ is the predicted value of Y)
– As with SS, we need to square these distances to obtain a uniformly positive measure of error; otherwise the deviations sum to 0.
– Total squared error = Σ(Y − Ŷ)²
– Best fit = smallest total squared error, or least squared error.
Calculating Regression To find the best-fit line:
Ŷ = bX + a, where b = SP / SSx and a = My − bMx

X       Y       XY
1       5       5
2       4       8
3       3       9
4       2       8
5       1       5
Σ = 15  Σ = 15  Σ = 35

SP = 35 − (15 × 15 / 5) = −10
r = SP / √(SSx SSy) = −10 / √(10 × 10) = −1.0
b = −10 / 10 = −1
a = 3 − (−1)(3) = 6
Ŷ = −1X + 6
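The regression recipe above (b = SP / SSx, a = My − bMx) can be sketched in plain Python on the same data:

```python
X = [1, 2, 3, 4, 5]
Y = [5, 4, 3, 2, 1]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

# Sum of products and sum of squares for X
sp = sum((x - mx) * (y - my) for x, y in zip(X, Y))
ssx = sum((x - mx) ** 2 for x in X)

b = sp / ssx        # slope
a = my - b * mx     # Y-intercept

def predict(x):
    # Y-hat = bX + a
    return b * x + a

print(b, a, predict(2))  # -1.0 6.0 4.0
```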
Prediction With Regression The predicted value from the equation will not be perfect unless r = ±1.0.
The regression equation should not be used to make predictions that fall outside the range of the values covered in the original data.
[Scatterplot: Y values (0–8) vs. X values (0–6) with the fitted regression line.]
Drawing a Regression Line Draw a regression line for Ŷ = −1X + 6:
– Pick at least 3 reasonable values for X.
– Plug them into the equation and solve for Ŷ.
– Plot the (X, Ŷ) points.
– Connect the dots with a line.
[Plot: the line Ŷ = −1X + 6 drawn through the computed points, X from 0 to 6.]
Standard Error of the Estimate Accuracy depends on how well the line fits the points. (It's possible for two different data sets to have the same regression line where one has very little error and the other a lot; see page 550 in the book.)
The standard error of estimate gives a measure of the standard distance between the regression line and the actual data points:
– SSerror = Σ(Y − Ŷ)²
– variance = SSerror / df, where df = n − 2
– standard error of estimate = √variance
Let's Do One Calculate the regression line and the standard error of the estimate for the following data:

X: 2, 4, 6, 8, 10, 12
Y: 4, 6, 6, 10, 10, 12

SSerror = Σ(Y − Ŷ)²
variance = SSerror / df
standard error of estimate = √variance
Ŷ = bX + a, where b = SP / SSx and a = My − bMx
r = SP / √(SSx SSy)
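One way to check your answer: a plain-Python sketch of the whole exercise, finding the line and then the standard error of the estimate (variable names are mine, not the book's):

```python
from math import sqrt

X = [2, 4, 6, 8, 10, 12]
Y = [4, 6, 6, 10, 10, 12]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

# Regression line: b = SP / SSx, a = My - b * Mx
sp = sum((x - mx) * (y - my) for x, y in zip(X, Y))
ssx = sum((x - mx) ** 2 for x in X)
b = sp / ssx
a = my - b * mx

# Standard error of the estimate: sqrt(SSerror / df), df = n - 2
pred = [b * x + a for x in X]                         # Y-hat values
ss_error = sum((y, yh) and (y - yh) ** 2 for y, yh in zip(Y, pred))
variance = ss_error / (n - 2)
se_est = sqrt(variance)

print(round(b, 2), round(a, 2), round(ss_error, 2), round(se_est, 3))
# 0.8 2.4 3.2 0.894
```

So the line is Ŷ = 0.8X + 2.4, with a standard error of estimate of about 0.89.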