Linear Regression
PSYC 6130, PROF. J. ELDER 2
Correlation vs Regression: What’s the Difference?
• Correlation measures how strongly two variables are related.
• Regression provides a means for predicting the value of one variable based on the value of a related variable.
• The underlying mathematics are the same.
• Here we are dealing only with linear correlation and linear regression.
Optimal Prediction using z Scores
• Consider 2 variables X and Y that may be related in some way.
– e.g.,
• X = midterm score, Y = final exam score
• X = reaction time, Y = error rate
• Suppose you know X for a particular case (e.g., individual, trial). What is your best guess at Y?
• The answer turns out to be pretty simple:
$$\hat{z}_Y = r\, z_X$$
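As a minimal sketch of the z-score rule, the function below multiplies the z-score of X by r. The function name `predict_z_y` and the input values are hypothetical, chosen only for illustration.

```python
def predict_z_y(z_x: float, r: float) -> float:
    """Predicted z-score of Y given the z-score of X and the correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    return r * z_x


# Hypothetical inputs: X is 2 standard deviations above its mean and r = 0.5.
z_y_hat = predict_z_y(2.0, 0.5)
print(z_y_hat)  # 1.0: the prediction sits closer to the mean than X does
```

Unless r is exactly 1 or -1, the predicted z-score is always closer to zero than the z-score of X.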
Example: 6130A 2005-06 Assignment marks

Assignment 1 (X)    Assignment 2 (Y)
86.7%               81.8%
81.5%               82.4%
85.0%               84.3%
85.5%               86.8%
90.2%               83.6%
95.4%               87.4%
91.9%               93.1%
93.1%               93.1%
94.8%               91.8%
93.6%               93.7%
94.8%               93.1%
94.2%               94.3%
94.8%               95.6%

Mean                90.9%    89.3%
Sample Std. Dev.    4.66%    5.04%
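The summary statistics can be recomputed from the tabulated marks. The sketch below uses only the Python standard library; the variable and function names are illustrative. Because the tabulated marks are rounded to one decimal place, the correlation obtained here may differ slightly from the r = 0.7998 reported on the slides.

```python
import math
import statistics

# Assignment marks (percent) as tabulated on the slide.
assignment_1 = [86.7, 81.5, 85.0, 85.5, 90.2, 95.4, 91.9,
                93.1, 94.8, 93.6, 94.8, 94.2, 94.8]
assignment_2 = [81.8, 82.4, 84.3, 86.8, 83.6, 87.4, 93.1,
                93.1, 91.8, 93.7, 93.1, 94.3, 95.6]


def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


mean_1 = statistics.mean(assignment_1)
mean_2 = statistics.mean(assignment_2)
sd_1 = statistics.stdev(assignment_1)  # sample SD (N - 1 denominator)
sd_2 = statistics.stdev(assignment_2)
r = pearson_r(assignment_1, assignment_2)

print(f"means: {mean_1:.1f}, {mean_2:.1f}")
print(f"sample SDs: {sd_1:.2f}, {sd_2:.2f}")
print(f"r: {r:.2f}")
```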
Graphical Representation
[Figure: scatter plot "PSYC 6130A 2005-06" of Assignment 2 z-score (vertical axis, -3 to 3) against Assignment 1 z-score (horizontal axis, -3 to 3), with the regression line $\hat{z}_Y = 0.7998\, z_X$.]
The Raw-Score Regression Formula
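In terms of population parameters:
$$\hat{Y} = a_{YX} + b_{YX} X \quad \text{or} \quad \hat{Y} = \mu_Y + r \frac{\sigma_Y}{\sigma_X} (X - \mu_X)$$
where
$$b_{YX} = r \frac{\sigma_Y}{\sigma_X}, \qquad a_{YX} = \mu_Y - b_{YX}\, \mu_X$$

In terms of sample statistics:
$$\hat{Y} = a_{YX} + b_{YX} X \quad \text{or} \quad \hat{Y} = \bar{Y} + r \frac{s_Y}{s_X} (X - \bar{X})$$
where
$$b_{YX} = r \frac{s_Y}{s_X}, \qquad a_{YX} = \bar{Y} - b_{YX}\, \bar{X}$$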
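A minimal sketch of the sample-statistic form, applied to the rounded summary values reported on the slides (r = 0.7998, means 90.9% and 89.3%, sample SDs 4.66% and 5.04%). The function name `raw_score_line` and the Assignment 1 mark of 95% are hypothetical. Because the inputs are rounded, the slope and intercept come out close to, but not exactly equal to, the 0.867 and 10.5% shown on the regression plot.

```python
def raw_score_line(r, mean_x, mean_y, sd_x, sd_y):
    """Return (a_yx, b_yx) for the line Y-hat = a_yx + b_yx * X."""
    b_yx = r * sd_y / sd_x
    a_yx = mean_y - b_yx * mean_x
    return a_yx, b_yx


# Rounded summary statistics from the slides, in percent units.
a_yx, b_yx = raw_score_line(r=0.7998, mean_x=90.9, mean_y=89.3,
                            sd_x=4.66, sd_y=5.04)

# Hypothetical case: a student who scored 95% on Assignment 1.
predicted = a_yx + b_yx * 95.0

print(f"slope b_YX = {b_yx:.3f}, intercept a_YX = {a_yx:.2f}%")
print(f"predicted Assignment 2 mark: {predicted:.1f}%")
```

By construction the line passes through the point of means, so a student at the Assignment 1 mean is predicted to score at the Assignment 2 mean.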
Graphical Representation
[Figure: scatter plot "PSYC 6130 Section A 2005-2006" of Assignment 2 grade (vertical axis, 75% to 100%) against Assignment 1 grade (horizontal axis, 80% to 100%), with the regression line y = 0.867x + 10.5%, i.e., $b_{YX} = 0.867$ and $a_{YX} = 10.5\%$.]
Residuals
• The deviations of the actual Y values from the Y values predicted by the regression line are called residuals.
• The regression line minimizes the sum of squared residuals (and hence is called a least-squares fit).
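[Figure: scatter plot "PSYC 6130 Section A 2005-2006" of Assignment 2 grade (vertical axis, 75% to 100%) against Assignment 1 grade (horizontal axis, 80% to 100%), annotated with $Y$, $\hat{Y}$ and the residual.]
$$\text{residual} = Y - \hat{Y}$$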
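The minimizing property can be checked numerically on the assignment marks. In this sketch the slope is computed as the sum of cross-products over the sum of squared X deviations, which is algebraically the same as r times s_Y / s_X. The function names and the sizes of the perturbations are arbitrary choices for illustration.

```python
import statistics

x = [86.7, 81.5, 85.0, 85.5, 90.2, 95.4, 91.9,
     93.1, 94.8, 93.6, 94.8, 94.2, 94.8]
y = [81.8, 82.4, 84.3, 86.8, 83.6, 87.4, 93.1,
     93.1, 91.8, 93.7, 93.1, 94.3, 95.6]


def fit_line(x, y):
    """Least-squares intercept and slope for predicting y from x."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def sum_squared_residuals(a, b, x, y):
    """Sum of squared deviations of y from the line a + b * x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

best = sum_squared_residuals(a, b, x, y)
# Any other line gives a larger sum; two arbitrary perturbations:
worse_slope = sum_squared_residuals(a, b + 0.05, x, y)
worse_intercept = sum_squared_residuals(a + 0.5, b, x, y)

print(best < worse_slope, best < worse_intercept)  # True True
```

The residuals from the fitted line also sum to zero, up to floating-point rounding.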
Variance of the Estimate
• Total prediction error is expressed as the variance of the estimate (or mean-squared error):

In terms of population parameters:
$$\sigma^2_{\text{est } Y} = \frac{\sum (Y - \hat{Y})^2}{N}$$
Note that $\sigma^2_{\text{est } Y} \le \sigma^2_Y$. Equality applies only when $r = 0$.

In terms of sample statistics:
$$s^2_{\text{est } Y} = \frac{\sum (Y - \hat{Y})^2}{N - 2}$$

$\sigma_{\text{est } Y}$ ($s_{\text{est } Y}$) is called the standard error of the estimate.
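A sketch of both versions of the variance of the estimate, computed on the assignment marks. Treating the 13 marks as a whole population for the N-denominator version is an assumption made only to illustrate the formula. Variable names are illustrative.

```python
import math
import statistics

x = [86.7, 81.5, 85.0, 85.5, 90.2, 95.4, 91.9,
     93.1, 94.8, 93.6, 94.8, 94.2, 94.8]
y = [81.8, 82.4, 84.3, 86.8, 83.6, 87.4, 93.1,
     93.1, 91.8, 93.7, 93.1, 94.3, 95.6]

n = len(x)
mean_x, mean_y = statistics.mean(x), statistics.mean(y)
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b_yx = sxy / sxx
a_yx = mean_y - b_yx * mean_x
r = sxy / math.sqrt(sxx * syy)

# Sum of squared residuals about the regression line.
sse = sum((yi - (a_yx + b_yx * xi)) ** 2 for xi, yi in zip(x, y))

var_est_population = sse / n      # N denominator
var_est_sample = sse / (n - 2)    # N - 2 denominator
std_error_estimate = math.sqrt(var_est_sample)

print(f"variance of the estimate (sample): {var_est_sample:.2f}")
print(f"standard error of the estimate: {std_error_estimate:.2f}")
```

The population-style value never exceeds the population-style variance of Y, and equals that variance multiplied by 1 - r².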
Explained and Unexplained Variance
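Explained Variance: $\sigma^2_{\text{exp}} = \frac{1}{N} \sum (\hat{Y} - \mu_Y)^2$
Unexplained Variance: $\sigma^2_{\text{est } Y} = \frac{1}{N} \sum (Y - \hat{Y})^2$
[Figure: scatter plot "PSYC 6130 Section A 2005-2006" of Assignment 2 grade (vertical axis, 75% to 100%) against Assignment 1 grade (horizontal axis, 80% to 100%), with $Y$, $\hat{Y}$ and $\bar{Y}$ marked and the deviations labelled "Explained" and "Unexplained".]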
Summary of Variances
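Population:
Total Variance: $\sigma^2_Y = \frac{\sum (Y - \mu_Y)^2}{N}$
Explained Variance: $\sigma^2_{\text{exp}} = \frac{\sum (\hat{Y} - \mu_Y)^2}{N}$
Unexplained Variance: $\sigma^2_{\text{est } Y} = \frac{\sum (Y - \hat{Y})^2}{N}$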
Summary of Variances
Population:
• It can be shown that:
$$\sigma^2_Y = \sigma^2_{\text{exp}} + \sigma^2_{\text{est } Y}$$
• i.e., the total variance is equal to the sum of the explained and unexplained variances.
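The decomposition can be verified numerically. The sketch below treats the 13 assignment marks as a population (an assumption made only so that every variance uses the same N denominator) and checks that the explained and unexplained parts add up to the total. Variable names are illustrative.

```python
import statistics

x = [86.7, 81.5, 85.0, 85.5, 90.2, 95.4, 91.9,
     93.1, 94.8, 93.6, 94.8, 94.2, 94.8]
y = [81.8, 82.4, 84.3, 86.8, 83.6, 87.4, 93.1,
     93.1, 91.8, 93.7, 93.1, 94.3, 95.6]

n = len(x)
mean_x, mean_y = statistics.mean(x), statistics.mean(y)

# Least-squares line for predicting y from x.
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x
y_hat = [intercept + slope * xi for xi in x]

var_total = sum((yi - mean_y) ** 2 for yi in y) / n
var_explained = sum((yh - mean_y) ** 2 for yh in y_hat) / n
var_unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / n

print(f"total       = {var_total:.3f}")
print(f"explained   = {var_explained:.3f}")
print(f"unexplained = {var_unexplained:.3f}")
print(f"sum of parts = {var_explained + var_unexplained:.3f}")
```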
Summary of Variances
Sample:
Total Variance: $s^2_Y = \frac{\sum (Y - \bar{Y})^2}{N - 1}$
Explained Variance: $s^2_{\text{exp}} = s^2_Y - s^2_{\text{est } Y}$
Unexplained Variance: $s^2_{\text{est } Y} = \frac{\sum (Y - \hat{Y})^2}{N - 2}$
Coefficient of Determination
• The fraction of the total variance explained by the regression line is called the coefficient of determination.
• It can be shown that this is just the square of the Pearson coefficient r:
• Population:
$$r^2 = \frac{\sum (\hat{Y} - \mu_Y)^2}{\sum (Y - \mu_Y)^2} = 1 - \frac{\sigma^2_{\text{est } Y}}{\sigma^2_Y}$$
• Sample:
$$r^2 = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2} = 1 - \frac{N - 2}{N - 1} \cdot \frac{s^2_{\text{est } Y}}{s^2_Y}$$
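One way to see why the explained fraction equals r², sketched for the sample case. The starting point is the raw-score regression formula in terms of sample statistics, which gives $\hat{Y} - \bar{Y} = r \frac{s_Y}{s_X}(X - \bar{X})$. Squaring and summing over cases:

```latex
\begin{align*}
\sum (\hat{Y} - \bar{Y})^2
  &= r^2 \frac{s_Y^2}{s_X^2} \sum (X - \bar{X})^2 \\
  &= r^2 \frac{s_Y^2}{s_X^2} \, (N - 1)\, s_X^2 \\
  &= r^2 \, (N - 1)\, s_Y^2 \\
  &= r^2 \sum (Y - \bar{Y})^2 .
\end{align*}
```

Dividing both sides by $\sum (Y - \bar{Y})^2$ gives the ratio of explained to total sum of squares as $r^2$.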
Coefficient of Nondetermination
• The fraction of the total variance that remains unexplained by the regression line is called the coefficient of nondetermination.
• It can be shown that this is just $1 - r^2$:
• Population:
$$1 - r^2 = \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \mu_Y)^2} = \frac{\sigma^2_{\text{est } Y}}{\sigma^2_Y}$$
• Sample:
$$1 - r^2 = \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2} = \frac{N - 2}{N - 1} \cdot \frac{s^2_{\text{est } Y}}{s^2_Y}$$
Summary of Coefficients
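Population:
Coefficient of Determination:
$$r^2 = \frac{\sum (\hat{Y} - \mu_Y)^2}{\sum (Y - \mu_Y)^2} = 1 - \frac{\sigma^2_{\text{est } Y}}{\sigma^2_Y}$$
Coefficient of Nondetermination:
$$1 - r^2 = \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \mu_Y)^2} = \frac{\sigma^2_{\text{est } Y}}{\sigma^2_Y}$$

Sample:
Coefficient of Determination:
$$r^2 = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2} = 1 - \frac{N - 2}{N - 1} \cdot \frac{s^2_{\text{est } Y}}{s^2_Y}$$
Coefficient of Nondetermination:
$$1 - r^2 = \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2} = \frac{N - 2}{N - 1} \cdot \frac{s^2_{\text{est } Y}}{s^2_Y}$$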
Components of Variance: SPSS Output
ANOVA (b)

Model 1       Sum of Squares    df       Mean Square    F           Sig.
Regression    861347.2          1        861347.186     7465.139    .000 (a)
Residual      1325861           11491    115.383
Total         2187209           11492

a. Predictors: (Constant), How tall are you without your shoes on (in cm.)
b. Dependent Variable: How much do you weigh (in kilograms)

Explained SS: $\sum (\hat{Y} - \bar{Y})^2$
Unexplained SS: $\sum (Y - \hat{Y})^2$
Total SS: $\sum (Y - \bar{Y})^2$
Unexplained Variance: $s^2_{\text{est } Y} = \frac{\sum (Y - \hat{Y})^2}{N - 2}$
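The entries of the SPSS table can be tied back to the formulas with a few lines of arithmetic. This sketch uses only the numbers printed in the table; the variable names are illustrative, and small discrepancies from the printed F come from rounding in the displayed sums of squares.

```python
# Values copied from the SPSS ANOVA table (weight regressed on height).
ss_regression = 861347.2   # explained sum of squares
ss_residual = 1325861.0    # unexplained sum of squares
ss_total = 2187209.0
df_regression = 1
df_residual = 11491        # N - 2

# Coefficient of determination: explained SS over total SS.
r_squared = ss_regression / ss_total

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual   # sample variance of the estimate

f_ratio = ms_regression / ms_residual

print(f"r^2 = {r_squared:.4f}")
print(f"variance of the estimate = {ms_residual:.3f}")
print(f"F = {f_ratio:.1f}")
```

On these figures height accounts for roughly 39% of the variance in weight.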
Estimating the Variance of the Estimate
• Uncertainty in predictions can be estimated using the assumption of homoscedasticity.
– In other words, homogeneity of variance in Y over the range of X.
– (Etymology: hom- + Greek skedastikos, "able to disperse", from skedannynai, "to disperse")
– Thought question: does this also explain the origin of the verb skedaddle?
Confidence Intervals for Predictions
$$Y = \hat{Y} \pm t_{\text{crit}}\, s_{\text{est } Y} \sqrt{1 + \frac{1}{N} + \frac{(X - \bar{X})^2}{(N - 1)\, s_X^2}}$$
Example: 6130A 2005-06 Assignment marks

Assignment 1 (X)    Assignment 2 (Y)
86.7%               81.8%
81.5%               82.4%
85.0%               84.3%
85.5%               86.8%
90.2%               83.6%
95.4%               87.4%
91.9%               93.1%
93.1%               93.1%
94.8%               91.8%
93.6%               93.7%
94.8%               93.1%
94.2%               94.3%
94.8%               95.6%

Mean                90.9%    89.3%
Sample Std. Dev.    4.66%    5.04%

$r = 0.7998$
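A sketch of the confidence interval for a prediction, applied to these marks using the rounded summary statistics. Several things here are assumptions rather than values from the slides: the function name `prediction_interval`, the new Assignment 1 marks of 95% and 80%, and the critical value 2.201, which is the two-tailed .05 critical t for N - 2 = 11 degrees of freedom taken from a standard t table. The standard error of the estimate is obtained from r and s_Y through the sample form of the coefficient of nondetermination.

```python
import math


def prediction_interval(x_new, n, mean_x, sd_x, a_yx, b_yx, s_est, t_crit):
    """Return (lower, upper) limits for Y at a new X value."""
    y_hat = a_yx + b_yx * x_new
    half_width = t_crit * s_est * math.sqrt(
        1 + 1 / n + (x_new - mean_x) ** 2 / ((n - 1) * sd_x ** 2)
    )
    return y_hat - half_width, y_hat + half_width


# Rounded summary statistics from the slides (percent units).
n, r = 13, 0.7998
mean_x, mean_y, sd_x, sd_y = 90.9, 89.3, 4.66, 5.04

b_yx = r * sd_y / sd_x
a_yx = mean_y - b_yx * mean_x
# Standard error of the estimate from r and s_Y.
s_est = sd_y * math.sqrt((1 - r ** 2) * (n - 1) / (n - 2))

t_crit = 2.201  # assumed: two-tailed .05 critical t, df = 11

low, high = prediction_interval(95.0, n, mean_x, sd_x, a_yx, b_yx, s_est, t_crit)
low_mean, high_mean = prediction_interval(mean_x, n, mean_x, sd_x, a_yx, b_yx, s_est, t_crit)
low_far, high_far = prediction_interval(80.0, n, mean_x, sd_x, a_yx, b_yx, s_est, t_crit)

print(f"X = 95%: {low:.1f}% to {high:.1f}%")
print(f"X at mean: {low_mean:.1f}% to {high_mean:.1f}%")
```

The interval is narrowest when X equals its mean and widens as X moves away from it, because of the $(X - \bar{X})^2$ term under the square root.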
Underlying Assumptions
• Independent random sampling
• Linearity
• Normal Distribution
• Homoscedasticity
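Regressing X on Y
• Simply reverse the formulae, e.g.,

In terms of sample statistics:
$$\hat{X} = a_{XY} + b_{XY} Y \quad \text{or} \quad \hat{X} = \bar{X} + r \frac{s_X}{s_Y} (Y - \bar{Y})$$
where
$$b_{XY} = r \frac{s_X}{s_Y}, \qquad a_{XY} = \bar{X} - b_{XY}\, \bar{Y}$$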
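A short sketch of the reversed regression, using the rounded summary statistics for the assignment marks (r = 0.7998, means 90.9% and 89.3%, sample SDs 4.66% and 5.04%). Variable names are illustrative. It also shows that the X-on-Y line is not simply the Y-on-X line inverted.

```python
# Rounded summary statistics from the slides (percent units).
r = 0.7998
mean_x, mean_y = 90.9, 89.3
sd_x, sd_y = 4.66, 5.04

# Regressing Y on X.
b_yx = r * sd_y / sd_x
a_yx = mean_y - b_yx * mean_x

# Regressing X on Y: swap the roles of X and Y.
b_xy = r * sd_x / sd_y
a_xy = mean_x - b_xy * mean_y

print(f"Y on X: slope {b_yx:.3f}, intercept {a_yx:.2f}")
print(f"X on Y: slope {b_xy:.3f}, intercept {a_xy:.2f}")
print(f"product of slopes: {b_yx * b_xy:.4f}, r squared: {r ** 2:.4f}")
```

The product of the two slopes equals r², so the two lines coincide only when r is 1 or -1.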
When to Use Linear Regression
• Prediction
• Statistical Control
– Adjust for effects of confounding variable.
– Also known as partialing out the effect of the confounding variable.
• Experimental Psychology: modeling effect of continuous independent variable on continuous dependent variable.
– e.g., reaction time vs set size in visual search.
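For the statistical control use listed above, one common way to partial out a confounding variable is to regress the outcome on the confound and carry the residuals forward into the later analysis. The sketch below illustrates that idea only. The function name `residualize`, the variable names, and all data values are hypothetical, invented for illustration, and are not taken from any survey.

```python
import statistics


def residualize(outcome, confound):
    """Residuals of the outcome after regressing it on the confound."""
    mean_o, mean_c = statistics.mean(outcome), statistics.mean(confound)
    slope = (sum((c - mean_c) * (o - mean_o) for c, o in zip(confound, outcome))
             / sum((c - mean_c) ** 2 for c in confound))
    intercept = mean_o - slope * mean_c
    return [o - (intercept + slope * c) for c, o in zip(confound, outcome)]


# Hypothetical data: values invented purely for illustration.
confound = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
outcome = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

adjusted = residualize(outcome, confound)

# The adjusted scores have no remaining linear relation to the confound.
mean_c = statistics.mean(confound)
cross_products = sum((c - mean_c) * a for c, a in zip(confound, adjusted))
print(abs(cross_products) < 1e-9)  # True
```

Group comparisons can then be run on the adjusted scores rather than on the raw outcome.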
Statistical Control Example: Mental Health
Women report more bad mental health days than men, t(8176)=-7.1, p<.001, 2-tailed.
Statistical Control Example: Physical Health
Correlation
Pearson’s r = 0.31
After Partialing Out Physical Health
Result of Partialing Out Physical Health
Controlling for physical health, women report more bad mental health days than men, t(8176)=-5.7, p<.001, 2-tailed.