Lecture 7: Regression and Correlation
TRANSCRIPT

8/11/2019 Lecture 7 Regression and Correlation

Correlation and Regression
Correlation and Regression
The test you choose depends on level of measurement:

Independent                   Dependent              Test
Dichotomous                   Interval-Ratio         Independent Samples t-test
Nominal/Dichotomous           Interval-Ratio         ANOVA
Nominal/Dichotomous           Nominal/Dichotomous    Cross Tabs
Interval-Ratio/Dichotomous    Interval-Ratio         Bivariate Regression/Correlation
Correlation and Regression
Bivariate regression is a technique that fits a straight line as close as possible to all the coordinates of two continuous variables plotted on a two-dimensional graph, in order to summarize the relationship between the variables.

Correlation is a statistic that assesses the strength and direction of association of two continuous variables. It is created through a technique called regression.
Bivariate Regression
For example:

A criminologist may be interested in the relationship between income and number of children in a family, or self-esteem and criminal behavior.

Independent Variables:    Dependent Variables:
Family Income             Number of Children
Self-esteem               Criminal Behavior
Bivariate Regression
For example:

Research Hypotheses:
As family income increases, the number of children in families declines (negative relationship).
As self-esteem increases, reports of criminal behavior increase (positive relationship).

Independent Variables:    Dependent Variables:
Family Income             Number of Children
Self-esteem               Criminal Behavior
Bivariate Regression
For example:

Null Hypotheses:
There is no relationship between family income and the number of children in families. The relationship statistic b = 0.
There is no relationship between self-esteem and criminal behavior. The relationship statistic b = 0.

Independent Variables:    Dependent Variables:
Family Income             Number of Children
Self-esteem               Criminal Behavior
Bivariate Regression

Let's look at the relationship between self-esteem and criminal behavior.

Regression starts with plots of coordinates of variables in a hypothesis (although you will hardly ever plot your data in reality).

The data:
Each respondent has filled out a self-esteem assessment and reported the number of crimes committed.
Bivariate Regression

[Scatterplot: Y = crimes (0-10) plotted against X = self-esteem (10-40).]

What do you think the relationship is?
Bivariate Regression

[Scatterplot: Y = crimes (0-10) plotted against X = self-esteem (10-40).]

Is it positive? Negative? No change?
Bivariate Regression

[Scatterplot: Y = crimes (0-10) plotted against X = self-esteem (10-40).]

Regression is a procedure that fits a line to the data. The slope of that line acts as a model for the relationship between the plotted variables.
Bivariate Regression

[Scatterplot with three example lines drawn through the data.]

The slope of a line is the change in the corresponding Y value for each unit increase in X (rise over run).

Slope = 0: No relationship!
Slope = 0.2: Positive relationship!
Slope = -0.2: Negative relationship!
Bivariate Regression
The mathematical equation for a line:

Y = mX + b

Where: Y = the line's position on the vertical axis at any point
X = the line's position on the horizontal axis at any point
m = the slope of the line
b = the intercept with the Y axis, where X equals zero
Bivariate Regression
The statistics equation for a line:

Ŷ = a + bX

Where: Ŷ = the line's position on the vertical axis at any point (predicted value of the dependent variable)
X = the line's position on the horizontal axis at any point (value of the independent variable)
b = the slope of the line (called the coefficient)
a = the intercept with the Y axis, where X equals zero
Bivariate Regression
The next question:
How do we draw the line?

Our goal for the line:
Fit the line as close as possible to all the data points for all values of X.
Bivariate Regression

[Scatterplot: Y = crimes (0-10) plotted against X = self-esteem (10-40).]

How do we minimize the distance between a line and all the data points?
Bivariate Regression
How do we minimize the distance between a line and all the data points?

You already know of a statistic that minimizes the distance between itself and all data values for a variable: the mean!

The mean minimizes the sum of squared deviations. It is where deviations sum to zero and where the squared deviations are at their lowest value: Σ(Y − Ȳ)²
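The claim above can be checked numerically. A minimal sketch (not from the lecture, using made-up numbers): compare the sum of squared deviations around the mean with the sum around every other candidate center on a fine grid.

```python
# Illustrative check: the mean minimizes the sum of squared deviations.
y = [0, 1, 1, 2, 3, 5, 8]          # hypothetical data
mean_y = sum(y) / len(y)

def ssd(center):
    """Sum of squared deviations of y around a candidate center."""
    return sum((yi - center) ** 2 for yi in y)

# Try centers on a fine grid from 0 to 10; none beats the mean.
candidates = [i / 100 for i in range(0, 1001)]
best = min(candidates, key=ssd)
print(round(mean_y, 2), round(best, 2))  # both ≈ 2.86
```

The grid's best center lands on the grid point nearest the mean, as the quadratic shape of the sum of squared deviations guarantees.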
Bivariate Regression
The mean minimizes the sum of squared deviations. It is where deviations sum to zero and where the squared deviations are at their lowest value.

Take this principle and fit the line to the place where squared deviations (on Y) from the line are at their lowest value (across all X's):

Σ(Y − Ŷ)², where Ŷ = the value on the line
Bivariate Regression
There are several lines that you could draw where the deviations would sum to zero...

Minimizing the sum of squared errors gives you the unique, best-fitting line for all the data points. It is the line that is closest to all points.

Ŷ (Y-hat) = Y value for the line at any X
Y = case value on variable Y
Y − Ŷ = residual
Σ(Y − Ŷ) = 0; therefore, we use Σ(Y − Ŷ)² and minimize that!
Bivariate Regression

[Scatterplot illustrating Y − Ŷ, the vertical distance between each data point (Yi, the actual Y value at an actual X) and the line (Ŷi, the line's level on Y at that X).]

Y = 10, Ŷ = 5: residual = 5
Y = 0, Ŷ = 4: residual = -4
Bivariate Regression

[Scatterplot illustrating (Y − Ŷ)², the squared vertical distance between each data point (Yi, the actual Y value at an actual X) and the line (Ŷi, the line's level on Y at that X).]

(Yi − Ŷ)² = squared deviation
Y = 10, Ŷ = 5: residual = 5, squared = 25
Y = 0, Ŷ = 4: residual = -4, squared = 16
Bivariate Regression

[Scatterplot with the fitted line drawn through the data.]

The fitted line for our example has the equation:

Ŷ = 6 − .2X

(In the form Ŷ = a + bX, where e = the distance from the line to the data points, or error.)

If you were to draw any other line, it would not minimize Σ(Y − Ŷ)².
Bivariate Regression
We use Σ(Y − Ŷ)² and minimize that!

There is a simple, elegant formula for discovering the line that minimizes the sum of squared errors:

b = Σ((X − X̄)(Y − Ȳ)) / Σ(X − X̄)²
a = Ȳ − bX̄
Ŷ = a + bX

This is the method of least squares. It gives our least squares estimate and indicates why we call this technique "ordinary least squares" or OLS regression.
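The least-squares formulas above can be applied directly. A minimal sketch (the data are hypothetical, chosen so the fit reproduces the lecture's example line Ŷ = 6 − .2X):

```python
# Ordinary least squares from the formulas above, on made-up data.
x = [10, 20, 30, 10, 20, 30]   # X, self-esteem (hypothetical)
y = [5, 2, 0, 3, 2, 0]         # Y, crimes (hypothetical)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# b = Σ((X − X̄)(Y − Ȳ)) / Σ(X − X̄)²
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)

# a = Ȳ − bX̄
a = y_bar - b * x_bar

y_hat = [a + b * xi for xi in x]   # predicted values on the line

print(b, a)  # → -0.2 6.0, the line Ŷ = 6 − .2X
```

As the earlier slide noted, the residuals around this fitted line sum to zero; any other line would produce a larger sum of squared errors.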
Bivariate Regression

[Plot: Y (0-10) against a dichotomous X (0, 1).]

Considering that a regression line minimizes Σ(Y − Ŷ)², where would the regression line cross for an interval-ratio variable regressed on a dichotomous independent variable?

For example:
0 = Men: Mean = 6
1 = Women: Mean = 4
Bivariate Regression

[Plot: Y (0-10) against a dichotomous X (0, 1), with the line running between the two group means.]

The difference of means will be the slope. This is the same number that is tested for significance in an independent samples t-test.

0 = Men: Mean = 6
1 = Women: Mean = 4
Slope = -2; Ŷ = 6 − 2X
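A quick sketch (not from the lecture, with hypothetical scores) showing that regressing Y on a 0/1 dummy reproduces the group means: the intercept is the mean of the group coded 0, and the slope is the difference of means.

```python
# Dummy-variable regression: intercept = mean(group 0), slope = difference of means.
men =   [7, 5, 6, 6]   # coded X = 0, hypothetical scores, mean = 6
women = [4, 3, 5, 4]   # coded X = 1, hypothetical scores, mean = 4

x = [0] * len(men) + [1] * len(women)
y = men + women

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

print(a, b)  # → 6.0 -2.0, the line Ŷ = 6 − 2X from the slide
```

The slope of -2 is exactly the women's mean minus the men's mean, which is the same quantity an independent samples t-test evaluates.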
Correlation
This lecture has covered how to model the relationship between two variables with regression.

Another concept is strength of association. Correlation provides that.
Correlation

[Scatterplot: Y = crimes (0-10) against X = self-esteem (10-40), with the fitted line.]

So our equation is:
Ŷ = 6 − .2X

The slope tells us the direction of association. How strong is that?
Correlation

[Scatterplot: points widely scattered around a downward-sloping line.]

Example of Low Negative Correlation

When there is a lot of difference on the dependent variable across subjects at particular values of X, there is NOT as much association (weaker).
Correlation

[Scatterplot: points tightly clustered around a downward-sloping line.]

Example of High Negative Correlation

When there is little difference on the dependent variable across subjects at particular values of X, there is MORE association (stronger).
Correlation
To find the strength of the relationship between two variables, we need correlation. The correlation is the standardized slope: it refers to the standard deviation change in Y when you go up a standard deviation in X.
Correlation
The correlation is the standardized slope: it refers to the standard deviation change in Y when you go up a standard deviation in X.

Recall the s.d. of X:  Sx = √( Σ(X − X̄)² / (n − 1) )
and the s.d. of Y:     Sy = √( Σ(Y − Ȳ)² / (n − 1) )

Pearson correlation:  r = b (Sx / Sy)
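A small sketch (hypothetical data, not from the slides) confirming that the "standardized slope" r = b(Sx/Sy) matches Pearson's r computed directly from the sums of squares.

```python
# The standardized slope equals the directly computed Pearson correlation.
import math

x = [10, 20, 30, 10, 20, 30]   # hypothetical
y = [5, 2, 0, 3, 2, 0]         # hypothetical

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sxx                       # OLS slope
s_x = math.sqrt(sxx / (n - 1))      # s.d. of X
s_y = math.sqrt(syy / (n - 1))      # s.d. of Y

r_from_slope = b * (s_x / s_y)          # standardized slope
r_direct = sxy / math.sqrt(sxx * syy)   # direct Pearson formula

print(round(r_from_slope, 4), round(r_direct, 4))  # the two agree
```

Algebraically, b(Sx/Sy) = (Sxy/Sxx)·√(Sxx/Syy) = Sxy/√(Sxx·Syy), which is the direct formula; the (n − 1) terms in the two standard deviations cancel.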
Correlation
The Pearson Correlation, r:
tells the direction and strength of the relationship between continuous variables
ranges from -1 to +1
is + when the relationship is positive and - when the relationship is negative
the higher the absolute value of r, the stronger the association
a standard deviation change in X corresponds with r standard deviation change in Y
Correlation
The Pearson Correlation, r:

The Pearson correlation is a descriptive statistic that serves as an inferential statistic too:

t(n−2) = (r − 0) / √( (1 − r²) / (n − 2) )

When it is significant, there is a relationship in the population that is not equal to zero!
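The t formula above is easy to apply by hand. A sketch (the r and n values are hypothetical, not from the slides):

```python
# t statistic (df = n − 2) for H0: population correlation = 0.
import math

def t_for_r(r, n):
    """t = (r − 0) / sqrt((1 − r²) / (n − 2))."""
    return (r - 0) / math.sqrt((1 - r ** 2) / (n - 2))

# Example: r = -0.50 from a sample of n = 30 (df = 28)
t = t_for_r(-0.50, 30)
print(round(t, 3))  # → -3.055, beyond the two-tailed .05 cutoff of about ±2.05
```

Since |t| exceeds the critical value, an r of -0.50 with n = 30 would be judged significantly different from zero at the .05 level.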
Error Analysis
Ŷ = a + bX. This equation gives the conditional mean of Y at any given value of X.

So in reality, our line gives us the expected mean of Y given each value of X.

The line's equation tells you how the mean on your dependent variable changes as your independent variable goes up.

[Plot: Y against X with the line Ŷ running through the conditional means.]
Error Analysis

As you know, every mean has a distribution around it, so there is a standard deviation. This is true for conditional means as well. So, you also have a conditional standard deviation.

The Conditional Standard Deviation, or Root Mean Square Error, equals the approximate average deviation from the line:

√( SSE / (n − 2) ) = √( Σ(Y − Ŷ)² / (n − 2) )

[Plot: Y against X with the line Ŷ and the spread of points around it.]
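A minimal sketch (hypothetical data, with the fitted line Ŷ = 6 − .2X from the earlier example) of the root mean square error formula above:

```python
# Conditional standard deviation / RMSE: sqrt(SSE / (n − 2)).
import math

x = [10, 20, 30, 10, 20, 30]   # hypothetical
y = [5, 2, 0, 3, 2, 0]         # hypothetical

# Fitted line for these data: Ŷ = 6 − 0.2X
a, b = 6.0, -0.2
y_hat = [a + b * xi for xi in x]

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # Σ(Y − Ŷ)²
rmse = math.sqrt(sse / (len(y) - 2))

print(sse, round(rmse, 3))  # → 2.0 0.707
```

So on average the observed Y values sit roughly 0.7 units from their conditional means on the line.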
Error Analysis

The Assumption of Homoskedasticity:
The variation around the line is the same no matter the X.
The conditional standard deviation is for any given value of X.

If there is a relationship between X and Y, the conditional standard deviation is going to be less than the standard deviation of Y. If this is so, you have improved prediction of the mean value of Y by looking at each level of X.

If there were no relationship, the conditional standard deviation would be the same as the original, and the regression line would be flat at the mean of Y.

[Plot: Y against X contrasting the conditional standard deviation around the line with the original standard deviation around the mean of Y.]
Error Analysis
So guess what?

We have a way to determine how much our understanding of Y is improved when taking X into account. It is based on the fact that conditional standard deviations should be smaller than Y's original standard deviation.
Error Analysis

Proportional Reduction in Error

Let's call the variation around the mean in Y "Error 1."
Let's call the variation around the line when X is considered "Error 2."
But rather than going all the way to standard deviation to determine error, let's just stop at the basic measure, the Sum of Squared Deviations.

Error 1 (E1) = Σ(Y − Ȳ)², also called the Sum of Squares
Error 2 (E2) = Σ(Y − Ŷ)², also called the Sum of Squared Errors

[Plot: Y against X showing Error 1 (distances to the mean of Y) and Error 2 (distances to the line).]
R-Squared

Proportional Reduction in Error

To determine how much taking X into consideration reduces the variation in Y (at each level of X), we can use a simple formula:

(E1 − E2) / E1

which tells us the proportion or percentage of original error that is explained by X.

Error 1 (E1) = Σ(Y − Ȳ)²
Error 2 (E2) = Σ(Y − Ŷ)²

[Plot: Y against X showing Error 1 and Error 2.]
R-squared

r² = (E1 − E2) / E1
   = (TSS − SSE) / TSS
   = ( Σ(Y − Ȳ)² − Σ(Y − Ŷ)² ) / Σ(Y − Ȳ)²

r² is called the coefficient of determination.
It is also the square of the Pearson correlation.

[Plot: Y against X showing Error 1 and Error 2.]
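A sketch (hypothetical data, not from the slides) of r² = (TSS − SSE) / TSS, checked against the square of the directly computed Pearson r:

```python
# r² as proportional reduction in error, and as the squared Pearson r.
import math

x = [10, 20, 30, 10, 20, 30]   # hypothetical
y = [5, 2, 0, 3, 2, 0]         # hypothetical

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar

tss = sum((yi - y_bar) ** 2 for yi in y)                      # Error 1: to the mean
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))   # Error 2: to the line

r_squared = (tss - sse) / tss
r = sxy / math.sqrt(sxx * tss)

print(round(r_squared, 4), round(r ** 2, 4))  # → 0.8889 0.8889
```

Here taking X into account removes about 89% of the original squared error in Y, and that proportion is exactly the square of the correlation.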
R-Squared

R²:
Is the improvement obtained by using X (and drawing a line through the conditional means) in getting as near as possible to everybody's value for Y, over just using the mean for Y alone.
Falls between 0 and 1.
An R² of 1 means an exact fit (and there is no variation of scores around the regression line).
An R² of 0 means no relationship (and as much scatter as in the original Y variable, with a flat regression line through the mean of Y).
Would be the same for X regressed on Y as for Y regressed on X.
Can be interpreted as the percentage of variability in Y that is explained by X.

Some people get hung up on maximizing R², but this is too bad, because any effect is still a finding. A small R² only indicates that you haven't told the whole (or much of the) story with your variable.
Error Analysis, SPSS

Some SPSS output (Anti-Gay Marriage regressed on Age):

r² = ( Σ(Y − Ȳ)² − Σ(Y − Ŷ)² ) / Σ(Y − Ȳ)²
   = 196.886 / 2853.286 = .069

Line to the mean: Regression Sum of Squares (196.886)
Data points to the line: Residual Sum of Squares
Data points to the mean: original Sum of Squares for Anti-Gay Marriage (2853.286)
Error Analysis

Some SPSS output (Anti-Gay Marriage regressed on Age):

r² = ( Σ(Y − Ȳ)² − Σ(Y − Ŷ)² ) / Σ(Y − Ȳ)²
   = 196.886 / 2853.286 = .069

[Plot: Anti-Gay Marriage (1 = Strong Support, 2 = Support, 3 = Neutral, 4 = Oppose, 5 = Strong Oppose) against Age (0-89), with the fitted line and the mean, M = 2.98.]

The colored lines are examples of:
Distance from each person's data point to the line or model: new, still unexplained error.
Distance from the line or model to the mean for each person: reduction in error.
Distance from each person's data point to the mean: the original variable's error.
Dichotomous Variables

[Plot: Y (0-10) against a dichotomous X (0, 1), showing BSS, WSS, and TSS around the overall mean of 5.]

Using a dichotomous independent variable, the ANOVA table in bivariate regression will have the same numbers and ANOVA results as a one-way ANOVA table would (and compare this with an independent samples t-test).

0 = Men: Mean = 6
1 = Women: Mean = 4
Overall Mean = 5
Slope = -2; Ŷ = 6 − 2X
Regression, Inferential Statistics

Recall that statistics are divided between descriptive and inferential statistics.

Descriptive:
The equation for your line is a descriptive statistic. It tells you the real, best-fitted line that minimizes squared errors.

Inferential:
But what about the population? What can we say about the relationship between your variables in the population? The inferential statistics are estimates based on the best-fitted line.
Regression, Inferential Statistics

The significance of F, you already understand.

The ratio of Regression (line to the mean of Y) to Residual (line to data point) Sums of Squares, each divided by its degrees of freedom, forms an F ratio in repeated sampling.

Null: r² = 0 in the population. If F exceeds critical F, then your variables have a relationship in the population (X explains some of the variation in Y).

F = (Regression SS / df regression) / (Residual SS / df residual)

[F distribution with the most extreme 5% of Fs shaded.]
Regression, Inferential Statistics

What about the Slope or Coefficient?

From sample to sample, different slopes would be obtained. The slope has a sampling distribution that is normally distributed. So we can do a significance test.

[Standard normal curve, z from -3 to 3.]
Regression, Inferential Statistics

Conducting a Test of Significance for the Slope of the Regression Line

By slapping the sampling distribution for the slope over a guess of the population's slope, H0, one determines whether a sample could have been drawn from a population where the slope is equal to H0.

1. Two-tailed significance test for α-level = .05
2. Critical t = +/- 1.96
3. To find if there is a significant slope in the population:
   H0: β = 0
   Ha: β ≠ 0
4. Collect Data
5. Calculate t (z): t = (b − β0) / s.e., where s.e. = √( Σ(Y − Ŷ)² / (n − 2) ) / √( Σ(X − X̄)² )
6. Make a decision about the null hypothesis
7. Find the P-value
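Steps 4 and 5 can be sketched numerically. This example uses hypothetical data (not the lecture's), applying the standard error and t formulas above:

```python
# Standard error of the slope and its t statistic, per the steps above.
import math

x = [10, 20, 30, 10, 20, 30, 15, 25]   # hypothetical
y = [5, 2, 0, 3, 2, 0, 4, 1]           # hypothetical

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # Σ(Y − Ŷ)²

# s.e. = sqrt(SSE / (n − 2)) / sqrt(Σ(X − X̄)²)
se = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)

t = (b - 0) / se   # test against H0: β = 0
print(round(b, 3), round(se, 3), round(t, 2))
```

Here |t| far exceeds the critical value of 1.96, so with these data the null hypothesis of a zero population slope would be rejected.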
Correlation and Regression

Back to the SPSS output:

The standard error and t appear on the SPSS output, and the p-value too!
Correlation and Regression

Back to the SPSS output:

Ŷ = 1.88 + .023X

So in the GSS example, the slope is significant. There is evidence of a positive relationship in the population between Age and Anti-Gay Marriage sentiment. 6.9% of the variation in Marriage attitude is explained by Age. The older Americans get, the more likely they are to oppose gay marriage.

A one-year increase in age elevates anti-gay-marriage attitudes by .023 scale units. There is a weak positive relationship.