l10-EDU5950 Simple Regression Analysis (transcript, 8/17/2019)

EDU5950, SEM2 2010-11
CORRELATION & SIMPLE REGRESSION
Correlation - Test of association

- A correlation measures the "degree of association" between two variables (interval or ordinal).
- Associations can be positive (an increase in one variable is associated with an increase in the other) or negative (an increase in one variable is associated with a decrease in the other).
- Correlation is measured by "r" (parametric, Pearson's) or "ρ" (rho; non-parametric, Spearman's).
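The two coefficients can be sketched in a few lines of Python. This is a minimal illustration, not part of the slides; the dose/symptom numbers are made up to show a negative association.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r: co-deviation of x and y scaled by the product of their spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sp = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    ssx = sum((xi - mx) ** 2 for xi in x)
    ssy = sum((yi - my) ** 2 for yi in y)
    return sp / sqrt(ssx * ssy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks (no ties assumed here)."""
    rank = lambda v: [sorted(v).index(vi) + 1 for vi in v]
    return pearson_r(rank(x), rank(y))

dose = [10, 20, 30, 40, 50]          # hypothetical interval data
symptoms = [95, 80, 70, 55, 40]      # falls as dose rises -> negative association
print(pearson_r(dose, symptoms))     # close to -1
print(spearman_rho(dose, symptoms))  # -1.0: the ranks are perfectly reversed
```

Note that ρ only sees the ordering, which is why it comes out exactly -1 here while r does not.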
Test of association - Correlation

- Compare two continuous variables in terms of degree of association
  e.g. attitude scale vs behavioural frequency
[Scatterplots: positive association (left) and negative association (right)]
Test of association - Correlation
[Scatterplots: high correlation (left) vs low correlation (right)]

- Test statistic is "r" (parametric) or "ρ" (non-parametric):
  0 (random distribution, zero correlation)
  1 (perfect correlation)
Test of association - Correlation
[Scatterplots: high correlation (left) vs zero correlation (right)]

- Test statistic is "r" (parametric) or "ρ" (non-parametric):
  0 (random distribution, zero correlation)
  1 (perfect correlation)
Regression & Correlation

- A correlation measures the "degree of association" between two variables (interval (50, 100, 150, ...) or ordinal (1, 2, 3, ...)).
- Associations can be positive (an increase in one variable is associated with an increase in the other) or negative (an increase in one variable is associated with a decrease in the other).
Example: Symptom Index vs Drug A

- "Best fit line"
- Allows us to describe the relationship between variables more accurately.
- We can now predict specific values of one variable from knowledge of the other.
- All points are close to the line.

[Graph Three: Relationship between Symptom Index and Drug A (with best-fit line); X-axis: Drug A (dose in mg), Y-axis: Symptom Index]
Example: Symptom Index vs Drug B

[Graph Four: Relationship between Symptom Index and Drug B (with best-fit line); X-axis: Drug B (dose in mg), Y-axis: Symptom Index]

- We can still predict specific values of one variable from knowledge of the other.
- Will predictions be as accurate? Why not?
- "Residuals"
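The residuals just mentioned can be made concrete: a residual is the observed Y minus the Y the best-fit line predicts. A minimal sketch, with an assumed line and invented Drug B readings (none of these numbers come from the slides):

```python
# Residual = observed Y minus the Y predicted by the best-fit line.
# Hypothetical best-fit line for the Drug B example: yhat = 160 - 0.5*X.
a, b = 160.0, -0.5                     # assumed intercept and slope
doses    = [50, 100, 150, 200]
observed = [140, 100, 95, 55]          # made-up symptom-index readings
predicted = [a + b * x for x in doses]
residuals = [y - yh for y, yh in zip(observed, predicted)]
print(predicted)   # [135.0, 110.0, 85.0, 60.0]
print(residuals)   # [5.0, -10.0, 10.0, -5.0]
```

The larger the residuals, the less accurate predictions from the line will be, which is exactly why Drug B predicts less well than Drug A.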
Regression

- Regression analysis procedures have as their primary purpose the development of an equation that can be used for predicting values on some DV for all members of a population.
- A secondary purpose is to use regression analysis as a means of explaining causal relationships among variables.
" The most basic application of regression analysis is the
bivariate situation, to which is referred as simple linear
regression, or just simple regression.
" Simple regression involves a single IV and a single DV.
" Goal: to obtain a linear equation so that we can predict the
value of the DV if we have the value of the IV.
" Simple regression capitalizes on the correlation between the
DV and IV in order to make specific predictions about theDV.
" The correlation tells us how much information about the
DV is contained in the IV.
" If the correlation is perfect (i.e r = ±1.00), the IV contains
everything we need to know about the DV, and we will
be able to perfectly predict one from the other.
" Regression analysis is the means by which we determine
the best-fitting line, called the regression line.
" Regression line is the straight line that lies closest to all
points in a given scatterplot
" This line sometimes pass through the centroid of the
scatterplot.
" 3 important facts about the regression line must beknown:
" The extent to which points are scattered around the line
"
The slope of the regression line
" The point at which the line crosses the Y-axis
" The extent to which the points are scattered around theline is typically indicated by the degree of relationship
between the IV (X) and DV (Y)."
This relationship is measured by a correlationcoefficient – the stronger the relationship, the higher thedegree of predictability between X and Y.
" The degree of slope is determined by the amount
of change in Y that accompanies a unit change in
X.
" It is the slope that largely determines the predicted
values of Y from known values for X.
" It is important to determine exactly where the
regression line crosses the Y-axis (this value is
known as the Y-intercept).
" The regression line is essentially an equation that
express Y as a function of X.
"
The basic equation for simple regression is:
" Y = a + b X
" where Y is the predicted value for the DV,
" X is the known raw score value on the IV,
" b is the slope of the regression line
" a is the Y-intercept
Simple Linear Regression

- Purpose:
  - determine the relationship between two metric variables
  - predict the value of the dependent variable (Y) based on the value of the independent variable (X)
- Requirements:
  - DV: interval / ratio
  - IV: interval / ratio
  - The independent and dependent variables are normally distributed in the population
  - The cases represent a random sample from the population
Simple Regression: How best to summarise the data?

[Scatterplots: Symptom Index vs Drug A (dose in mg), without and with a best-fit line]

Adding a best-fit line allows us to describe data simply.

General Linear Model (GLM): How best to summarise the data?

- Establish the equation for the best-fit line:

  Y = a + bX

  where: a = Y-intercept (constant)
         b = slope of the best-fit line
         Y = dependent variable
         X = independent variable
"
For simple regression, R 2 is the square of the correlationcoefficient
"
Reflects variance accounted for in data by the best-fit line
" Takes values between 0 (0%) and 1 (100%)
"
Frequently expressed as percentage, rather than decimal
" High values show good fit, low values show poor fit
Simple Regression R2 - “Goodness of fit”
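This relationship (R² = r² in simple regression) is easy to check numerically. A small sketch with invented data; the first series is perfectly linear, the second has scatter:

```python
# For simple regression, R^2 is just the square of Pearson's r: the share
# of Y's variance explained by the best-fit line.
from math import sqrt

def r_squared(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sp = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    r = sp / sqrt(ssx * ssy)
    return r * r

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]          # perfectly linear -> R^2 = 1 (100%)
y2 = [2, 5, 4, 9, 8]          # same trend with scatter -> R^2 between 0 and 1
print(r_squared(x, y))        # 1.0
print(round(r_squared(x, y2), 3))  # 0.771
```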
" R 2 = 0
"
(0% - randomly scattered
points, no apparent
relationship between Xand Y)
" Implies that a best-fit line
will be a very poor
description of data0
50
100
150
200
250
300
0 100 200 300
IV (regressor, predictor)
DV
Simple Regression Low values of R2
" R 2 = 1
" (100% - points lie directly
on the line - perfect
relationship between X
and Y)"
Implies that a best-fit line
will be a very good
description of data
0
50
100
150
200
250
300
0 100 200 300
IV
D V
0
50
100
150
200
250
0 50 100 150 200 250
IV
D V
Simple Regression High values of R2
Simple Regression: R² - "Goodness of fit"

[Scatterplots: Symptom Index vs Drug A (dose in mg) and Symptom Index vs Drug B (dose in mg)]

Good fit (Drug A): R² high, high variance explained.
Moderate fit (Drug B): R² lower, less variance explained.
Problem: to draw a straight line through the points that best explains the variance.

[Scatterplot with candidate best-fit line]

The line can then be used to predict Y from X.
Example: Symptom Index vs Drug A (recap)

- "Best fit line": allows us to describe the relationship between variables more accurately.
- We can now predict specific values of one variable from knowledge of the other.
- All points are close to the line.

[Graph Three: Relationship between Symptom Index and Drug A (with best-fit line); X-axis: Drug A (dose in mg), Y-axis: Symptom Index]
" Establish equation for the best-fit line:
Y = a + b X
! Best-fit line same as regression line
! b is the regression coefficient for x
! x is the predictor or regressor variable for y
Regression
Regression - Types
Linear Regression - Model

- Population model: Y_i = β₀ + β₁X_i + ε_i
- Sample regression line: Ŷ = a + bX
  (a is the constant/intercept; a and b are the regression coefficients)

Parameters

- The population parameters β₀ and β₁ are simply the least-squares estimates computed on all the members of the population, not just the sample.
- Population parameters: β₀ and β₁
- Sample statistics: a and b
Inference About the Population Slope and Intercept

- If β₁ ≠ 0, then we have a graph like this:

  [Graph: sloped line Y = β₀ + β₁X]

- The height of the line at a given X is the mean of Y for those whose independent variable is X.
Copyright (c) Bani K. Mallick

Inference About the Population Slope and Intercept

- If β₁ = 0, then we have a graph like this:

  [Graph: horizontal line Y = β₀]

- Note how the mean of Y does not depend on X: Y and X are independent.
Linear Regression and Correlation

- If β₁ = 0, then Y and X are independent.
- So, we can test the null hypothesis that Y and X are independent by testing:
  H₀: β₁ = 0
  H₁: β₁ ≠ 0
- The p-value in regression tables tests this hypothesis.
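The test behind that p-value is usually a t test on the slope: t = b / SE(b), with SE(b) = sqrt(MSE / SSx) and df = n − 2. A minimal sketch with invented data (a strong linear trend, so |t| should come out far beyond the 0.05 critical value of about 2.78 at df = 4):

```python
from math import sqrt

# t test for the slope: H0: beta1 = 0 vs H1: beta1 != 0.
# t = b / SE(b), where SE(b) = sqrt(MSE / SSx) and df = n - 2.
def slope_t(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sp = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    ssx = sum((xi - xbar) ** 2 for xi in x)
    b = sp / ssx
    a = ybar - b * xbar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)                 # residual mean square
    return b / sqrt(mse / ssx)

# Invented data with a strong linear trend and mild noise.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]
print(slope_t(x, y))   # large positive t -> reject H0 at the 0.05 level
```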
Ice Cream Example

X (Temperature)   Y (Sales)
63                1.52
70                1.68
73                1.80
75                2.05
80                2.36
82                2.25
85                2.68
88                2.90
90                3.14
91                3.06
92                3.24
75                1.92
98                3.40
100               3.28
92                3.17
87                2.83
84                2.58
88                2.86
80                2.26
82                2.14
76                1.98

[Scatterplot: Ice Cream Sales vs Temperature, with simple regression line Ŷ = a + bX]
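The slide shows the fitted line but not its coefficients. They can be recovered from the table with the least-squares formulas; the printed values below are computed here, not quoted from the slides:

```python
# Fitting Yhat = a + b*X to the ice cream data from the slide.
temp  = [63, 70, 73, 75, 80, 82, 85, 88, 90, 91, 92,
         75, 98, 100, 92, 87, 84, 88, 80, 82, 76]
sales = [1.52, 1.68, 1.80, 2.05, 2.36, 2.25, 2.68, 2.90, 3.14, 3.06, 3.24,
         1.92, 3.40, 3.28, 3.17, 2.83, 2.58, 2.86, 2.26, 2.14, 1.98]

n = len(temp)
xbar, ybar = sum(temp) / n, sum(sales) / n
sp  = sum((x - xbar) * (y - ybar) for x, y in zip(temp, sales))
ssx = sum((x - xbar) ** 2 for x in temp)
b = sp / ssx            # positive slope: sales rise with temperature
a = ybar - b * xbar     # negative intercept (extrapolated to 0 degrees)
print(f"Yhat = {a:.3f} + {b:.4f}X")
```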
TWO STEPS TO SIMPLE LINEAR REGRESSION

1. Descriptive
   - Regression equation: Ŷ = a + bX
   - Correlation coefficient (r)
   - Coefficient of determination (r²)
2. Inferential - hypothesis tests for:
   - the regression model
   - the slope
First Step: Descriptive

Derive the regression / prediction equation:

- Calculate a and b:
  b = SP / SSx
  a = ȳ - b·x̄
  Ŷ = a + bX

Example 1:

Data were collected from a randomly selected sample to determine the relationship between average assignment scores and test scores in statistics. The distribution for the data is presented in the table below.

1. Calculate the coefficient of determination and the correlation coefficient.
2. Determine the prediction equation.
3. Test the hypothesis for the slope at the 0.05 level of significance.

Data set:

ID   Assign (X)   Test (Y)
1    8.5          88
2    6            66
3    9            94
4    10           98
5    8            87
6    7            72
7    5            45
8    6            63
9    7.5          85
10   5            77
1. Derive the regression / prediction equation

Summary statistics:
  n = 10
  ΣX = 72
  ΣY = 775
  ΣX² = 544.5
  ΣY² = 62,441
  ΣXY = 5,795.5

b = SP / SSx = (ΣXY - ΣXΣY/n) / (ΣX² - (ΣX)²/n)
  = 215.5 / 26.1 = 8.257

a = ȳ - b·x̄ = 77.5 - 8.257(7.2) = 18.05

Prediction equation:
  Ŷ = 18.05 + 8.257X

Interpretation of the regression equation:
  Ŷ = 18.05 + 8.257X
  For every 1-unit change in X, Ŷ changes by 8.257 units
  (slope ΔY/ΔX = 8.257; the line crosses the Y-axis at 18.05).
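The hand computation above can be verified directly from the raw scores; the same sums, SP, and SSx fall out, and the rounded coefficients match the slide:

```python
# Checking Example 1's hand computation from the assignment/test-score data.
assign = [8.5, 6, 9, 10, 8, 7, 5, 6, 7.5, 5]
test   = [88, 66, 94, 98, 87, 72, 45, 63, 85, 77]

n = len(assign)
sx, sy = sum(assign), sum(test)            # 72, 775
sxx = sum(x * x for x in assign)           # 544.5
sxy = sum(x * y for x, y in zip(assign, test))  # 5795.5

sp  = sxy - sx * sy / n                    # 215.5
ssx = sxx - sx ** 2 / n                    # 26.1
b = sp / ssx                               # 8.257...
a = sy / n - b * (sx / n)                  # 18.05...
print(round(b, 3), round(a, 2))            # 8.257 18.05
```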
Example 2: MARITAL SATISFACTION

Parents (X)   Children (Y)
1             3
3             2
7             6
9             7
8             8
4             6
5             3

(Summary statistics computed: mean of X, mean of Y, number of pairs, ΣX, ΣY, ΣX², ΣY², the standard deviations, and ΣXY.)

1. Derive the regression / prediction equation

b = SP / SSx = 32 / 49.43 ≈ 0.65

a = ȳ - b·x̄ = 5.00 - 0.647(5.286) ≈ 1.58

Prediction equation:
  Ŷ = 1.58 + 0.65X

Interpretation of the regression equation:
  Ŷ = 1.58 + 0.65X
  For every 1-unit change in X, Ŷ changes by 0.65 units
  (slope ΔY/ΔX = 0.65; the line crosses the Y-axis at 1.58).
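Since a = ȳ − b·x̄ can be sensitive to a sign slip when done by hand, it is worth recomputing both coefficients directly from the seven pairs:

```python
# Recomputing b and a for Example 2 from the parent/child satisfaction pairs.
parents  = [1, 3, 7, 9, 8, 4, 5]
children = [3, 2, 6, 7, 8, 6, 3]

n = len(parents)
xbar, ybar = sum(parents) / n, sum(children) / n   # 5.286, 5.0
sp  = sum((x - xbar) * (y - ybar) for x, y in zip(parents, children))
ssx = sum((x - xbar) ** 2 for x in parents)
b = sp / ssx                 # about 0.65
a = ybar - b * xbar          # about 1.58 (ybar minus b times xbar)
print(round(b, 2), round(a, 2))
```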
SPSS output (simple regression: Grade - PMR MATH on TEACHER_FACTOR):

Descriptive Statistics
                     Mean     Std. Deviation   N
Grade - PMR MATH     2.53     1.468            62
TEACHER_FACTOR       3.9643   .91443           62

Correlations
                                        Grade - PMR MATH   TEACHER_FACTOR
Pearson Correlation  Grade - PMR MATH   1.000              .571
                     TEACHER_FACTOR     .571               1.000
Sig. (1-tailed)      Grade - PMR MATH   .                  .000
                     TEACHER_FACTOR     .000               .
N                    Grade - PMR MATH   62                 62
                     TEACHER_FACTOR     62                 62

Model Summary(b)
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .571a   .326       .315                1.215

a. Predictors: (Constant), TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH
ANOVA(b)
Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  42.848           1    42.848        29.021   .000a
   Residual    88.588           60   1.476
   Total       131.435          61

a. Predictors: (Constant), TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH
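The entries in an ANOVA table like this are tied together by a few identities, which is a useful way to read (and sanity-check) the output:

```python
# ANOVA table identities:
#   SS_total = SS_regression + SS_residual
#   MS = SS / df
#   F = MS_regression / MS_residual
#   R^2 = SS_regression / SS_total
ss_reg, df_reg = 42.848, 1
ss_res, df_res = 88.588, 60

ss_tot = ss_reg + ss_res               # 131.436 (slide shows 131.435; SPSS rounding)
f = (ss_reg / df_reg) / (ss_res / df_res)
r2 = ss_reg / ss_tot
print(round(f, 2), round(r2, 3))       # 29.02 0.326
```

The recovered F (29.02 vs the table's 29.021) and R² (.326, matching the Model Summary) agree to rounding.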
Coefficients(a)
                   Unstandardized Coefficients   Standardized Coefficients
Model              B        Std. Error           Beta                        t        Sig.
1  (Constant)      -1.101   .692                                            -1.591   .117
   TEACHER_FACTOR  .917     .170                 .571                       5.387    .000

a. Dependent Variable: Grade - PMR MATH
SPSS output (adding Race as a second predictor):

Descriptive Statistics
                     Mean     Std. Deviation   N
Grade - PMR MATH     2.53     1.468            62
TEACHER_FACTOR       3.9643   .91443           62
Race                 1.90     .593             62

Correlations
                                        Grade - PMR MATH   TEACHER_FACTOR   Race
Pearson Correlation  Grade - PMR MATH   1.000              .571             -.015
                     TEACHER_FACTOR     .571               1.000            .019
                     Race               -.015              .019             1.000
Sig. (1-tailed)      Grade - PMR MATH   .                  .000             .453
                     TEACHER_FACTOR     .000               .                .440
                     Race               .453               .440             .
N                    Grade - PMR MATH   62                 62               62
                     TEACHER_FACTOR     62                 62               62
                     Race               62                 62               62
Model Summary(b)
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .572a   .327       .304                1.225

a. Predictors: (Constant), Race, TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH
ANOVA(b)
Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  42.939           2    21.469        14.313   .000a
   Residual    88.497           59   1.500
   Total       131.435          61

a. Predictors: (Constant), Race, TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH
Coefficients(a)
                   Unstandardized Coefficients   Standardized Coefficients
Model              B       Std. Error            Beta                        t        Sig.
1  (Constant)      -.980   .853                                              -1.150   .255
   TEACHER_FACTOR  .917    .172                  .571                        5.349    .000
   Race            -.065   .265                  -.026                       -.246    .806

a. Dependent Variable: Grade - PMR MATH
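Comparing the two Model Summaries shows why Race adds nothing: R² barely moves (.326 to .327) while adjusted R², which penalises extra predictors, actually drops (.315 to .304). Both adjusted values can be reproduced from the formula:

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
# where n = sample size and k = number of predictors.
def adj_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adj_r2(0.326, 62, 1), 3))  # 0.315 (TEACHER_FACTOR only)
print(round(adj_r2(0.327, 62, 2), 3))  # 0.304 (TEACHER_FACTOR + Race)
```

Both values match the SPSS Model Summary tables, confirming the second model is not an improvement.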