1
Chapter 10
Linear regression and correlation
Relationship between variables
2
Relationship between variables
Age and blood pressure
Nutrient level and growth of cells
Height and weight
To determine the strength of the relationship between two variables and to test whether it is statistically significant
3
Two-sample t test vs regression

The same data laid out two ways.

For a two-sample t test:

  Group 1   Group 2
  22        18
  21        15
  27        13
  34        20
  30        32

For a regression, with group as X:

  Observation   Group
  22            1
  21            1
  27            1
  34            1
  30            1
  18            2
  15            2
  13            2
  20            2
  32            2
4
Difference, variation and association analysis
  Relationship                 Variable Y   Variable X
  Two-sample t test            quantity     group (0, 1) (category)
  One-way ANOVA                quantity     group (A, B, C) (category)
  Regression and correlation   quantity     quantity
5
Sir Francis Galton (16 February 1822 – 17 January 1911)
Polymath: meteorology (the anti-cyclone and the first popular weather maps); psychology (synaesthesia); biology (the nature and mechanism of heredity); eugenics; criminology (fingerprints); statistics (regression and correlation).
8
Related but Different
Regression analysis:
one of the variables (e.g. blood pressure) is dependent on (caused by) the other, which is fixed and measured without error (e.g. age).
Correlation analysis:
both variables are experimental and measured with error (e.g. height and weight).
9
Regression analysis
The experimental data (repeated experiments):

  Recording number   X: temperature (°Celsius)   Y: heart rate (beats/minute)
  1                  2                            5
  2                  4                            11
  3                  6                            11
  4                  8                            14
  5                  10                           22
  6                  12                           23
  7                  14                           32
  8                  16                           29
  9                  18                           32
10
Correlation analysis
The experimental data (more individuals measured):

  Animal   X: length (cm)   Y: width (cm)
  1        10.7             5.8
  2        11.0             6.0
  3        9.5              5.0
  4        11.1             6.0
  5        10.3             5.3
  6        10.7             5.8
  7        9.9              5.2
  8        10.6             5.7
  9        10.0             5.3
  10       12.0             6.3
11
Regression analysis

Equation for a straight line:

Y = a + bX

If you know a and b, you can predict Y from X: this is the goal of regression.
12
Regression vs correlation
13
Concepts
Simple linear regression
Simple linear correlation
Correlation analysis based on ranks
14
Example
Consider the growth rate of a yeast colony and the nutrient level.
If you increase the nutrient level, the growth rate increases.
Growth rate is dependent on nutrient level, but nutrient level is NOT dependent on growth rate.
15
Growth rate is called the Dependent Variable and is given the symbol Y.
Nutrient level (the causal factor) is called the Independent Variable and is given the symbol X.
Variables in Regression
(Plot: growth rate Y against nutrient level X)
16
Simple linear model assumptions

a) X's are fixed and measured without error
b) The conditional means fall on a straight line: μ_{Y|X} = E(Y|X) = α + βX
c) Y_i = α + βX_i + ε_i, or Y_i = μ_{Y|X_i} + ε_i, with ε_i ~ i.i.d. N(0, σ²) (independent, identically distributed)
d) Homoscedastic: the error variance σ² is the same at every X

α, β: constant real numbers, β ≠ 0
18
General steps for simple linear regression analysis
① Graphing the data
② Fitting the best straight line
③ Testing whether the linear relationship is
statistically significant or not
19
① Graphing the data
② Fitting the best straight line
(Scatterplot panels: no relationship; a relationship but not straight-lined; a negative linear relationship; a positive linear relationship)

Which one? We need a criterion.
20
Example: Area of a yeast colony on successive days.
(Plot: area (y) against time in days (x), with the best-fit straight line; the slope b = H/L is the rise H over the run L, and a is the intercept at x = 0)
21
Problem

(Plot: area (y) against time in days (x))
How to estimate a and b?
22
Method

(Plot: the data points (X_i, Y_i) with the horizontal line Ŷ = Ȳ)

Fitting Ŷ = Ȳ to the data estimates E(Y) = μ_Y.

Total sum of squares for Y: SS_Total = Σ(Y_i − Ȳ)²
23
Method

(Plot: the data points (X_i, Y_i) with the fitted line Ŷ = a + bX)

Fitting Ŷ = a + bX to the data.

Residual error sum of squares: SS_E = Σ(Y_i − Ŷ_i)²

a and b should minimize the residual error.
24
Method

Decompose each deviation from the mean: Y_i − Ȳ = (Y_i − Ŷ_i) + (Ŷ_i − Ȳ)
25
Y_i − Ȳ = (Y_i − Ŷ_i) + (Ŷ_i − Ȳ)

Σ(Y_i − Ȳ)² = Σ[(Y_i − Ŷ_i) + (Ŷ_i − Ȳ)]²
            = Σ(Y_i − Ŷ_i)² + 2Σ(Y_i − Ŷ_i)(Ŷ_i − Ȳ) + Σ(Ŷ_i − Ȳ)²

The cross-product term is 0, so

Σ(Y_i − Ȳ)² = Σ(Y_i − Ŷ_i)² + Σ(Ŷ_i − Ȳ)²

that is,

SS_Total (total sum of squares) = SS_E (residual or error sum of squares) + SS_R (sum of squares due to regression)

The best fit maximizes SS_R and minimizes SS_E.
26
Least squares regression equation

Minimize SS_Error by partial derivatives. Writing the fitted line as Ŷ_i = a' + b(X_i − X̄):

SS_Error = Σ(Y_i − Ŷ_i)² = Σ{Y_i − [a' + b(X_i − X̄)]}²

∂SS_Error/∂a' = −2 Σ{Y_i − [a' + b(X_i − X̄)]} = 0

ΣY_i − na' − b Σ(X_i − X̄) = 0

and since Σ(X_i − X̄) = 0:

ΣY_i − na' = 0, so a' = ΣY_i/n = Ȳ
27
Least squares regression equation

∂SS_Error/∂b = −2 Σ{Y_i − [a' + b(X_i − X̄)]}(X_i − X̄) = 0

ΣY_i(X_i − X̄) − a' Σ(X_i − X̄) − b Σ(X_i − X̄)² = 0

and since Σ(X_i − X̄) = 0:

b = ΣY_i(X_i − X̄) / Σ(X_i − X̄)²
  = [ΣX_iY_i − (ΣX_i)(ΣY_i)/n] / [ΣX_i² − (ΣX_i)²/n]
  = SS_XY / SS_X
28
Result
Least squares regression line:

b = SS_XY/SS_X = [ΣXY − (ΣX)(ΣY)/n] / [ΣX² − (ΣX)²/n]

a = Ȳ − bX̄

Ŷ = μ̂_{Y|X} = a + bX, or equivalently Ŷ = a' + b(X − X̄) with a' = Ȳ
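A minimal Python sketch of these formulas (function and variable names are ours, not from the slides), checked against the yeast data that appears later in the deck:

```python
def least_squares(x, y):
    """Fit Y-hat = a + b*X via the corrected-sum-of-squares formulas above."""
    n = len(x)
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n                    # SS_X
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n  # SS_XY
    b = ss_xy / ss_x                     # slope
    a = sum(y) / n - b * (sum(x) / n)    # intercept: a = Y-bar - b * X-bar
    return a, b

# Yeast data (time in days vs log area, from the example below):
days = [1, 2, 3, 4, 5, 6, 7, 8, 9]
log_area = [3.6, 3.8, 4.2, 4.5, 5.0, 5.2, 5.5, 5.6, 6.1]
a, b = least_squares(days, log_area)     # a = 3.27, b = 0.3117
```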
29
③ Simple Linear Regression Analysis
A global test for the regression (ANOVA)
A test for the regression coefficient (Student's t test)
30
Hypothesis
H0: the variation in Y is not explained by a linear model, i.e., β = 0
Ha: a significant portion of the variation in Y is explained by a linear model, i.e., β ≠ 0
31
Partitioning the Sum of Squares
SS_R = Σ(Ŷ_i − Ȳ)² = Σ(Ȳ + b(X_i − X̄) − Ȳ)² = b² Σ(X_i − X̄)² = b² SS_X
     = (SS_XY/SS_X)² SS_X = SS_XY²/SS_X = b·SS_XY

SS_Total = Σ(Y_i − Ȳ)² = ΣY_i² − (ΣY_i)²/n

SS_E = Σ(Y_i − Ŷ_i)² = SS_Total − SS_R
32
The ANOVA table for a regression analysis:

  Source of variation   SS         DF    MS     E(MS)                F           c.v.
  Regression            SS_R       1     MS_R   σ²_{Y·X} + β² SS_X   MS_R/MS_E   see Table C.7
  Error                 SS_E       n−2   MS_E   σ²_{Y·X}
  Total                 SS_Total   n−1

E(F) ≈ E(MS_R)/E(MS_E) = (σ²_{Y·X} + β² SS_X) / σ²_{Y·X}

If H0 is true (β = 0), E(F) = 1; if Ha is true (β ≠ 0), E(F) > 1.

Test statistic: F = MS_R/MS_E ~ F(1, n − 2)
33
Coefficient of determination
a measure of the amount of the variability in Y that is explained by its dependence on X.
Coeff. of D. = R² = SS_R / SS_Total
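The same corrected sums of squares give the whole ANOVA in a few lines. A sketch (our names; standard library only); on the yeast data of the later example it reproduces SS_R = 5.8282, SS_E = 0.0718 and R² = 98.8% (the deck's F = 565.8 agrees with the unrounded value of about 568 up to intermediate rounding):

```python
def regression_anova(x, y):
    """Partition SS_Total into SS_R + SS_E and form F = MS_R / MS_E ~ F(1, n-2)."""
    n = len(x)
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_total = sum(v * v for v in y) - sum(y) ** 2 / n     # SS_Total = SS_Y
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    ss_r = ss_xy ** 2 / ss_x        # regression SS, df = 1
    ss_e = ss_total - ss_r          # error SS, df = n - 2
    f = ss_r / (ss_e / (n - 2))     # MS_R / MS_E
    r_squared = ss_r / ss_total     # coefficient of determination
    return f, r_squared
```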
34
Simple Linear Regression Analysis
A global test for the regression (ANOVA)
A test for the regression coefficient (Student's t test)
35
Hypothesis
H0: the variation in Y is not explained by a linear model, i.e., β = 0
Ha: a significant portion of the variation in Y is explained by a linear model, i.e., β ≠ 0
36
t test statistic
t = b/s_b ~ t(n − 2)

Variance of b: σ_b² = σ²_{Y·X} / SS_X

Its estimate: s_b² = SS_E / [(n − 2) SS_X] = MS_E / SS_X

Standard error of b: s_b = √(MS_E / SS_X)
37
F(1, n − 2) = t²(n − 2)

ANOVA: F = MS_R/MS_E = b² SS_X / MS_E

Student's t: t = b/s_b = b / √(MS_E/SS_X)

Hence F = t².
38
Confidence interval for β

(b − β)/s_b follows Student's t distribution, so

P(−t_{1−α/2} ≤ (b − β)/s_b ≤ t_{1−α/2}) = 1 − α

C(b − t_{1−α/2} s_b ≤ β ≤ b + t_{1−α/2} s_b) = 1 − α, with df = n − 2

L1 = b − t_{1−α/2} s_b,  L2 = b + t_{1−α/2} s_b
39
Confidence interval for μ_{Y|X}

Since Ŷ_i = Ȳ + b(X_i − X̄), and s_Ȳ² = MS_E/n, s_b² = MS_E/SS_X,

the standard error of Ŷ (its sampling error) is

s_Ŷ² = MS_E [1/n + (X_i − X̄)²/SS_X],  where MS_E = SS_E/(n − 2) and SS_X = Σ(X_i − X̄)²
40
Confidence interval for μ_{Y|X}

(Ŷ − μ_{Y|X})/s_Ŷ follows Student's t distribution, so

C(Ŷ − t_{1−α/2} s_Ŷ ≤ μ_{Y|X} ≤ Ŷ + t_{1−α/2} s_Ŷ) = 1 − α, with df = n − 2

L1 = Ŷ − t_{1−α/2} s_Ŷ,  L2 = Ŷ + t_{1−α/2} s_Ŷ
41
Confidence interval for an individual observation Y_i

Since Y_i = Ȳ + b(X_i − X̄) + ε_i, and s_Ȳ² = MS_E/n, s_b² = MS_E/SS_X,

the standard error of Y_i (sampling error plus the variation of a single observation) is

s_{Y_i}² = MS_E [1 + 1/n + (X_i − X̄)²/SS_X]
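The three intervals differ only in the standard error under the square root. A hedged Python sketch (names are ours; t_crit is t_{1−α/2} with df = n − 2, looked up from a t table):

```python
import math

def regression_intervals(x, y, x0, t_crit):
    """CIs for the slope beta, for the mean response at x0, and for a new Y at x0."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_y = sum(v * v for v in y) - sum(y) ** 2 / n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    b = ss_xy / ss_x
    y_hat = y_bar + b * (x0 - x_bar)                 # predicted mean at x0
    ms_e = (ss_y - ss_xy ** 2 / ss_x) / (n - 2)      # residual mean square
    se_b = math.sqrt(ms_e / ss_x)                                      # for beta
    se_mean = math.sqrt(ms_e * (1 / n + (x0 - x_bar) ** 2 / ss_x))     # for mu_{Y|X}
    se_new = math.sqrt(ms_e * (1 + 1 / n + (x0 - x_bar) ** 2 / ss_x))  # for one Y_i
    return [(c - t_crit * s, c + t_crit * s)
            for c, s in ((b, se_b), (y_hat, se_mean), (y_hat, se_new))]
```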
42
Understand the regression
analysis via example
43
Example 1: yield of tomato varieties

  Variety 1   Variety 2
  22          18
  21          15
  27          13
  34          20
  30          32
  28          27
  21          11
  29          20
  22          16
  14          17
  Totals:
  248         189

Summarized data:

                        Variety 1   Variety 2
  Mean                  24.8        18.9
  St. dev.              5.8271      6.3675
  Variance              33.96       40.54
  No. of observations   10          10
44
A. Student’s t test
Equality of variances: F = s1²/s2² = 33.96/40.54 = 0.84, p = 0.796 → accept H0: there is no difference between the two variances.

H0: μ1 = μ2. Pooled variance: s_p² = (40.54 + 33.96)/2 = 37.3

t = [(24.8 − 18.9) − 0] / √[37.3 (1/10 + 1/10)] = 5.9/2.731 = 2.16, with df = n1 + n2 − 2 = 18

p = 0.045 → reject H0: there is a difference between the two means.
45
B. ANOVA
H0: μ1 = μ2

  Item      df   SS       MS       F(1,18)   P
  Between   1    174.05   174.05   4.67*     <0.05
  Within    18   670.5    37.25
  Total     19   844.55

Conclusion: reject H0.
46
Compare ANOVA with t test
t was 2.16 for 18 df, 0.05 > P > 0.01
F was 4.67 for 1 and 18 df, 0.05 > P > 0.01

In fact, F = t² (i.e. 4.67 = 2.16²).

Why? Because with t we are dealing with differences, while with F we are dealing with variances (differences squared).
47
C. Regression

Recode the data: Y = observed yield, X = variety (1 or 2).

  Observation (Y)   Variety (X)
  22                1
  21                1
  27                1
  34                1
  30                1
  18                2
  15                2
  13                2
  20                2
  32                2
  …                 …

(the full data, as before:)

  Variety 1   Variety 2
  22          18
  21          15
  27          13
  34          20
  30          32
  28          27
  21          11
  29          20
  22          16
  14          17
48
Calculations:

n = 20, ΣX = 30, ΣY = 437

ΣX² = 10·1² + 10·2² = 50
ΣY² = 10393
ΣXY = 626

SS_X = ΣX² − (ΣX)²/n = 50 − 30²/20 = 5
SS_Y = ΣY² − (ΣY)²/n = 10393 − 437²/20 = 844.55
SS_XY = ΣXY − (ΣX)(ΣY)/n = 626 − 30·437/20 = −29.5
49
Estimation:

SS_X = 5, SS_Y = 844.55, SS_XY = −29.5

Regression coefficient: b = SS_XY/SS_X = −29.5/5 = −5.9

Intercept: a = Ȳ − bX̄ = 21.85 − (−5.9)(1.5) = 30.7

Regression equation: Ŷ = 30.7 − 5.9X

Intercept at X̄: a' = Ȳ = 21.85, so equivalently Ŷ = 21.85 − 5.9(X − 1.5)
50
Testing the significance: ANOVA

H0: no linear relation between Y and X, i.e., β = 0
Ha: the variation in Y is linearly explained by the variation in X, i.e., β ≠ 0

Here μ1 = E(Y|X = 1) and μ2 = E(Y|X = 2), so the slope is

β = (μ2 − μ1)/(2 − 1) = μ2 − μ1

and H0: β = 0 is the same hypothesis as H0: μ2 − μ1 = 0.
51
ANOVA sums of squares:

SS_X = 5, SS_Y = 844.55, SS_XY = −29.5

SS_Total = SS_Y = 844.55
SS_R = SS_XY²/SS_X = (−29.5)²/5 = 174.05
SS_E = SS_Total − SS_R = 670.5
52
Regression ANOVA
  Item         df   SS       MS       F       P
  Regression   1    174.05   174.05   4.67*   <0.05
  Error        18   670.5    37.25
  Total        19   844.55

Conclusion: reject H0.
53
Coefficient of determination
a measure of the amount of the variability in y that is explained by its dependence on x.
Coeff. of D. = R² = SS_R/SS_Total = 174.05/844.55 = 20.6%
54
t test for the regression coefficient

H0: β = 0

s_b = √(MS_E/SS_X) = √(37.25/5) = 2.729

t = b/s_b = −5.9/2.729 = −2.16 ~ t(18)
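To see the t test, the ANOVA and the regression agree numerically, a short self-contained Python sketch of Example 1 (variable names are ours):

```python
x = [1] * 10 + [2] * 10                          # variety, recoded as X
y = [22, 21, 27, 34, 30, 28, 21, 29, 22, 14,     # variety 1 yields
     18, 15, 13, 20, 32, 27, 11, 20, 16, 17]     # variety 2 yields
n = len(x)
ss_x = sum(v * v for v in x) - sum(x) ** 2 / n                      # 5
ss_y = sum(v * v for v in y) - sum(y) ** 2 / n                      # 844.55
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n  # -29.5
b = ss_xy / ss_x                                                    # -5.9
ms_e = (ss_y - ss_xy ** 2 / ss_x) / (n - 2)                         # 37.25
t = b / (ms_e / ss_x) ** 0.5                                        # -2.16
f = (ss_xy ** 2 / ss_x) / ms_e                                      # 4.67 = t**2
```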
55
Example 2: yeast data

A yeast colony grown on agar. The area was measured (mm²) on 9 successive days and transformed to logs.

  Time (days)   Area (log mm²)
  1             3.6
  2             3.8
  3             4.2
  4             4.5
  5             5.0
  6             5.2
  7             5.5
  8             5.6
  9             6.1

  Σx = 45   Σy = 43.5
56
(Plot: the untransformed areas grow exponentially with time)
57
Scatterplot of yeast data

(Scatterplot: area (log mm²) against days 1 to 9; an apparent positive linear relationship)
58
Nonlinear -> linear

Power-law: Y = aX^b, so log Y = log a + b log X
Exponential: Y = ab^X, so log Y = log a + X log b
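A small sketch of the two transformations in Python (function names are ours; base-10 logs, as in the yeast example, though any base works):

```python
import math

# Power law:   Y = a * X**b  ->  log Y = log a + b * log X (regress log Y on log X)
# Exponential: Y = a * b**X  ->  log Y = log a + X * log b (regress log Y on X)

def linearize_power(x, y):
    return [math.log10(v) for v in x], [math.log10(v) for v in y]

def linearize_exponential(x, y):
    return list(x), [math.log10(v) for v in y]
```

The yeast data were handled the exponential way: area was logged, time was left untransformed.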
59
Calculations (data as in the table above):

n = 9, ΣX = 45, ΣY = 43.5

ΣX² = 1² + 2² + … + 9² = 285
ΣY² = 3.6² + 3.8² + … + 6.1² = 216.15
ΣXY = 1·3.6 + 2·3.8 + … + 9·6.1 = 236.2

SS_X = ΣX² − (ΣX)²/n = 285 − 45²/9 = 60
SS_Y = ΣY² − (ΣY)²/n = 216.15 − 43.5²/9 = 5.9
SS_XY = ΣXY − (ΣX)(ΣY)/n = 236.2 − 45·43.5/9 = 18.7
60
Estimation:

SS_X = 60, SS_Y = 5.9, SS_XY = 18.7

Regression coefficient: b = SS_XY/SS_X = 18.7/60 = 0.3117

Intercept: a = Ȳ − bX̄ = 4.83 − 0.3117·5 = 3.27

Regression equation: Ŷ = 3.27 + 0.3117X

Intercept at X̄: a' = Ȳ = 4.83, so equivalently Ŷ = 4.83 + 0.3117(X − 5.0)
61
To fit the line

Regression equation: Ŷ = a + bX = 3.27 + 0.3117X

Use two extreme values of x, 0 and 9:
when x = 0, ŷ = 3.27;
when x = 9, ŷ = 3.27 + 0.3117·9 = 6.08.
62
Fitting the best line

(Plot: the data with the fitted line ŷ = 3.27 + 0.3117x, which passes through (0, 3.27), (5, 4.83) and (9, 6.08))
63
Testing the significance: ANOVA

H0: no linear relation between y and x, i.e., β = 0
Ha: the variation in y is linearly explained by the variation in x, i.e., β ≠ 0

  Item         df    SS   MS   F   c.v.
  Regression   1
  Error        n−2
  Total        n−1
64
ANOVA sums of squares:

SS_X = 60, SS_Y = 5.9, SS_XY = 18.7

SS_Total = SS_Y = 5.9
SS_R = SS_XY²/SS_X = 18.7²/60 = 5.8282
SS_E = SS_Total − SS_R = 5.9 − 5.8282 = 0.0718
65
Regression ANOVA
  Item         df   SS       MS       F         P
  Regression   1    5.8282   5.8282   565.8**   <0.01
  Error        7    0.0718   0.0103
  Total        8    5.9000

Conclusion: reject H0 (of no relationship) and conclude that a significant portion of the variability in colony area is explained by regression on time.
66
Coefficient of determination
a measure of the amount of the variability in y that is explained by its dependence on x.
Coeff. of D. = R² = SS_R/SS_Total = 5.8282/5.9 = 98.8%
67
Inference from the yeast data

H0: log area has no linear relationship with time
Ha: log area has a linear relationship with time

Inference: reject H0 and accept Ha, i.e. log area changes linearly with time. Moreover, the regression explains 98.8% of the variation.
68
Confidence interval for β

Standard error of b: s_b² = MS_E/SS_X = 0.0103/60 = 0.0001717, so s_b = √0.0001717 = 0.0131

95% confidence interval for β (t_{1−α/2}(7) = 2.365):

L1 = 0.3117 − 2.365·0.0131 = 0.2807
L2 = 0.3117 + 2.365·0.0131 = 0.3427

Note also t = b/s_b = 0.3117/0.0131 = 23.79, and t² = 565.8 = F.
69
Confidence interval for μ_{Y|X}

At X_i = 5: Ŷ = μ̂_{Y|X=5} = 4.83 + 0.3117(5 − 5) = 4.83

Standard error: s_Ŷ = √(MS_E [1/n + (X_i − X̄)²/SS_X]) = √(0.0103 [1/9 + (5 − 5)²/60]) = 0.034

95% confidence interval limits (t_{1−α/2}(7) = 2.365):

L1 = 4.83 − 2.365·0.034 = 4.75
L2 = 4.83 + 2.365·0.034 = 4.91
70
Summary of Regression
Regression analysis is used when one variable (x) is fixed and likely to cause variation in the other (y).
Graph the data to check that a linear relationship is apparent.
Calculate the regression equation using the least squares method.
Test the significance of this equation with ANOVA.
If significant, plot the equation on the graphed data.
Calculate the required confidence intervals.
71
Example 3: the effect of carbon dioxide on respiration rate

  No.   Partial pressure CO2 (torr)   Respiration rate (breaths/minute)
  1     30                            8.1
  2     32                            8.0
  3     34                            9.9
  4     36                            11.2
  5     38                            11.0
  6     40                            13.2
  7     42                            14.6
  8     44                            16.6
  9     46                            16.7
  10    48                            18.3
  11    50                            18.2

① Construct a scatterplot of these data
② Compute the linear regression equation
③ Test the significance of this equation via ANOVA
④ Calculate the 95% CI for β
⑤ Find the predicted respiration rate for 48 torr and the 95% CI
⑥ Find the predicted respiration rate for 38 torr and the 95% CI
⑦ Why do these two CIs have different lengths?
72
(Scatterplot of CO2 pressure against respiration rate: a positive linear relationship)
73
Compute the linear regression equation:

n = 11
ΣX = 440, X̄ = 40, ΣX² = 18040
ΣY = 145.8, Ȳ = 13.25, ΣY² = 2082.04
ΣXY = 6085

b = [ΣXY − (ΣX)(ΣY)/n] / [ΣX² − (ΣX)²/n] = (6085 − 440·145.8/11) / (18040 − 440²/11) = 0.575

a = Ȳ − bX̄ = 13.25 − 0.575·40 = −9.745

Ŷ = −9.745 + 0.575X
74
(Fitted line plot: res rate = −9.745 + 0.5750 CO2; S = 0.671009, R-Sq = 97.3%, R-Sq(adj) = 97.0%)
75
Test the significance:

SS_Total = ΣY² − (ΣY)²/n = 2082.04 − 145.8²/11 = 149.53
SS_R = [ΣXY − (ΣX)(ΣY)/n]² / [ΣX² − (ΣX)²/n] = 253²/440 = 145.47
SS_E = SS_Total − SS_R = 149.53 − 145.47 = 4.05

  Item         df   SS       MS       F          P
  Regression   1    145.47   145.47   323.10**   <0.01
  Remainder    9    4.05     0.45
  Total        10   149.53

Conclusion: reject H0 and accept Ha; respiration rate changes linearly with partial pressure CO2.
76
t_{1−α/2} = ? The candidates 2.365, 2.306, 2.262 and 2.228 correspond to df = 7, 8, 9 and 10; here df = n − 2 = 9, so t_{1−α/2} = 2.262.

L1 = b − t_{1−α/2} s_b, L2 = b + t_{1−α/2} s_b: the CI for β is [0.503, 0.647]

L1 = Ŷ − t_{1−α/2} s_Ŷ, L2 = Ŷ + t_{1−α/2} s_Ŷ:
the CI for ŷ at 48 torr is [17.53, 18.18], length 0.65;
the CI for ŷ at 38 torr is [11.89, 12.32], length 0.43.
77
The difference in length between the two CIs arises because 48 torr is further from the average partial pressure CO2 (40 torr) than 38 torr is. The CI length increases as the partial pressure moves further from the mean, because it is determined by the standard error of Ŷ, and that standard error depends on the squared deviation of the individual partial pressure from the average:

s_Ŷ = √(MS_E [1/n + (X_i − X̄)²/SS_X])

L1 = Ŷ − t_{1−α/2} s_Ŷ,  L2 = Ŷ + t_{1−α/2} s_Ŷ
78
Confidence interval for μ_{Y|X} (recap)

Since Ŷ_i = Ȳ + b(X_i − X̄), with s_Ȳ² = MS_E/n and s_b² = MS_E/SS_X, the standard error (sampling error) of Ŷ is

s_Ŷ² = MS_E [1/n + (X_i − X̄)²/SS_X],  MS_E = SS_E/(n − 2)

and for an individual observation Y_i:

s_{Y_i}² = MS_E [1 + 1/n + (X_i − X̄)²/SS_X]
79
(Plot: the regression line y = 0.575x − 9.7455, R² = 0.9729, drawn through the data with the narrower CI band for the mean Ŷ and the wider CI band for individual Y_i)
80
Concepts
Simple linear regression
Simple linear correlation
Correlation analysis based on ranks
81
Correlation Analysis
To measure the intensity of the association observed between a pair of variables and to test its statistical significance.
To test whether two variables covary, i.e. are interdependent.
83
Correlation analysis
Two variables, X and Y, have the same status: we cannot tell which is cause and which is effect.
There may be some common underlying cause of both.

Examples:
height and weight
heights of siblings
84
Types of relationship

(Scatterplot panels:)
Positive correlation: large X's are associated with large Y's.
Negative correlation: large X's are associated with small Y's.
No correlation: Y and X have no linear correlation.
85
No linear correlation ≠ independent

Y = X² makes Y completely dependent on X, yet shows no linear correlation. Linear correlation implies dependence, but zero linear correlation does not imply independence.
86
Index of association

Σ [(X_i − X̄)/S_X] [(Y_i − Ȳ)/S_Y]

the product of the standardized normal deviates for X and for Y, summed over the observations.

Positive correlation (large X's associated with large Y's): mostly (X_i − X̄)(Y_i − Ȳ) > 0.
Negative correlation (large X's associated with small Y's): mostly (X_i − X̄)(Y_i − Ȳ) < 0.
87
Pearson product-moment correlation coefficient

Abbr.: Pearson correlation coefficient

r = Σ(X_i − X̄)(Y_i − Ȳ) / [(n − 1) S_X S_Y]
  = SS_XY / √(SS_X SS_Y)
  = [ΣXY − (ΣX)(ΣY)/n] / √{[ΣX² − (ΣX)²/n] [ΣY² − (ΣY)²/n]}

where SS_XY is the corrected cross products, SS_X and SS_Y are the corrected sums of squares of X and Y, and n is the sample size.
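A minimal Python sketch of this formula (names are ours), checked against the chiton data worked through below:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation from the corrected sums of squares."""
    n = len(x)
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_y = sum(v * v for v in y) - sum(y) ** 2 / n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    return ss_xy / math.sqrt(ss_x * ss_y)

# Chiton plate data (length vs width, from the example below): r = 0.969
length = [10.7, 11.0, 9.5, 11.1, 10.3, 10.7, 9.9, 10.6, 10.0, 12.0]
width = [5.8, 6.0, 5.0, 6.0, 5.3, 5.8, 5.2, 5.7, 5.3, 6.3]
r = pearson_r(length, width)
```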
88
Pearson correlation coefficient

A widely used index of the association of two quantitative variables.

An estimate of the population correlation coefficient

ρ_XY = COV(X, Y)/(σ_X σ_Y), where COV(X, Y) = E[(X − E(X))(Y − E(Y))]
89
Linear correlation model

Population correlation coefficient:

ρ = (1/N) Σ [(X − μ_X)/σ_X] [(Y − μ_Y)/σ_Y]
  = Σ(X − μ_X)(Y − μ_Y) / √[Σ(X − μ_X)² Σ(Y − μ_Y)²]

Sample correlation coefficient:

r = [1/(n − 1)] Σ [(X − X̄)/s_x] [(Y − Ȳ)/s_y]
  = SS_XY / √(SS_X SS_Y)
90
Linear correlation model

The Y's at each X are normally distributed, and the X's at each Y are normally distributed.

The conditional population means follow the regression equations

μ_{Y|X} = μ_Y + β_{Y|X}(X − μ_X)
μ_{X|Y} = μ_X + β_{X|Y}(Y − μ_Y)

estimated by

Ŷ = Ȳ + b_{Y|X}(X − X̄)
X̂ = X̄ + b_{X|Y}(Y − Ȳ)
91
Regression vs correlation

b_{Y|X} = SS_XY/SS_X;  b_{X|Y} = SS_XY/SS_Y

b_{Y|X} b_{X|Y} = SS_XY²/(SS_X SS_Y) = r²

SS_R = b² SS_X = (SS_XY/SS_X)² SS_X = SS_XY²/SS_X

Coeff. of D. = SS_R/SS_Total = (SS_XY²/SS_X)/SS_Y = SS_XY²/(SS_X SS_Y) = r²

SS_R: the explainable variability; SS_Total = SS_Y: the total variability.
92
Characteristics of the correlation coefficient

r² = Coeff. of D. = SS_R/SS_Total, so r = ±√(SS_R/SS_Total)

SS_Total = SS_R + SS_E, and 0 ≤ SS_R ≤ SS_Total

Hence 0 ≤ SS_R/SS_Total ≤ 1, i.e. 0 ≤ r² ≤ 1, and therefore −1 ≤ r ≤ 1.
93
r = +1: a complete positive correlation between x and y
r = −1: a complete negative correlation between x and y
r = 0: x and y are not linearly correlated

Correlation has upper and lower limits of +1 and −1 respectively.
94
Test of hypothesis: t test

To test whether r differs from zero.

Hypotheses: H0: ρ = 0; Ha: ρ ≠ 0

The standard error of r is s_r = √[(1 − r²)/(n − 2)]

t = (r − 0)/s_r = r / √[(1 − r²)/(n − 2)] follows the t distribution with df = n − 2.

The critical value of r for a given n and α: r = t_{1−α/2} / √(n − 2 + t²_{1−α/2})
95
Regression = correlation

r² = Coeff. of D. = SS_R/SS_Total = 1 − SS_E/SS_Total, so 1 − r² = SS_E/SS_Total

t_r = r / √[(1 − r²)/(n − 2)] = [SS_XY/√(SS_X SS_Y)] / √[(SS_E/SS_Total)/(n − 2)] = SS_XY / √(SS_X MS_E)

t_b = b/s_b = (SS_XY/SS_X) / √(MS_E/SS_X) = SS_XY / √(SS_X MS_E),  df = n − 2

So the t test for r is identical to the t test for b.
96
Understand the correlation
analysis via example
97
① Preliminary calculations

The length and width of the eight overlapping plates composing the shell of 10 Chiton olivaceous.

  Animal   X: length (cm)   Y: width (cm)
  1        10.7             5.8
  2        11.0             6.0
  3        9.5              5.0
  4        11.1             6.0
  5        10.3             5.3
  6        10.7             5.8
  7        9.9              5.2
  8        10.6             5.7
  9        10.0             5.3
  10       12.0             6.3

n = 10
ΣX = 105.8, X̄ = 10.58, ΣX² = 1123.9
ΣY = 56.4, Ȳ = 5.64, ΣY² = 319.68
ΣXY = 599.31
98
② Calculate the correlation coefficient:

r = [ΣXY − (ΣX)(ΣY)/n] / √{[ΣX² − (ΣX)²/n] [ΣY² − (ΣY)²/n]}
  = [599.31 − 105.8·56.4/10] / √{[1123.9 − 105.8²/10] [319.68 − 56.4²/10]}
  = 0.969

③ Test of significance

Hypotheses: H0: ρ = 0; Ha: ρ ≠ 0

Test statistic: t = (r − 0)/s_r = (0.969 − 0) / √[(1 − 0.969²)/(10 − 2)] = 0.969/0.0871 = 11.14

Since 11.14 > t_{0.05}(8) = 2.306, reject H0.

④ Confidence interval: ??? (this needs Fisher's Z transformation, below)
99
Need different tests of ρ

We can ONLY use the t test for the H0 that ρ = 0, because only in this case (independence) is the distribution of r approximately normal. In all other cases the distribution of r is asymmetrical, so we have to use a different test, as follows.
100
Distribution of r in samples

(Plot: the sampling distribution of r, from −1 to +1, is symmetric when ρ = 0 but skewed when ρ = 0.8)

Fisher's Z transformation must be employed: Z is the inverse hyperbolic tangent of r.
101
Fisher's Z transformation

Transform r to Z (the inverse hyperbolic tangent):

r = tanh Z = (e^Z − e^{−Z}) / (e^Z + e^{−Z})

Z = tanh⁻¹ r = ½ ln[(1 + r)/(1 − r)] = ½ [ln(1 + r) − ln(1 − r)]

Z follows a normal distribution approximately:

Z ~ N(ζ + ρ/[2(n − 1)], 1/(n − 3))

Standard error of Z: σ_Z = 1/√(n − 3)
102
Testing when ρ0 ≠ 0

Hypotheses: H0: ρ = ρ0; Ha: ρ ≠ ρ0

With Z = tanh⁻¹ r and ζ0 = tanh⁻¹ ρ0, the test statistic follows an approximately normal distribution:

u = (Z − ζ0)/σ_Z = (tanh⁻¹ r − tanh⁻¹ ρ0) / √[1/(n − 3)] ~ N(0, 1)
103
Confidence interval for the correlation coefficient

95% CI for ζ:

L1 = Z − 1.960 σ_Z = Z − 1.960/√(n − 3)
L2 = Z + 1.960 σ_Z = Z + 1.960/√(n − 3)

95% CI for ρ, by back-transforming:

r1 = tanh L1,  r2 = tanh L2
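Python's math module provides tanh and atanh, so the transformation, test and interval fit in a few lines (a sketch; names are ours):

```python
import math

def fisher_ci(r, n, z_crit=1.960):
    """Approximate 95% CI for rho via Fisher's Z; SE(Z) = 1/sqrt(n - 3)."""
    z = math.atanh(r)                    # 0.5 * ln((1 + r) / (1 - r))
    half = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half), math.tanh(z + half)  # back to the r scale

# Sibling-height example below: r = 0.558, n = 50 -> about (0.331, 0.724)
lo, hi = fisher_ci(0.558, 50)
```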
104
Example when the expected ρ ≠ 0

For genetical reasons, the correlation of height among sibs (brothers or sisters) is expected to be 0.5, i.e. H0 is that ρ = 0.5.

  Family   X: brother (cm)   Y: sister (cm)
  1        71                69
  2        68                64
  3        66                65
  4        67                64
  5        70                65
  6        71                62
  7        70                62
  8        73                64
  9        72                66
  …        …                 …
  50       66                62

r = SS_XY / √((74)(64)) = 0.558,  r² = 0.3114
105
Test of significance when ρ ≠ 0

For ρ0 = 0.5:  ζ0 = ½ ln[(1 + 0.5)/(1 − 0.5)] = 0.5493
For r = 0.558:  Z = ½ ln[(1 + 0.558)/(1 − 0.558)] = 0.63

Standard error: σ_Z = 1/√(50 − 3) = 0.146

u = (Z − ζ0)/σ_Z = (0.63 − 0.55)/0.146 = 0.55 < 1.960

Therefore we cannot reject H0.
106
Example when the expected ρ ≠ 0

L1 = 0.63 − 1.960·0.146 = 0.63 − 0.2859 = 0.3441
L2 = 0.63 + 1.960·0.146 = 0.63 + 0.2859 = 0.9159

Back-transforming to the r scale:

Z: 0.3441 ≤ ζ ≤ 0.9159 (point estimate 0.63)
r: 0.3311 ≤ ρ ≤ 0.7240 (point estimate 0.558)
107
Concepts
Simple linear regression
Simple linear correlation
Correlation analysis based on ranks
108
Bivariate random sample (not normally distributed)

Examples:
first-grade students' ages and their performance on a standardized test
student heights and GPAs

Goal: to test the relationship between the two variables, i.e. to validate the dependence of two random variables.
109
The characteristics for an index of association

Values only between −1 and +1, inclusive.
The stronger the positive correlation, the closer the value is to +1.
The stronger the negative correlation, the closer the value is to −1.
For uncorrelated pairs of X and Y, the value should be close to 0.
110
Kendall correlation coefficient τ

Directly compare the n observations with each other. For a pair (i, j):

Concordant (C): (X_i − X_j)(Y_i − Y_j) > 0 (agreeing with a positive correlation)
Discordant (D): (X_i − X_j)(Y_i − Y_j) < 0 (agreeing with a negative correlation)
Tie (E): (X_i − X_j)(Y_i − Y_j) = 0

The total number of comparisons is n(n − 1)/2 = C + D + E.

τ = (C − D) / [n(n − 1)/2] = 2(C − D) / [n(n − 1)]

where C − D is the difference between the numbers of concordant and discordant pairs; −1 ≤ τ ≤ 1.
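A direct O(n²) Python sketch of this definition (names are ours), checked against the gifted-referral example that follows:

```python
def kendall_tau(x, y):
    """Kendall's tau by comparing every pair: C concordant, D discordant."""
    n = len(x)
    c = d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1      # concordant pair
            elif s < 0:
                d += 1      # discordant pair
            # s == 0: tied pair, counted in neither C nor D
    return 2 * (c - d) / (n * (n - 1))

# Months after cut-off vs students evaluated (data below): tau = -0.606
months = list(range(1, 13))
students = [53, 47, 32, 42, 35, 32, 37, 38, 27, 24, 29, 27]
tau = kendall_tau(months, students)
```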
111
① Rank X and Y
② Compare each observation with the other observations below it
③ Summarize C, D and E

  X_i   Y_i   Rank X_i   Rank Y_i   Concordant below   Discordant below   Tied below
  1     53    1          12         0                  11                 0
  2     47    2          11         0                  10                 0
  3     32    3          5.5        4                  4                  1
  4     42    4          10         0                  8                  0
  5     35    5          7          2                  5                  0
  6     32    6          5.5        2                  4                  0
  7     37    7          8          1                  4                  0
  8     38    8          9          0                  4                  0
  9     27    9          2.5        1                  1                  1
  10    24    10         1          2                  0                  0
  11    29    11         4          0                  1                  0
  12    27    12         2.5        0                  0                  0

C = 4 + … = 12
D = 11 + … = 52
E = 1 + 1 = 2

④ Calculate the Kendall correlation coefficient:

τ = 2(C − D)/[n(n − 1)] = 2(12 − 52)/[12(12 − 1)] = −0.606
112
The relative age effects on academic and social performance in Geneva.

X: the month after the cut-off date
Y: the number of students in grades K through 4 evaluated for the district's Gifted and Talented Student Program

  Birth month   Month after cut-off (X)   Students evaluated (Y)
  Dec           1                         53
  Jan           2                         47
  Feb           3                         32
  Mar           4                         42
  Apr           5                         35
  May           6                         32
  Jun           7                         37
  Jul           8                         38
  Aug           9                         27
  Sep           10                        24
  Oct           11                        29
  Nov           12                        27

The older students tend to be overrepresented and the younger ones underrepresented.
113
⑥ Test significance: two-tailed test

Hypotheses: H0: τ = 0; Ha: τ ≠ 0

Test statistic: min(C + E/2, D + E/2) = min(12 + 1, 52 + 1) = 13

From Table C.12 for n = 12: p = 2·0.0027 = 0.0054 < 0.01, so reject H0.

Conclusion: there is a moderately strong negative correlation between month after cut-off date and referrals for gifted evaluation.
114
Spearman's rank correlation coefficient r_s

One of the most common correlation coefficients.

Idea: rank the X and Y observations separately and compute the Pearson correlation coefficient on the ranks rather than on the original data.

Advantage: simpler to compute:

r_s = 1 − 6 Σ_{i=1}^{n} d_i² / [n(n² − 1)],  where d_i = r_{x_i} − r_{y_i}
115
Why the shortcut works (no ties): the ranks of each variable are the integers 1, …, n, so

Σ r_x = Σ r_y = n(n + 1)/2
Σ r_x² = Σ r_y² = n(n + 1)(2n + 1)/6
SS_rx = SS_ry = n(n + 1)(2n + 1)/6 − n(n + 1)²/4 = n(n² − 1)/12

Since d_i = r_{x_i} − r_{y_i},

Σ d_i² = Σ r_x² − 2 Σ r_x r_y + Σ r_y²,  so  Σ r_x r_y = [Σ r_x² + Σ r_y² − Σ d_i²]/2

Pearson's r computed on the ranks is therefore

r_s = [Σ r_x r_y − (Σ r_x)(Σ r_y)/n] / √(SS_rx SS_ry)
    = [n(n + 1)(2n + 1)/6 − Σ d_i²/2 − n(n + 1)²/4] / [n(n² − 1)/12]
    = [n(n² − 1)/12 − Σ d_i²/2] / [n(n² − 1)/12]
    = 1 − 6 Σ d_i² / [n(n² − 1)]
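A Python sketch of the shortcut (names are ours; ties would need midranks, which this simple ranking does not produce, but the soccer example below has no tied Y values):

```python
def spearman_rs(x, y):
    """Spearman's r_s via the rank-difference shortcut (assumes no ties)."""
    n = len(x)

    def ranks(values):
        order = sorted(range(n), key=lambda i: values[i])
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    d_sq = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_sq / (n * (n * n - 1))

# Month vs actual-minus-expected players (data below): r_s = -0.727
months = list(range(1, 13))
diff = [8.73, 5.62, 13.74, -2.60, -0.16, 2.95,
        -3.38, -6.83, -6.16, -7.71, -0.93, -3.27]
rs = spearman_rs(months, diff)
```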
116
The distribution of German professional players' birthdays and that of the general population of Germany.

X: the month
Y: the difference between actual and expected numbers of professional players

  Month   Actual players   Expected players   Difference
  1       37               28.27              8.73
  2       33               27.38              5.62
  3       40               26.26              13.74
  4       25               27.60              -2.60
  5       29               29.16              -0.16
  6       33               30.05              2.95
  7       28               31.38              -3.38
  8       25               31.83              -6.83
  9       25               31.16              -6.16
  10      23               30.71              -7.71
  11      30               30.93              -0.93
  12      27               30.27              -3.27
117
① Rank X and Y
② Compute the differences d_i = r_x − r_y
③ Calculate r_s

  X: month   Y: difference   r_x   r_y   d_i
  1          8.73            1     11    -10
  2          5.62            2     10    -8
  3          13.74           3     12    -9
  4          -2.60           4     6     -2
  5          -0.16           5     8     -3
  6          2.95            6     9     -3
  7          -3.38           7     4     3
  8          -6.83           8     2     6
  9          -6.16           9     3     6
  10         -7.71           10    1     9
  11         -0.93           11    7     4
  12         -3.27           12    5     7

Σ d_i² = (−10)² + … + 7² = 494

r_s = 1 − 6 Σ d_i² / [n(n² − 1)] = 1 − 6·494/[12(12² − 1)] = −0.727

A negative correlation.
118
④ Test significance

Hypotheses: H0: ρ_s = 0; Ha: ρ_s ≠ 0

Table C.13 critical value = 0.587, for n = 12 and α = 0.05.

Since |r_s| = 0.727 > 0.587, reject H0.

Inference: there is an excess in the number of players born early in the competition year and a lack of those born late among professional soccer players in Germany.
119
Compare the two coefficients
Kendall's correlation coefficient τ: easy to test for a significant difference from zero.
Spearman's rank correlation coefficient r_s: more common; related to Pearson's correlation coefficient.

The two produce no radically different correlation values.
120
Comparing different regressions

A common situation is to compare your regression with a published one, or to compare the regressions in two or more experiments you have done.

Two linear regression equations:

Ŷ1 = a1 + b1 X1
Ŷ2 = a2 + b2 X2
121
Compare intercepts: using a t test

Hypothesis: H0: α1 = α2, i.e. α1 − α2 = 0

Standard error for a1 − a2:

s_{a1−a2} = √{ s²_{Y·X} [1/n1 + 1/n2 + X̄1²/SS_X(1) + X̄2²/SS_X(2)] },  SS_X(k) = Σ(X_k − X̄_k)²

with the pooled residual variance

s²_{Y·X} = [s²_{Y·X(1)}(n1 − 2) + s²_{Y·X(2)}(n2 − 2)] / [(n1 − 2) + (n2 − 2)]

t = (a1 − a2)/s_{a1−a2} follows the t distribution with df = (n1 − 2) + (n2 − 2)
122
Compare slopes: using a t test

Hypothesis: H0: β1 = β2, i.e. β1 − β2 = 0

Standard error for b1 − b2:

s_{b1−b2} = √{ s²_{Y·X} [1/SS_X(1) + 1/SS_X(2)] }

with the same pooled residual variance s²_{Y·X} as above.

t = (b1 − b2)/s_{b1−b2} follows the t distribution with df = (n1 − 2) + (n2 − 2)
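A Python sketch of the slope comparison, taking each experiment's summary quantities as inputs (names are ours; ss_x is Σ(X − X̄)² and ss_e the residual sum of squares SS_E):

```python
import math

def compare_slopes(b1, ss_x1, ss_e1, n1, b2, ss_x2, ss_e2, n2):
    """t statistic for H0: beta1 = beta2, pooling the two residual variances."""
    df = (n1 - 2) + (n2 - 2)
    s2 = (ss_e1 + ss_e2) / df                       # pooled s^2_{Y.X}
    se = math.sqrt(s2 * (1 / ss_x1 + 1 / ss_x2))    # SE of b1 - b2
    return (b1 - b2) / se, df   # compare |t| with t_{1-alpha/2}(df)
```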
123
Combining data from two experiments

If the data for the two experiments are not significantly different, they can be combined. The combined estimate of b is

b = [SS_XY(1) + SS_XY(2)] / [SS_X(1) + SS_X(2)]

and its standard error is

s_b = √{ [SS_E(1) + SS_E(2)] / [(n1 − 2) + (n2 − 2)] / [SS_X(1) + SS_X(2)] }

Ideally, one should combine the original data.
124
Two experiments

(Plots of Expt. 1 and Expt. 2 on common axes:)
Same slopes, different means.
Different slopes; the means may or may not differ.
125
Compare two different correlations

Use Fisher's Z transformation. Hypothesis: H0: ρ1 = ρ2, i.e. ζ1 = ζ2

Variance for Z1 − Z2:

σ²_{Z1−Z2} = 1/(n1 − 3) + 1/(n2 − 3)

Test statistic:

u = [(Z1 − Z2) − (ζ1 − ζ2)] / √[1/(n1 − 3) + 1/(n2 − 3)] ~ N(0, 1)
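A matching Python sketch (names are ours):

```python
import math

def compare_correlations(r1, n1, r2, n2):
    """u statistic for H0: rho1 = rho2, via Fisher's Z; compare with N(0, 1)."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / se   # e.g. |u| > 1.960 rejects H0 at alpha = 0.05
```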