STATISTICS
Regression & Correlation

Outline
X, Y & Regression Models
Simple linear regression (SLR)
The logic of SLR: SST = SSR + SSE
SLR: ANOVA table & R-square
Comparison of SLR, ANOVA, and the two-sample t test
Multiple Linear Regression
Pearson’s correlation coefficient (r)
Relationships among R², r, and b
Relationships among Z, t, F, and χ²

X and Y
X: Predictor variables; predictors; covariates; explanatory variables; independent variables.
Y: Outcome; response; dependent variables.

Univariate analysis: 1 X, 1 Y
X | Y | Comparison | Method
Binary | Num. (normal) | 2 indep. means | Two-sample t test*
Categorical | Num. (normal) | >= 2 indep. means | One-way ANOVA*
Binary | Num. (non-normal) | 2 indep. medians | Wilcoxon rank sum
Categorical | Num. (non-normal) | >= 2 indep. medians | Kruskal-Wallis
Num. (normal) | Num. (normal) | - | Regression*
- | Num. (normal) | 2 related means | Paired t
- | Num. (non-normal) | 2 related medians | Wilcoxon signed rank
Categorical | Categorical | X related to Y | Pearson's Chi-sq
- | Categorical (binary) | 2 related prop. | McNemar Chi-sq
Categorical (binary) | Categorical (binary) | 2 indep. prop. | Pearson's Chi-sq
Categorical (binary) | Categorical (binary) | 2 indep. prop. | 2-Z
Note: methods marked with * require the following assumptions: normality, independence, etc.
Abbreviations: Cat.: categorical; Num.: numerical

Multivariate analysis: Xs, 1 Y
Xs | Y | Method
Categorical | Cat. | Log-linear
Cat.+Num. | Cat. (binary) | Logistic regression
Cat.+Num. | Cat. (>= 3) | Logistic regression
 | | Discriminant analysis*
 | | Cluster analysis
 | | Propensity scores
 | | CART
Cat. | Num. | ANOVA*
 | | MANOVA*
Num. | Num. | Multiple regression*
Cat.+Num. | Num. (censored) | Cox proportional hazards model
Confounding factors | Num. | ANCOVA*
 | | MANOVA*
 | | GEE*
Confounding factors | Cat. | Mantel-Haenszel
Num. | | Factor analysis
Note: methods marked with * require the following assumptions: multivariate normality, independence, etc.
Abbreviations: Cat.: categorical; Num.: numerical; CART: classification and regression tree; ANOVA: analysis of variance; ANCOVA: analysis of covariance; MANOVA: multivariate analysis of variance; GEE: generalized estimating equations

Regression Models
Mathematical models to describe the relationship between Y and X
The use of regression models:
Adjustment
Prediction
Finding important factors for Y

Regression Models
Definition: Mathematical models to describe the relationship between Y and X
Purpose: The use of regression models: finding important factors for Y and/or prediction
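
Simple linear regression (SLR)
Model:
$Y = \beta_0 + \beta_1 X + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$
$E(Y \mid X) = \beta_0 + \beta_1 X$
$Y \mid X \sim N(\beta_0 + \beta_1 X, \sigma^2)$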

SLR Example
Is there a linear relationship between age and cholesterol?
ID AGE CHOL
1 34 141.4
2 39 180.5
3 44 178.4
4 46 212
5 48 203.2
6 51 224.1
7 53 186
8 60 350
9 61 286.3
10 65 287.6
11 66 330.3
12 67 311.3

SLR: parameter estimation
The least squares method
Point estimate:
$\min_{\beta_0, \beta_1} \sum_{i=1}^{N} (Y_i - \beta_0 - \beta_1 X_i)^2$
$\hat{\beta}_1$: estimated slope
$\hat{\beta}_0$: estimated intercept
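
The logic of SLR: SST = SSR + SSE
[Figure: observed points $Y_1$, $Y_2$, the fitted line $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$, and the mean $\bar{Y}$, showing at $X_1$ the deviations $Y_1 - \bar{Y}$, $Y_1 - \hat{Y}_1$, and $\hat{Y}_1 - \bar{Y}$]
$\sum (Y - \bar{Y})^2 = \sum (Y - \hat{Y} + \hat{Y} - \bar{Y})^2 = \sum (Y - \hat{Y})^2 + \sum (\hat{Y} - \bar{Y})^2$
(the cross-product term sums to zero for the least squares fit)
SST = SSE + SSR
$Y_i - \bar{Y}$: total amount unexplained at Xi
$Y_i - \hat{Y}_i$: amount at Xi unexplained by regression
$\hat{Y}_i - \bar{Y}$: amount at Xi explained by regression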
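
SLR: parameter estimation
The least squares method: minimize SSE
Point estimates: take the partial derivatives of S with respect to the intercept and the slope and set them to zero; solving gives the intercept and the slope.
$S = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - \beta_0 - \beta_1 X_i)^2$
$\frac{\partial S}{\partial \beta_0} = -2 \sum (Y_i - \beta_0 - \beta_1 X_i) = 0$
$\frac{\partial S}{\partial \beta_1} = -2 \sum X_i (Y_i - \beta_0 - \beta_1 X_i) = 0$
Slope: $b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$
Intercept: $b_0 = \bar{Y} - b_1 \bar{X}$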
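
The closed-form estimates can be checked numerically. The following is a minimal sketch in plain Python, using the age and cholesterol values from the SLR example; the function name `least_squares` is an illustrative choice, not part of the slides.

```python
# Age and cholesterol values copied from the SLR example table.
age = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
chol = [141.4, 180.5, 178.4, 212, 203.2, 224.1,
        186, 350, 286.3, 287.6, 330.3, 311.3]


def least_squares(x, y):
    """Return (b0, b1) that minimize sum((y_i - b0 - b1 * x_i)^2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1


b0, b1 = least_squares(age, chol)
# The first normal equation says the residuals sum to zero.
residual_sum = sum(yi - b0 - b1 * xi for xi, yi in zip(age, chol))
print(f"CHOL = {b0:.4f} + {b1:.4f} * Age")
```

The printed coefficients should agree with the estimated model reported for this data set (intercept about -57.60, slope about 5.65).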

SLR example: Regression line
[Figure: scatter plot "CHOL vs Age" with the fitted regression line; x-axis Age (30 to 70), y-axis CHOL (100 to 350)]
Estimated model: CHOL = -57.5965 + 5.6502 × Age

SLR: ANOVA table & R-square
Source | DF | SS | MSS | F | p | Power (5%)
Intercept | 1 | 696538.3 | 696538.3 | | |
Slope | 1 | 42705.43 | 42705.43 | 45.4538 | 0.0001 | 1.0000
Error | 10 | 9395.352 | 939.5352 | | |
Adj. Total | 11 | 52100.78 | 4736.435 | | |
Total | 12 | 748639.1 | | | |
R² = 0.82, p = 0.0001
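
The sums of squares in the ANOVA table follow directly from the decomposition SST = SSR + SSE. The sketch below recomputes them for the age and cholesterol data in plain Python; the helper name `slr_anova` and the dictionary keys are illustrative assumptions. The p-value is not computed, since that would need an F distribution routine.

```python
age = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
chol = [141.4, 180.5, 178.4, 212, 203.2, 224.1,
        186, 350, 286.3, 287.6, 330.3, 311.3]


def slr_anova(x, y):
    """Fit Y = b0 + b1*X by least squares and return the ANOVA quantities."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                # adjusted total
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)           # slope (regression)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error
    mse = sse / (n - 2)
    return {"sst": sst, "ssr": ssr, "sse": sse, "mse": mse,
            "f": ssr / mse, "r2": ssr / sst}


res = slr_anova(age, chol)
for name, value in res.items():
    print(f"{name}: {value:.4f}")
```

The values should reproduce the Slope, Error, and Adj. Total rows of the table, with F = MSR / MSE on (1, n - 2) degrees of freedom.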

SLR: qualitative covariate
Example: X = treatment (1 or 0), Y = SBP
Hypothesis: H0: β1 = 0; H1: β1 ≠ 0
Comparison with the test of means: H0: μ1 = μ0; H1: μ1 ≠ μ0
Note: β1 = μ1 - μ0
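
Comparison of SLR, ANOVA, and the two-sample t test
Two-sample t test → ANOVA; two-sample t test → SLR
H0: μ1 = μ0 → H0: β1 = 0
Group labels: ID | Y | X
1 | 140 | A
2 | 135 | B
Dummy coding: ID | Y | X
1 | 140 | 0
2 | 135 | 1
Dummy variables: K groups require K-1 dummy variables
ANOVA → SLR
H0: μ1 = μ2 = μ3 → H0: β1 = β2 = 0
Group labels: ID | Y | X
1 | 140 | A
2 | 135 | B
3 | 130 | C
Dummy coding: ID | Y | X1 | X2
1 | 140 | 0 | 0
2 | 135 | 0 | 1
3 | 130 | 1 | 0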
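
The equivalence between the two-sample t test and SLR on a 0/1 dummy variable can be seen numerically. The sketch below uses hypothetical SBP values (made up for illustration, not taken from the slides) and compares the pooled-variance t statistic with the t statistic of the regression slope.

```python
import math

# Hypothetical SBP values for two groups (illustrative only).
y0 = [140, 135, 150, 145]   # group coded X = 0
y1 = [130, 128, 138, 132]   # group coded X = 1


def mean(values):
    return sum(values) / len(values)


# Pooled-variance two-sample t statistic for H0: mu1 = mu0.
n0, n1 = len(y0), len(y1)
ss0 = sum((v - mean(y0)) ** 2 for v in y0)
ss1 = sum((v - mean(y1)) ** 2 for v in y1)
pooled_var = (ss0 + ss1) / (n0 + n1 - 2)
t_two_sample = (mean(y1) - mean(y0)) / math.sqrt(pooled_var * (1 / n0 + 1 / n1))

# SLR of Y on the dummy variable X, testing H0: beta1 = 0.
x = [0] * n0 + [1] * n1
y = y0 + y1
n = len(y)
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar
sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
t_slope = b1 / math.sqrt(sse / (n - 2) / sxx)

print(f"b0 = {b0:.3f} (mean of group 0), b1 = {b1:.3f} (mean difference)")
print(f"two-sample t = {t_two_sample:.4f}, slope t = {t_slope:.4f}")
```

With this coding the intercept is the mean of the X = 0 group, the slope is the difference in means, and the two t statistics coincide.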
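
Multiple Linear Regression
Model:
$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon$
$E(Y) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_p X_p$
Example: Is Age a predictor for SBP adjusting for Sex?
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \text{AGE} + \hat{\beta}_2 \text{SEX}$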
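
MLR: example
[Figure: SBP (y-axis) vs Age (x-axis) with two fitted lines sharing the slope $\hat{\beta}_1$, one for female and one for male: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \text{AGE}$ and $\hat{Y} = \hat{\beta}_0^{*} + \hat{\beta}_1 \text{AGE}$, with intercepts $\hat{\beta}_0$ and $\hat{\beta}_0^{*}$]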

Pearson’s correlation coefficient (r)
Relationship between X and Y
$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$
Properties of Pearson’s r
Range: $-1 \le r \le 1$
Unitless
Good for normally distributed X and Y
r can be viewed as the cosine of the angle between two vectors (the mean-centered X and Y) in multidimensional space
Spearman’s correlation coefficient
Pearson’s r for ranked X and Y
Good for non-normally distributed X and Y

Spearman’s Rho: rank correlation
Relationship between X and Y
Spearman’s correlation coefficient
Pearson’s r for ranked X and Y
Good for non-normally distributed X and Y
$r_S = \frac{\sum (R_X - \bar{R}_X)(R_Y - \bar{R}_Y)}{\sqrt{\sum (R_X - \bar{R}_X)^2 \sum (R_Y - \bar{R}_Y)^2}}$
$t = r_S \sqrt{\frac{n - 2}{1 - r_S^2}}$
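
Both coefficients can be computed with the same function, since Spearman's coefficient is Pearson's r applied to ranks. A minimal sketch on the age and cholesterol data follows; the helper names (`pearson_r`, `average_ranks`, `spearman_rs`) are illustrative, and ties are given average ranks, which is one common convention.

```python
import math

age = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
chol = [141.4, 180.5, 178.4, 212, 203.2, 224.1,
        186, 350, 286.3, 287.6, 330.3, 311.3]


def pearson_r(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rs(x, y):
    return pearson_r(average_ranks(x), average_ranks(y))


n = len(age)
r = pearson_r(age, chol)
rs = spearman_rs(age, chol)
t = rs * math.sqrt((n - 2) / (1 - rs ** 2))   # test statistic on n - 2 df
print(f"Pearson r = {r:.4f}, Spearman r_S = {rs:.4f}, t = {t:.3f}")
```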

Assumptions in Regression
Linear
Independent
Normal distribution
Equal variance
Note: for all values of x, the errors ε are independent and normally distributed, with mean μ = 0 and the same SD σ = σ(ε).
[Figure: two plots of Weight (y-axis) vs Height (x-axis), each with the line y = α + βx]
$Y_i = \alpha + \beta X_i + \varepsilon_i$
α and β are the unknown parameters; ε = random error fluctuations
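
Relationships among R², r, and b
r and b:
$b = r \frac{SD_Y}{SD_X}$, equivalently $r = b \sqrt{\frac{\sum (X_i - \bar{X})^2}{\sum (Y_i - \bar{Y})^2}}$
r²: coefficient of determination: the proportion of the variability among the observed values of Y that is explained by the linear regression of Y on X.
$r^2 = \frac{SSR}{\sum (Y - \bar{Y})^2} = 1 - \frac{SSE}{\sum (Y - \bar{Y})^2}$
$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$
$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$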
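
These identities can be verified on the age and cholesterol data. The sketch below is one way to do so in plain Python; `statistics.stdev` (the sample standard deviation) is assumed for SD, and the variable names are illustrative.

```python
import math
import statistics

age = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
chol = [141.4, 180.5, 178.4, 212, 203.2, 224.1,
        186, 350, 286.3, 287.6, 330.3, 311.3]

n = len(age)
x_bar, y_bar = sum(age) / n, sum(chol) / n
sxx = sum((x - x_bar) ** 2 for x in age)
syy = sum((y - y_bar) ** 2 for y in chol)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, chol))

r = sxy / math.sqrt(sxx * syy)   # Pearson's r
b = sxy / sxx                    # regression slope
a = y_bar - b * x_bar            # intercept
sse = sum((y - a - b * x) ** 2 for x, y in zip(age, chol))

# b = r * SD_Y / SD_X, and r^2 = 1 - SSE / sum((Y - Ybar)^2)
b_from_r = r * statistics.stdev(chol) / statistics.stdev(age)
r2_from_sse = 1 - sse / syy
print(f"b = {b:.4f}, r * SD_Y / SD_X = {b_from_r:.4f}")
print(f"r^2 = {r ** 2:.4f}, 1 - SSE/SST = {r2_from_sse:.4f}")
```

Because SD_Y / SD_X is always positive, the first identity also shows that r and b share the same sign.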

Relationship between r and b: same sign
[Figures: one panel with large r and small b; one panel with small r and large b]
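
Standard deviations around the regression line (1)
(1) SE of estimate
Term used by 楊志良: 迴歸線的標準差 (standard deviation of the regression line)
Meaning: the variability of the vertical distances between any observed Y and the regression line; the standard deviation computed with the regression line in place of the mean.
(2) SE of the regression line (SE of RL; the standard deviation of the sampling distribution of Ŷ)
Term used by 楊志良: 迴歸線標準誤 (standard error of the regression line)
Meaning: the standard error of Y computed over repeated samples taken at the same X values, i.e. the standard deviation of the second-level normal distribution of Ŷ; used for the CI of a single E(y).
(3) SE of prediction
Term used by 楊志良: 估計標準誤 (standard error of estimate)**
Meaning: the standard error when predicting Y from one X, i.e. the standard deviation of the first-level normal distribution of Y at a given X value.
** This term is easily confused.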
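
Standard deviations around the regression line (2)
The Standard Error of the Estimate:
$S^2_{Y \cdot X} = V(Y \mid X) = \frac{\sum (Y - \hat{Y})^2}{n - 2} = \frac{\sum (Y - \bar{Y})^2 - b_1^2 \sum (X - \bar{X})^2}{n - 2} = \frac{\sum (Y - \bar{Y})^2 (1 - r^2)}{n - 2}$
SE of RL:
$S^2_{\hat{Y}} = V(\hat{Y}) = V(b_0 + b_1 X) = V[\bar{Y} + b_1 (X - \bar{X})] = V(\bar{Y}) + (X - \bar{X})^2 V(b_1) + 2 (X - \bar{X}) COV(\bar{Y}, b_1)$
$= \frac{\sigma^2}{n} + \frac{(X - \bar{X})^2 \sigma^2}{\sum (X_i - \bar{X})^2}$ ... from Note (a), with $COV(\bar{Y}, b_1) = 0$
SE of prediction:
$S^2_{Y - \hat{Y}} = V(Y - \hat{Y}) = V(Y) + V(\hat{Y}) - 2 COV(Y, \hat{Y})$
$= \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right]$ ... from (2) above, with $COV(Y, \hat{Y}) = 0$ for a new observation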
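
To make the three quantities concrete, the sketch below evaluates them for the age and cholesterol data at a hypothetical age x0 = 50 (a value chosen only for illustration), with σ² replaced by its estimate S²(Y.X) = SSE / (n - 2). Variable names are illustrative.

```python
import math

age = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
chol = [141.4, 180.5, 178.4, 212, 203.2, 224.1,
        186, 350, 286.3, 287.6, 330.3, 311.3]

n = len(age)
x_bar, y_bar = sum(age) / n, sum(chol) / n
sxx = sum((x - x_bar) ** 2 for x in age)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, chol)) / sxx
b0 = y_bar - b1 * x_bar
sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(age, chol))

x0 = 50  # hypothetical age at which to evaluate the standard errors

# (1) SE of estimate: spread of the observations around the line.
s2 = sse / (n - 2)
se_estimate = math.sqrt(s2)
# (2) SE of the regression line at x0: uncertainty of the fitted mean E(Y | x0).
se_line = math.sqrt(s2 * (1 / n + (x0 - x_bar) ** 2 / sxx))
# (3) SE of prediction at x0: uncertainty for one new observation at x0.
se_pred = math.sqrt(s2 * (1 + 1 / n + (x0 - x_bar) ** 2 / sxx))

print(f"SE of estimate   = {se_estimate:.3f}")
print(f"SE of RL at x0   = {se_line:.3f}")
print(f"SE of prediction = {se_pred:.3f}")
```

The prediction variance is the sum of the other two variances, which is why a prediction interval for an individual is always wider than the confidence interval for the mean at the same X.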
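
Standard deviations around the regression line (3)
Note (a): variance of b1
$V(b_1) = V\left[ \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} \right] = V\left[ \frac{\sum (X_i - \bar{X}) Y_i}{\sum (X_i - \bar{X})^2} \right] = \frac{\sum (X_i - \bar{X})^2 V(Y)}{\left[ \sum (X_i - \bar{X})^2 \right]^2}$
$V(b_1) = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}$
Note (b): variance of b0
$V(b_0) = V(\bar{Y} - b_1 \bar{X}) = V(\bar{Y}) + \bar{X}^2 V(b_1) - 2 \bar{X} COV(\bar{Y}, b_1)$
$= \frac{\sigma^2}{n} + \frac{\bar{X}^2 \sigma^2}{\sum (X_i - \bar{X})^2}$ ... from Note (a), with $COV(\bar{Y}, b_1) = 0$
$= \sigma^2 \left( \frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \right)$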
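
As a rough check of Notes (a) and (b), the sketch below estimates the standard errors of b1 and b0 for the age and cholesterol data, substituting SSE / (n - 2) for σ². The squared t statistic of the slope should equal the F statistic of the SLR ANOVA table (45.4538). Variable names are illustrative.

```python
import math

age = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
chol = [141.4, 180.5, 178.4, 212, 203.2, 224.1,
        186, 350, 286.3, 287.6, 330.3, 311.3]

n = len(age)
x_bar, y_bar = sum(age) / n, sum(chol) / n
sxx = sum((x - x_bar) ** 2 for x in age)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, chol)) / sxx
b0 = y_bar - b1 * x_bar
s2 = sum((y - b0 - b1 * x) ** 2 for x, y in zip(age, chol)) / (n - 2)

se_b1 = math.sqrt(s2 / sxx)                         # Note (a)
se_b0 = math.sqrt(s2 * (1 / n + x_bar ** 2 / sxx))  # Note (b)
t_b1 = b1 / se_b1                                   # tests H0: beta1 = 0, df = n - 2

print(f"SE(b1) = {se_b1:.4f}, SE(b0) = {se_b0:.3f}")
print(f"t = {t_b1:.4f}, t^2 = {t_b1 ** 2:.4f}")
```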
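
Example: Blood cholesterol was measured in 10 men aged 30-39 at baseline (X) and again 10 years later (Y); the two measurements are compared below (data source: 彭游, 生物統計學, 89 年, p. 374). Questions:
What is the regression coefficient? What is the intercept?
What is the correlation coefficient r?
Is the correlation coefficient statistically significant? Given F0.05(1,8) = 5.32.
How much of the variation in cholesterol 10 years later is accounted for by the variation in cholesterol 10 years earlier?
Is the sample regression coefficient statistically significant?
A man currently has a cholesterol level of 350. Predict his cholesterol 10 years later and its 95% CI.
A group of men has a mean cholesterol of 350. What is their cholesterol 10 years later and its 95% CI?
Partial solution:

Example: partial solution (continued)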

Topic: predicting a nominal or categorical outcome Y
Diseased or not; died or not
Odds ratio (勝算比; 危險對比值)
Study designs:
Cross-sectional study
Cohort study (follow-up study)
Case-control study
Clinical trial
Logistic Regression
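
Odds ratio
Y \ X | Exposed (+) | Unexposed (-) | Total
Diseased (+) | A | C | A+C
Not diseased (-) | B | D | B+D
Total | A+B | C+D | A+B+C+D
Odds are another way of expressing a probability; in betting, the odds are the payout ratio.
$\text{odds} = \frac{p(x)}{1 - p(x)}$
Odds ratio (OR):
Disease rate in the exposed group: p1 = A / (A+B)
Disease rate in the unexposed (control) group: p0 = C / (C+D)
$OR = \frac{p_1 / (1 - p_1)}{p_0 / (1 - p_0)} = \frac{[A/(A+B)] / [B/(A+B)]}{[C/(C+D)] / [D/(C+D)]} = \frac{AD}{BC}$
Question: in World Cup betting, Brazil is quoted at 1 to 1 and China at 1 to 100. What is the odds ratio of Brazil to China?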
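
The reduction of the odds ratio to AD/BC can be checked with a small example. The counts below are hypothetical, chosen only for illustration, and the function names are not from the slides.

```python
def odds(p):
    """Convert a probability into odds p / (1 - p)."""
    return p / (1 - p)


def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table laid out as on the slide:
    a = exposed & diseased,   c = unexposed & diseased,
    b = exposed & healthy,    d = unexposed & healthy."""
    p1 = a / (a + b)   # disease rate in the exposed group
    p0 = c / (c + d)   # disease rate in the unexposed group
    return odds(p1) / odds(p0)


# Hypothetical counts (illustrative only).
A, B, C, D = 30, 70, 10, 90
or_from_rates = odds_ratio(A, B, C, D)
or_cross_product = (A * D) / (B * C)
print(f"OR from rates = {or_from_rates:.4f}, AD/BC = {or_cross_product:.4f}")
```

Both routes give the same value, since the group totals A+B and C+D cancel.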

Epidemiological study designs:
Cross-sectional study
Cohort study (follow-up study)
Case-control study
Clinical trial

Bias in epidemiology
Selection bias
Information bias
Misclassification
Confounding
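
Cross-sectional study
Purpose: prevalence surveys; health administration needs
Key point: the study subjects must be representative (random sampling)
Limitation: no temporal sequence, so causality cannot be established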
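
Case-control study
Purpose: causal analysis; compare exposure rates between the case group and the control group
Key point: selection of controls; the controls must represent the exposure experience of the source population from which the cases arose
Limitations: temporality; recall bias
[Diagram: exposure (E, not E) by disease (D, not D)]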
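
Cohort study (follow-up study)
Purpose: causal analysis; compare disease incidence between the exposed and unexposed groups
Key point: follow-up
Limitation: loss to follow-up
[Diagram: exposure (E, not E) followed to disease (D)]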
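
Confounding factors
Definition of a confounder:
It is independently associated with the disease; it is itself a risk factor.
It is associated with the risk factor under study.
It must not be an intermediate (mediating) variable.
[Diagram: X1, X2, and Y, with the example Obesity, Cholesterol, and MI]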

Clinical trial
Purpose: evaluate the effect of an intervention (for example, drug treatment or health education)
Key points: randomization (controls confounding factors); placebo effect
Limitation: ethical issues

Relationships among study designs
Case-control study; matched case-control study
Cohort study; matched cohort study
Randomized clinical trial; completely matched cohort study
Causality and correlation
Y = a + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 ... (covariate, confounder)
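
Logistic regression
Simple linear regression: $E(Y \mid x) = \beta_0 + \beta_1 x \in (-\infty, \infty)$
Logistic regression: Y is a binary categorical variable, so $E(Y \mid x)$ is a probability in (0, 1).
How do we map (0, 1) to (-∞, ∞)? Use the logistic transformation.
$p(x) = E(Y \mid x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)} \in (0, 1)$
$g(x) = \ln \frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x \in (-\infty, \infty)$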
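
The two transformations are inverses of each other, which the following sketch illustrates. The function names `logistic` and `logit` and the sample values are illustrative choices.

```python
import math


def logistic(eta):
    """Map a linear predictor in (-inf, inf) to a probability in (0, 1).
    Algebraically equal to exp(eta) / (1 + exp(eta))."""
    return 1 / (1 + math.exp(-eta))


def logit(p):
    """Map a probability in (0, 1) to the log odds in (-inf, inf)."""
    return math.log(p / (1 - p))


for eta in (-4.0, -1.0, 0.0, 1.0, 4.0):
    p = logistic(eta)
    print(f"eta = {eta:5.1f} -> p = {p:.4f} -> logit(p) = {logit(p):.4f}")
```

A linear predictor of 0 corresponds to p = 0.5 (odds of 1), and large positive or negative values push p toward 1 or 0 without ever reaching them.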
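
Logistic regression: coefficients and OR
OR = exp(β)
If X is a categorical variable with three or more groups, exp(β) is the OR relative to the reference group.
If X is a continuous variable, exp(β) is the OR for each one-unit increase in X.
If the model has several X variables, the interpretation is the same, with the added condition "holding the other X variables constant".
Example: X is sex (male x = 1, female x = 0); Y is suicide (yes or no).
$g(1) = \beta_0 + \beta_1; \quad g(0) = \beta_0; \quad g(1) - g(0) = \beta_1 = \ln(OR)$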
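
A short numerical illustration of reading coefficients as odds ratios follows. The coefficient values are hypothetical (not estimates from any data in these slides), and the names are illustrative.

```python
import math

# Hypothetical fitted coefficients (illustrative only).
beta0 = -3.0
beta1 = 0.05   # change in log odds per one-unit increase in X


def g(x):
    """Log odds at x under the model g(x) = beta0 + beta1 * x."""
    return beta0 + beta1 * x


or_per_unit = math.exp(beta1)           # OR for a 1-unit increase in X
or_per_10_units = math.exp(10 * beta1)  # OR for a 10-unit increase in X

# The log-odds difference between x + 1 and x equals beta1 for any x.
diff = g(41) - g(40)
print(f"OR per unit = {or_per_unit:.4f}, OR per 10 units = {or_per_10_units:.4f}")
print(f"g(41) - g(40) = {diff:.4f}")
```

Note that odds ratios multiply rather than add: the OR for a 10-unit increase is the per-unit OR raised to the 10th power.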
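
Textbook example: LR
Men with unintentional injury (Soderstrom, 1997; Table 10-5, p. 247)
Conclusion: white patients who arrived at the emergency room on weekend nights had a higher probability of an elevated blood alcohol concentration (BAC > 50 mg/dL); age showed no statistically significant difference.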
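
Relationships among Z, t, F, and χ²: Z² and chi-square
Population mean known:
Definition: $\sum_{i=1}^{n} Z_i^2 = \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{\sigma^2} \sim \chi^2_{(n)}$
or $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}, \quad Z^2 = \frac{(\bar{x} - \mu)^2}{\sigma^2 / n} \sim \chi^2_{(1)}$
Conclusion: $Z^2 = \chi^2_{(1)}$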
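
Relationships among Z, t, F, and χ²: F and chi-square
Population mean unknown:
Definition: $\sum_{i=1}^{n} Z_i^2 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{\sigma^2} = \frac{(n - 1) s^2}{\sigma^2} \sim \chi^2_{(n-1)}$
Conclusion: $\frac{s^2}{\sigma^2} = \frac{\chi^2_{(n-1)}}{n - 1} = F_{(n-1, \infty)}$
$F_{(df_1, df_2)} = \frac{s_1^2 / \sigma_1^2}{s_2^2 / \sigma_2^2} = \frac{\chi^2_{(df_1)} / df_1}{\chi^2_{(df_2)} / df_2}$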
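
Relationships among Z, t, F, and χ²
$F_{(\alpha;\, df_1,\, df_2)}$
$F_{(\alpha;\, 1,\, df_2)} = t^2_{(1 - \alpha/2;\, df_2)}$
$F_{(\alpha;\, df_1,\, \infty)} = \chi^2_{(\alpha;\, df_1)} / df_1$
$F_{(\alpha;\, 1,\, \infty)} = z^2_{(1 - \alpha/2)}$
(F and χ² are indexed here by the upper-tail area α, as in F0.05(1,8) = 5.32; t and z are indexed by cumulative probability.)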
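
Two of these identities can be spot-checked with standard critical values. The sketch below uses `statistics.NormalDist` from the Python standard library for z; the t and χ² values are ordinary table values entered by hand (an assumption of this sketch, since the standard library has no t, F, or χ² quantile functions), and F0.05(1,8) = 5.32 is the value quoted in the example.

```python
from statistics import NormalDist

alpha = 0.05

# z at cumulative probability 1 - alpha/2, computed exactly.
z = NormalDist().inv_cdf(1 - alpha / 2)

# Table values entered by hand (assumed, not computed here).
chi2_upper_05_df1 = 3.841   # upper 5% point of chi-square with 1 df
t_975_df8 = 2.306           # t at cumulative probability 0.975 with 8 df
f_upper_05_1_8 = 5.32       # F0.05(1, 8), as quoted in the example

print(f"z^2 = {z ** 2:.4f} vs chi-square(1) critical value = {chi2_upper_05_df1}")
print(f"t^2 = {t_975_df8 ** 2:.4f} vs F(1, 8) critical value = {f_upper_05_1_8}")
```

The small differences come only from rounding in the printed tables.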