Transcript
Page 1: S TATISTICS 1 Regression & Correlation. S TATISTICS 2 Outline  X, Y & Regression Models Simple linear regression (SLR) The logic of SLR: SST=SSR+SSE

1

STATISTICS

Regression & Correlation

Page 2: Outline

X, Y & Regression Models
Simple linear regression (SLR)
The logic of SLR: SST = SSR + SSE
SLR: ANOVA table & R-square
Comparison of SLR, ANOVA, and the two-sample t test
Multiple Linear Regression
Pearson's correlation coefficient (r)
Relationships among R², r, and b
Relationships among Z, t, F, and χ²

Page 3: X and Y

X: predictor variables; predictors; covariates; explanatory variables; independent variables.

Y: outcome; response; dependent variable.

Page 4: Univariate analysis: 1 X, 1 Y

X                  | Y                  | Comparison           | Method
Binary             | Num._normal        | 2 indep. means       | Two-sample t test*
Categorical        | Num._normal        | >= 2 indep. means    | One-way ANOVA*
Binary             | Num._non-normal    | 2 indep. medians     | Wilcoxon rank sum
Categorical        | Num._non-normal    | >= 2 indep. medians  | Kruskal-Wallis
Num._normal        | Num._normal        | -                    | Regression*
-                  | Num._normal        | 2 related means      | Paired t
-                  | Num._non-normal    | 2 related medians    | Wilcoxon signed rank
Categorical        | Categorical        | X related to Y       | Pearson's Chi-sq
-                  | Categorical_Binary | 2 related prop.      | McNemar Chi-sq
Categorical_Binary | Categorical_Binary | 2 indep. prop.       | Pearson's Chi-sq
Categorical_Binary | Categorical_Binary | 2 indep. prop.       | Two-sample Z

Note: methods marked * require the following assumptions: normality, independence, ...

Abbreviations: Cat.: categorical; Num.: numerical

Page 5: Multivariate analysis: several Xs, 1 Y

Xs                  | Y               | Method
Categorical         | Cat.            | Log-linear
Cat.+Num.           | Cat. (binary)   | Logistic regression
Cat.+Num.           | Cat. (>= 3)     | Logistic regression
                    |                 | Discriminant analysis*
                    |                 | Cluster analysis
                    |                 | Propensity scores
                    |                 | CART
Cat.                | Num.            | ANOVA*
                    |                 | MANOVA*
Num.                | Num.            | Multiple regression*
Cat.+Num.           | Num. (censored) | Cox proportional hazards model
Confounding factors | Num.            | ANCOVA*
                    |                 | MANOVA*
                    |                 | GEE*
Confounding factors | Cat.            | Mantel-Haenszel
Num.                | -               | Factor analysis

Note: methods marked * require the following assumptions: multivariate normality, independence, ...

Abbreviations: Cat.: categorical; Num.: numerical; CART: classification and regression tree; ANOVA: analysis of variance; ANCOVA: analysis of covariance; MANOVA: multivariate analysis of variance; GEE: generalized estimating equations

Page 6: Regression Models

Mathematical models to describe the relationship between Y and X

Uses of a regression model:
Adjustment
Prediction
Finding important factors for Y

Page 7: Regression Models

Definition: mathematical models to describe the relationship between Y and X

Purpose: to find important factors for Y and/or to predict Y

Page 8: Simple linear regression (SLR)

Model:

$$Y = \beta_0 + \beta_1 X + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

$$E(Y \mid X) = \beta_0 + \beta_1 X$$

Page 9: SLR Example

Is there a linear relationship between age and cholesterol?

ID AGE CHOL

1 34 141.4

2 39 180.5

3 44 178.4

4 46 212

5 48 203.2

6 51 224.1

7 53 186

8 60 350

9 61 286.3

10 65 287.6

11 66 330.3

12 67 311.3

Page 10: SLR: parameter estimation

The least squares method:

$$\min_{\beta_0,\,\beta_1} \sum_{i=1}^{N} (Y_i - \beta_0 - \beta_1 X_i)^2$$

Point estimates:

$\hat{\beta}_1$: estimated slope
$\hat{\beta}_0$: estimated intercept

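Page 11: The logic of SLR: SST=SSR+SSE

[Figure: observations $Y_1$ and $Y_2$, their fitted values $\hat{Y}_1$ and $\hat{Y}_2$, the mean line $\bar{Y}$, and the fitted line $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$. At $X_1$ the deviation $Y_1 - \bar{Y}$ is split into $Y_1 - \hat{Y}_1$ and $\hat{Y}_1 - \bar{Y}$.]

$$\sum (Y - \bar{Y})^2 = \sum (Y - \hat{Y} + \hat{Y} - \bar{Y})^2 = \sum (Y - \hat{Y})^2 + \sum (\hat{Y} - \bar{Y})^2$$

SST = SSE + SSR

SST: total amount unexplained at Xi
SSE: amount at Xi unexplained by regression
SSR: amount at Xi explained by regression

The cross-product term $2\sum (Y - \hat{Y})(\hat{Y} - \bar{Y})$ is zero for the least-squares fit, which is why the decomposition is exact.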

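Page 12: SLR: parameter estimation

The least squares method: minimize SSE

$$S = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - \beta_0 - \beta_1 X_i)^2$$

Point estimates: take the partial derivatives of S with respect to the intercept and the slope, set them to zero, and solve.

$$\frac{\partial S}{\partial \beta_0} = -2 \sum (Y_i - \beta_0 - \beta_1 X_i) = 0$$

$$\frac{\partial S}{\partial \beta_1} = -2 \sum X_i (Y_i - \beta_0 - \beta_1 X_i) = 0$$

Slope:

$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$

Intercept:

$$b_0 = \bar{Y} - b_1 \bar{X}$$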
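The closed-form estimates can be checked numerically. The following is a minimal sketch in Python using the age and cholesterol values from the SLR example; the function name `least_squares` is an illustrative choice, not something from the slides. The result should agree closely with the estimated model reported for the example.

```python
# Age (X) and cholesterol (Y) for the 12 subjects in the SLR example.
age = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
chol = [141.4, 180.5, 178.4, 212, 203.2, 224.1,
        186, 350, 286.3, 287.6, 330.3, 311.3]


def least_squares(x, y):
    """Return (b0, b1) from the closed-form least-squares solution."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(age, chol)
print(f"CHOL = {b0:.4f} + {b1:.4f} * Age")
```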

Page 13: SLR example: Regression line

[Figure: scatter plot "CHOL vs Age" with the fitted regression line; x-axis: Age (30 to 70), y-axis: CHOL (100 to 350)]

Estimated model: CHOL = (-57.5964988786446) + (5.65024919013205) * (Age)

Page 14: SLR: ANOVA table & R-square

Source     | DF | SS       | MSS      | F       | p      | Power (5%)
Intercept  | 1  | 696538.3 | 696538.3 |         |        |
Slope      | 1  | 42705.43 | 42705.43 | 45.4538 | 0.0001 | 1.0000
Error      | 10 | 9395.352 | 939.5352 |         |        |
Adj. Total | 11 | 52100.78 | 4736.435 |         |        |
Total      | 12 | 748639.1 |          |         |        |

R² = SS(Slope) / SS(Adj. Total) = 42705.43 / 52100.78 = 0.82, p = 0.0001
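The sums of squares in the table can be reproduced from the raw data. This is a sketch, assuming the age and cholesterol values from the SLR example; the variable names are illustrative. SSR corresponds to the "Slope" row, SSE to the "Error" row, and SST to the "Adj. Total" row.

```python
age = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
chol = [141.4, 180.5, 178.4, 212, 203.2, 224.1,
        186, 350, 286.3, 287.6, 330.3, 311.3]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(chol) / n

# Least-squares slope and intercept.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(age, chol))
      / sum((x - x_bar) ** 2 for x in age))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in age]

sst = sum((y - y_bar) ** 2 for y in chol)             # total (adjusted)
ssr = sum((f - y_bar) ** 2 for f in fitted)           # explained by regression
sse = sum((y - f) ** 2 for y, f in zip(chol, fitted))  # unexplained

r_squared = ssr / sst
f_stat = (ssr / 1) / (sse / (n - 2))  # MSS(Slope) / MSS(Error)

print(f"SST={sst:.2f} SSR={ssr:.2f} SSE={sse:.2f}")
print(f"R2={r_squared:.2f} F={f_stat:.4f}")
```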

Page 15: SLR: qualitative covariate

Example: X = treatment (1 or 0), Y = SBP

Hypotheses: H0: β1 = 0 vs. H1: β1 ≠ 0

Comparison with the test of means: H0: μ1 = μ0 vs. H1: μ1 ≠ μ0

Note: β1 = μ1 - μ0
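The identity β1 = μ1 - μ0 can be illustrated with a small numerical check. The SBP values below are hypothetical and only serve the illustration: with X coded 0/1, the least-squares slope equals the difference of the two group means and the intercept equals the mean of the X = 0 group.

```python
# Hypothetical SBP values (illustration only, not from the slides).
sbp_control = [140.0, 138.0, 145.0, 141.0]  # X = 0
sbp_treated = [128.0, 132.0, 125.0, 130.0]  # X = 1

x = [0] * len(sbp_control) + [1] * len(sbp_treated)
y = sbp_control + sbp_treated

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

mean_control = sum(sbp_control) / len(sbp_control)
mean_treated = sum(sbp_treated) / len(sbp_treated)

print("intercept:", b0, "mean of X=0 group:", mean_control)
print("slope:", b1, "difference of means:", mean_treated - mean_control)
```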

Page 16: Comparison of SLR, ANOVA, and the two-sample t test

Two-sample t → ANOVA; two-sample t → SLR: H0: μ1 = μ0 → H0: β1 = 0

Group label   | Dummy coding
ID | Y   | X  | ID | Y   | X
1  | 140 | A  | 1  | 140 | 0
2  | 135 | B  | 2  | 135 | 1

Dummy variables: K groups require K-1 dummy variables.

ANOVA → SLR: H0: μ1 = μ2 = μ3 → H0: β1 = β2 = 0

Group label   | Dummy coding
ID | Y   | X  | ID | Y   | X1 | X2
1  | 140 | A  | 1  | 140 | 0  | 0
2  | 135 | B  | 2  | 135 | 0  | 1
3  | 130 | C  | 3  | 130 | 1  | 0
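The K-1 dummy coding can be sketched as a small helper. The function name `dummy_code` and its arguments are hypothetical; the reference group is whichever level is left out of `indicator_levels`. With X1 indicating group C and X2 indicating group B, the output matches the three-group coding shown above.

```python
def dummy_code(groups, indicator_levels):
    """Return one row of 0/1 indicators per observation.

    indicator_levels lists the K-1 non-reference levels, one per dummy
    variable; the level left out acts as the reference group.
    """
    return [[1 if g == level else 0 for level in indicator_levels]
            for g in groups]


groups = ["A", "B", "C"]
# X1 indicates group C, X2 indicates group B; A is the reference group.
rows = dummy_code(groups, indicator_levels=["C", "B"])
for g, row in zip(groups, rows):
    print(g, row)
```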

Page 17: Multiple Linear Regression

Model:

$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon$$

$$E(Y) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$$

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_p X_p$$

Example: Is Age a predictor for SBP adjusting for Sex?

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \, AGE + \hat{\beta}_2 \, SEX$$

Page 18: MLR: example

[Figure: SBP (y-axis) against Age (x-axis) with two parallel fitted lines, one for females and one for males]

$$\hat{Y} = \hat{\beta}_0^{*} + \hat{\beta}_1 \, AGE$$

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \, AGE$$

Vertical distance between the two lines: $\hat{\beta}_0^{*} - \hat{\beta}_0$

Page 19: Pearson's correlation coefficient (r)

Relationship between X and Y

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$$

Properties of Pearson's r:
Range: $-1 \le r \le 1$
Unitless
Good for normally distributed X and Y
The correlation coefficient r can be viewed as the cosine of the angle between two vectors (the mean-centered X and Y) in multidimensional space.

Spearman's correlation coefficient:
Pearson's r for ranked X and Y
Good for non-normally distributed X and Y
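The cosine interpretation can be verified numerically. The sketch below, using the age and cholesterol values from the SLR example, computes r once from the sample covariance and standard deviations and once as the cosine of the angle between the mean-centered vectors; the helper names are illustrative.

```python
import math
from statistics import mean, stdev

age = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
chol = [141.4, 180.5, 178.4, 212, 203.2, 224.1,
        186, 350, 286.3, 287.6, 330.3, 311.3]


def pearson_r(x, y):
    """r as sample covariance divided by the product of sample SDs."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    return cov / (stdev(x) * stdev(y))


def cosine_of_centered(x, y):
    """Cosine of the angle between the mean-centered vectors."""
    cx = [a - mean(x) for a in x]
    cy = [b - mean(y) for b in y]
    dot = sum(a * b for a, b in zip(cx, cy))
    norm_x = math.sqrt(sum(a * a for a in cx))
    norm_y = math.sqrt(sum(b * b for b in cy))
    return dot / (norm_x * norm_y)


r = pearson_r(age, chol)
cos_theta = cosine_of_centered(age, chol)
print(r, cos_theta)
```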

Page 20: Spearman's Rho: rank correlation

Relationship between X and Y

Spearman's correlation coefficient:
Pearson's r for ranked X and Y
Good for non-normally distributed X and Y

$$r_s = \frac{\sum (R_X - \bar{R}_X)(R_Y - \bar{R}_Y)}{\sqrt{\sum (R_X - \bar{R}_X)^2 \sum (R_Y - \bar{R}_Y)^2}}$$

$$t = r_s \sqrt{\frac{n - 2}{1 - r_s^2}}$$
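A minimal sketch of the rank correlation, assuming the age and cholesterol values from the SLR example: ranks are assigned (ties would receive average ranks), Pearson's formula is applied to the ranks, and the t statistic is formed with n - 2 degrees of freedom. The function names are illustrative.

```python
import math

age = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
chol = [141.4, 180.5, 178.4, 212, 203.2, 224.1,
        186, 350, 286.3, 287.6, 330.3, 311.3]


def average_ranks(values):
    """Rank from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    num = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    den = math.sqrt(sum((a - x_bar) ** 2 for a in x)
                    * sum((b - y_bar) ** 2 for b in y))
    return num / den


n = len(age)
r_s = pearson(average_ranks(age), average_ranks(chol))
t_stat = r_s * math.sqrt((n - 2) / (1 - r_s ** 2))
print(f"r_s = {r_s:.4f}, t = {t_stat:.3f} on {n - 2} df")
```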

Page 21: Assumptions in Regression

Linear
Independent
Normal distribution
Equal variance

Note: for all values of x, the errors ε are independent and normally distributed, with mean μ = 0 and the same SD σ = σ(ε).

[Figure: two panels plotting Weight (y-axis) against Height (x-axis) with the line y = α + βx]

$Y_i = \alpha + \beta X_i + \varepsilon_i$, where α and β are the unknown parameters and $\varepsilon_i$ is the random error.

Page 22: Relationships among R², r, and b

r and b:

$$b = r \cdot \frac{SD_Y}{SD_X}$$

r²: coefficient of determination, the proportion of the variability among the observed values of Y that is explained by the linear regression of Y on X.

$$r^2 = \frac{SSR}{\sum (Y_i - \bar{Y})^2} = \frac{b^2 \sum (X_i - \bar{X})^2}{\sum (Y_i - \bar{Y})^2} = 1 - \frac{SSE}{\sum (Y_i - \bar{Y})^2}$$

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$$

$$b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$
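These identities can be checked on the age and cholesterol data from the SLR example. The sketch below is only a numerical illustration with illustrative variable names: it compares the slope with r · SD_Y / SD_X and compares r² with SSR / SST.

```python
import math
from statistics import stdev

age = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
chol = [141.4, 180.5, 178.4, 212, 203.2, 224.1,
        186, 350, 286.3, 287.6, 330.3, 311.3]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(chol) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, chol))
sxx = sum((x - x_bar) ** 2 for x in age)
syy = sum((y - y_bar) ** 2 for y in chol)  # this is SST

b = sxy / sxx                  # slope
r = sxy / math.sqrt(sxx * syy)  # Pearson's r
ssr = b ** 2 * sxx             # explained sum of squares

print("b            :", b)
print("r * SDy / SDx:", r * stdev(chol) / stdev(age))
print("r^2          :", r ** 2)
print("SSR / SST    :", ssr / syy)
```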

Page 23: Relationship between r and b: same sign

[Figure: two panels, one with large r and small b, the other with small r and large b]

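Page 24: Standard deviations around the regression line, part 1

Names:
(1) Standard error of estimate (SE of estimate)
(2) Standard error of the regression line (SE of RL), the SD of the sampling distribution of Ŷ
(3) Standard error of prediction (SE of prediction)

Terms used by 楊志良 for the same three quantities: (1) "standard deviation of the regression line", (2) "standard error of the regression line", (3) "standard error of estimate"**
** This term is easily confused.

Meaning:
(1) The variability of the vertical distances between any observation Y and the regression line, that is, the SD computed with the regression line in place of the mean.
(2) The standard error of Y computed for the same X value over repeated samples, that is, the SD of the second-level normal distribution of Ŷ; used for the CI of a single E(Y).
(3) The standard error of predicting Y from a single X, that is, the SD of the first-level normal distribution of Y at a given X value.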

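Page 25: Standard deviations around the regression line, part 2

(1) The standard error of the estimate ($S^2_{Y \cdot X}$ estimates $\sigma^2 = V(Y)$ at a given X):

$$S^2_{Y \cdot X} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2} = \frac{\sum (Y_i - \bar{Y})^2 - b_1^2 \sum (X_i - \bar{X})^2}{n - 2} = \frac{(1 - r^2) \sum (Y_i - \bar{Y})^2}{n - 2}$$

(2) SE of RL:

$$S^2_{\hat{Y}} = V(\hat{Y}) = V(b_0 + b_1 X) = V[\bar{Y} + b_1 (X - \bar{X})] = V(\bar{Y}) + (X - \bar{X})^2 V(b_1) + 2 (X - \bar{X}) \, COV(\bar{Y}, b_1)$$

$$= \frac{\sigma^2}{n} + \frac{(X - \bar{X})^2 \sigma^2}{\sum (X_i - \bar{X})^2} \quad \text{(from Note (a); } COV(\bar{Y}, b_1) = 0\text{)}$$

(3) SE of prediction:

$$S^2_{Y - \hat{Y}} = V(Y - \hat{Y}) = V(Y) + V(\hat{Y}) - 2 \, COV(Y, \hat{Y}) = \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right] \quad \text{(from (2) above)}$$

In (3), Y is a new observation at X, independent of the fitted Ŷ, so the covariance term is zero. In practice $\sigma^2$ is replaced by its estimate $S^2_{Y \cdot X}$.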
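The three standard errors can be computed side by side. The sketch below uses the age and cholesterol data from the SLR example and plugs the estimate of σ² (the error mean square) into the formulas; the evaluation point x0 = 50 and the function names are arbitrary, illustrative choices.

```python
import math

age = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
chol = [141.4, 180.5, 178.4, 212, 203.2, 224.1,
        186, 350, 286.3, 287.6, 330.3, 311.3]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(chol) / n
sxx = sum((x - x_bar) ** 2 for x in age)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, chol)) / sxx
b0 = y_bar - b1 * x_bar

# (1) SE of estimate: square root of the error mean square.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(age, chol))
s2 = sse / (n - 2)
se_estimate = math.sqrt(s2)


def se_regression_line(x0):
    """(2) SE of the fitted mean response at x0."""
    return math.sqrt(s2 * (1 / n + (x0 - x_bar) ** 2 / sxx))


def se_prediction(x0):
    """(3) SE for predicting a single new Y at x0."""
    return math.sqrt(s2 * (1 + 1 / n + (x0 - x_bar) ** 2 / sxx))


x0 = 50  # arbitrary illustration point
print(f"SE of estimate   : {se_estimate:.3f}")
print(f"SE of RL at {x0}   : {se_regression_line(x0):.3f}")
print(f"SE of pred at {x0} : {se_prediction(x0):.3f}")
```

The prediction variance is always the line variance plus s², so the interval for a single new Y is wider than the interval for the mean response at the same X.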

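Page 26: Standard deviations around the regression line, part 3

Note (a): variance of $b_1$

$$V(b_1) = V\left[ \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} \right] = V\left[ \frac{\sum (X_i - \bar{X}) Y_i}{\sum (X_i - \bar{X})^2} \right] = \frac{\sum (X_i - \bar{X})^2}{\left[ \sum (X_i - \bar{X})^2 \right]^2} V(Y)$$

$$V(b_1) = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}$$

Note (b): variance of $b_0$

$$V(b_0) = V(\bar{Y} - b_1 \bar{X}) = V(\bar{Y}) + \bar{X}^2 V(b_1) - 2 \bar{X} \, COV(\bar{Y}, b_1)$$

$$= \frac{\sigma^2}{n} + \frac{\bar{X}^2 \sigma^2}{\sum (X_i - \bar{X})^2} \quad \text{(from Note (a))}$$

$$= \sigma^2 \left( \frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \right)$$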

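Page 27: Exercise

Blood cholesterol was measured in 10 men aged 30-39 at baseline (X) and again 10 years later (Y). The two measurements are compared below (source: 彭游, 生物統計學 (Biostatistics), ROC year 89, p. 374). Questions:

What is the regression coefficient? What is the intercept?
What is the correlation coefficient r?
Is the correlation coefficient statistically significant? Given: F0.05(1,8) = 5.32.
What proportion of the variation in cholesterol 10 years later is attributable to the variation in cholesterol 10 years earlier?
Is the sample regression coefficient statistically significant?
A man's current cholesterol is 350. Predict his cholesterol 10 years from now and its 95% CI.
A group of men has a mean cholesterol of 350. What is their expected cholesterol 10 years from now, and its 95% CI?

Partial solution: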

Page 28: Exercise: partial solution (continued)

Page 29: Topic: predicting a nominal or categorical outcome Y

Examples of Y: disease present or absent; died or survived
Odds ratio

Study designs:
Cross-sectional study
Cohort study (follow-up study)
Case-control study
Clinical trial

Logistic Regression

Page 30: Odds ratio

                    | X: Exposed (+) | X: Unexposed (-) | Total
Y: Diseased (+)     | A              | C                | A+C
Y: Not diseased (-) | B              | D                | B+D
Total               | A+B            | C+D              | A+B+C+D

Odds are another way of expressing a probability; they are the same idea as betting odds.

$$odds = \frac{p(x)}{1 - p(x)}$$

Odds ratio (OR):
Disease rate in the exposed group: p1 = A / (A+B)
Disease rate in the control (unexposed) group: p0 = C / (C+D)

$$OR = \frac{p_1 / (1 - p_1)}{p_0 / (1 - p_0)} = \frac{\dfrac{A/(A+B)}{B/(A+B)}}{\dfrac{C/(C+D)}{D/(C+D)}} = \frac{AD}{BC}$$

Exercise: in the World Cup betting market, Brazil pays 1 to 1 and China pays 100 to 1. What is the odds ratio of Brazil versus China?
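The equivalence between the ratio of odds and the cross-product AD/BC can be checked with a few lines of Python. The cell counts below are hypothetical and chosen only for illustration.

```python
# Hypothetical 2x2 cell counts (illustration only, not from the slides).
A, B = 30, 70   # exposed: diseased, not diseased
C, D = 10, 90   # unexposed: diseased, not diseased

p1 = A / (A + B)  # disease rate in the exposed group
p0 = C / (C + D)  # disease rate in the unexposed group

odds_exposed = p1 / (1 - p1)
odds_unexposed = p0 / (1 - p0)

or_from_odds = odds_exposed / odds_unexposed
or_cross_product = (A * D) / (B * C)

print(or_from_odds, or_cross_product)
```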

Page 31: Epidemiologic study designs

Cross-sectional study
Cohort study (follow-up study)
Case-control study
Clinical trial

Page 32: Bias in epidemiology

Selection bias
Information bias
Misclassification
Confounding

Page 33: Cross-sectional study

Purpose: prevalence surveys; health administration needs

Key point: the study subjects must be representative (random sampling)

Limitation: no temporal sequence, so causality cannot be established

Page 34: Case-control study

Purpose: causal analysis; compare exposure rates between the case group and the control group

Key point: selection of controls; the controls must represent the exposure experience of the source population from which the cases arose

Limitations: temporal sequence; recall bias

[Diagram: exposure (E) and disease (D) groups]

Page 35: Cohort study (follow-up study)

Purpose: causal analysis; compare disease incidence between the exposed and unexposed groups

Key point: follow-up

Limitation: loss to follow-up

[Diagram: exposure (E) groups and disease (D)]

Page 36: Confounding factors

Definition of a confounding factor:
It is associated with the disease on its own, that is, it is itself a risk factor
It is associated with the risk factor under study
It must not be an intermediate variable

[Diagram with nodes X1, X2, Y; labels: Obesity, Cholesterol, MI]

Page 37: Clinical trial

Purpose: evaluate the effect of an intervention; interventions include drug treatment and health education

Key points: randomization (controls confounding factors); placebo effect

Limitation: ethical issues

Page 38: Relationships among study designs

Case-control study; matched case-control study
Cohort study; matched cohort study
Randomized clinical trial; complete matched cohort study

Causality and correlation: Y = a + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 + ... (covariate, confounder)

Page 39: Logistic regression

Simple linear regression:

$$E(Y \mid x) = \beta_0 + \beta_1 x \in (-\infty, \infty)$$

Logistic regression: Y is a binary categorical variable. How can Y be mapped from (0, 1) to (-∞, ∞)? Use the logistic transformation.

$$p(x) = E(Y \mid x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)} \in (0, 1)$$

$$g(x) = \ln \frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x \in (-\infty, \infty)$$
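The two transformations are inverses of each other, which a short sketch can confirm. The coefficient values below are hypothetical, chosen only to show that p(x) stays inside (0, 1) while the logit g(x) recovers the linear predictor.

```python
import math


def logistic(beta0, beta1, x):
    """p(x) = exp(b0 + b1 x) / (1 + exp(b0 + b1 x)), always in (0, 1)."""
    eta = beta0 + beta1 * x
    return math.exp(eta) / (1 + math.exp(eta))


def logit(p):
    """g = ln(p / (1 - p)), maps (0, 1) onto the whole real line."""
    return math.log(p / (1 - p))


# Hypothetical coefficients (illustration only).
beta0, beta1 = -3.0, 0.5

for x in [-10, 0, 6, 20]:
    p = logistic(beta0, beta1, x)
    print(f"x={x:>4}  p(x)={p:.6f}  g(x)={logit(p):.4f}")
```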

Page 40: Logistic regression coefficients and the OR

OR = exp(β)
If X is a categorical variable with three or more groups, the OR compares each group with the reference group.
If X is a continuous variable, exp(β) is the OR for each one-unit increase in X.
If the model has several X variables, the interpretation is the same, with the added condition "holding the other X variables constant".

Example: X is sex (male: x = 1, female: x = 0); Y is suicide (yes or no).

$$g(1) = \beta_0 + \beta_1; \quad g(0) = \beta_0; \quad g(1) - g(0) = \beta_1 = \ln(OR)$$
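The relation g(1) - g(0) = ln(OR) can be illustrated numerically. The coefficients below are hypothetical: the sketch converts them to probabilities for x = 1 and x = 0, forms the odds ratio from those probabilities, and compares it with exp(β1).

```python
import math

# Hypothetical logistic regression coefficients (illustration only).
beta0, beta1 = -2.0, 0.7


def prob(x):
    """p(x) from the logistic model with the coefficients above."""
    eta = beta0 + beta1 * x
    return math.exp(eta) / (1 + math.exp(eta))


p_male = prob(1)    # x = 1
p_female = prob(0)  # x = 0

odds_ratio = (p_male / (1 - p_male)) / (p_female / (1 - p_female))

print("OR from probabilities:", odds_ratio)
print("exp(beta1)           :", math.exp(beta1))
```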

Page 41: Textbook example: LR

Men with unintentional injury (Soderstrom, 1997; Table 10-5, p. 247)

Conclusion: white patients who arrived at the emergency room on weekend nights had a higher probability of an elevated blood alcohol concentration (BAC > 50 mg/dL); age showed no statistically significant difference.

Page 42: Relationships among Z, t, F, and χ²: Z² and chi-square

Population mean known:

Definition:

$$\chi^2_{(n)} = \sum_{i=1}^{n} Z_i^2 = \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{\sigma^2}$$

or

$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}, \qquad Z^2 = \frac{(\bar{x} - \mu)^2}{\sigma^2 / n}$$

Conclusion:

$$Z^2 = \chi^2_{(1)}$$

Page 43: Relationships among Z, t, F, and χ²: F and chi-square

Population mean unknown:

Definition:

$$\sum_{i=1}^{n} Z_i^2 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{\sigma^2} = \frac{(n - 1) s^2}{\sigma^2} \sim \chi^2_{(n-1)}$$

Conclusion:

$$\frac{s^2}{\sigma^2} = \frac{\chi^2_{(n-1)}}{n - 1} = F_{(n-1,\,\infty)}$$

$$F_{(df_1,\,df_2)} = \frac{s_1^2}{s_2^2}, \qquad F_{(df_1,\,\infty)} = \frac{s_1^2}{\sigma^2}$$

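Page 44: Relationships among Z, t, F, and χ²

Starting from $F_{(\alpha,\,df_1,\,df_2)}$:

$$F_{(\alpha,\,1,\,df_2)} = t^2_{(1-\alpha/2,\,df_2)}$$

$$F_{(\alpha,\,df_1,\,\infty)} = \frac{\chi^2_{(\alpha,\,df_1)}}{df_1}$$

$$F_{(\alpha,\,1,\,\infty)} = z^2_{(1-\alpha/2)}$$

Here the subscript α on F and χ² denotes the upper-tail area, while 1 - α/2 on t and z denotes the cumulative probability.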
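Two of these relations can be spot-checked with the Python standard library. This is a sketch under stated assumptions: the normal quantile comes from `statistics.NormalDist`, while the value t(0.975; 8) = 2.306 is taken from a standard t table rather than from the slides. Its square should be close to F0.05(1,8) = 5.32 quoted in the exercise, and z² should be close to the tabulated chi-square critical value with 1 df (3.841).

```python
from statistics import NormalDist

alpha = 0.05

# z quantile at cumulative probability 1 - alpha/2.
z = NormalDist().inv_cdf(1 - alpha / 2)
z_squared = z ** 2  # compare with chi-square(0.05; 1) = F(0.05; 1, inf)

# t quantile with 8 df from a standard t table (assumed, not computed here).
t_0975_df8 = 2.306
t_squared = t_0975_df8 ** 2  # compare with F(0.05; 1, 8) = 5.32

print(f"z = {z:.4f}, z^2 = {z_squared:.4f}")
print(f"t = {t_0975_df8}, t^2 = {t_squared:.4f}")
```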

