Chapter 10. The Chi-Square (χ²) Distribution and the Analysis of Frequencies

2014/5/22

10.2 Mathematical Properties of the χ² Distribution

• Uses of χ²: tests of goodness of fit, tests of independence, tests of homogeneity

• Definition of χ²

$$Z_1, \dots, Z_n \ \text{independent} \sim N(0,1) \quad\Rightarrow\quad \chi^2 = \sum_{i=1}^{n} Z_i^2 \sim \chi^2_n$$

e.g. $Z_i = \dfrac{Y_i - \mu}{\sigma}$, where $Y_i \sim N(\mu, \sigma^2)$
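The definition above can be checked numerically. The following is a minimal simulation sketch, not part of the original notes: the seed, the choice n = 5, and the number of replicates are arbitrary assumptions. It relies on the standard facts that a χ² variable with n degrees of freedom has mean n and variance 2n.

```r
# Simulate chi-square variates as sums of squared standard normals.
set.seed(2014)                      # arbitrary seed, for reproducibility only
n_df   <- 5                         # n in the definition (hypothetical choice)
n_sims <- 10000                     # number of simulated chi-square values

z  <- matrix(rnorm(n_df * n_sims), ncol = n_df)  # each row holds Z_1, ..., Z_n
x2 <- rowSums(z^2)                                # each row sum is one chi-square draw

mean(x2)   # should be close to n_df
var(x2)    # should be close to 2 * n_df

# Compare an empirical quantile with the theoretical one
quantile(x2, 0.95)
qchisq(0.95, df = n_df)
```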

10.3 Goodness-of-Fit Tests

• Do our data agree with a hypothesized theoretical distribution (normal, binomial, Poisson, etc.)?

• Example 10.3.1 (normal distribution)

[Table: inpatient occupancy ratio vs. number of hospitals]

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{k-r}$$

$O_i$: observed frequency, $E_i$: expected frequency

$r$: number of restrictions ($\sum O_i = \sum E_i$) + number of parameters estimated
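As a rough illustration of the statistic above, the sketch below computes it by hand for made-up counts and compares it with R's built-in `chisq.test`. The counts and the equal-probability null are hypothetical, not data from these notes.

```r
# Hypothetical observed counts in k = 4 categories (not from the notes)
observed <- c(18, 22, 30, 30)
p_null   <- c(0.25, 0.25, 0.25, 0.25)   # hypothesized category probabilities

expected <- sum(observed) * p_null       # E_i = total * p_i, so sum(O) == sum(E)
x2 <- sum((observed - expected)^2 / expected)

r  <- 1                                  # one restriction, no parameters estimated
df <- length(observed) - r
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

x2; df; p_value

# Cross-check against R's built-in test (which always uses df = k - 1)
chisq.test(observed, p = p_null)$statistic
```

When parameters are estimated from the data, `chisq.test` still reports k - 1 degrees of freedom, so the p-value has to be recomputed with `pchisq` using k - r.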

• Result of Example 10.3.1

[Table: interval, expected relative frequency, expected frequency $E_i$, observed frequency $O_i$]

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = 25.854 > \chi^2_{9-3}(0.005) = 18.548$$

• Reject $H_0$: the data are not normally distributed.

• The chi-square approximation is good when the expected frequencies $E_i$ are large enough (> 10). When some $E_i$ is below 5, merge cells and re-categorize so that each expected frequency exceeds 10.
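The critical value quoted for Example 10.3.1 can be reproduced with `qchisq`. This is only a sketch: reading r = 3 as one restriction plus two estimated parameters (mean and standard deviation) is an assumption based on the subscript 9 - 3.

```r
k <- 9          # number of intervals, from the subscript 9 - 3 in the notes
r <- 3          # 1 restriction + 2 estimated parameters (assumed: mean and sd)
x2_obs <- 25.854

crit <- qchisq(1 - 0.005, df = k - r)   # upper 0.005 critical value
p_value <- pchisq(x2_obs, df = k - r, lower.tail = FALSE)

crit            # about 18.548
x2_obs > crit   # TRUE, so H0 (normality) is rejected at the 0.005 level
p_value
```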

Example 10.3.2 Binomial distribution

Under the binomial assumption, expected frequency = expected relative frequency × total.

$$p(x) = \binom{25}{x} (0.2)^x (1 - 0.2)^{25 - x}, \quad x = 0, 1, 2, \dots, 25$$

$$p = 0.2 \approx 500/2500$$

$$\chi^2 = \frac{(11 - 2.74)^2}{2.74} + \cdots + \frac{(0 - 1.73)^2}{1.73} = 47.624, \quad \text{referred to } \chi^2_{10-2}$$

The result is significant, so we reject $H_0$ (the data follow a binomial distribution).
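The expected frequencies in this example can be sketched in R as follows. Two things here are assumptions rather than statements from the notes: the number of samples (taken as 2500 / 25 = 100) and the way the categories are pooled into k = 10 cells.

```r
n_trials <- 25
p_hat    <- 500 / 2500          # = 0.2, as in the notes
n_samp   <- 2500 / n_trials     # assumed number of samples (100); not stated explicitly

# Pooled categories assumed here: x <= 1, x = 2, ..., x = 9, x >= 10  (k = 10)
probs <- c(pbinom(1, n_trials, p_hat),
           dbinom(2:9, n_trials, p_hat),
           pbinom(9, n_trials, p_hat, lower.tail = FALSE))
expected <- n_samp * probs

round(expected, 2)              # first entry can be compared with 2.74, last with 1.73
sum(probs)                      # the pooled probabilities sum to 1

# Critical value for df = 10 - 2 at the 0.05 level (level assumed for illustration)
qchisq(0.95, df = 10 - 2)
```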

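Example 10.3.3 Poisson distribution

Expected relative frequency under the Poisson assumption:

$$p(x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \dots, \qquad \lambda = 3 \text{ (known)}$$

$$\chi^2 = \frac{(5 - 4.50)^2}{4.50} + \cdots + \frac{(2 - 1.08)^2}{1.08} = 3.664 < 15.507 = \chi^2_{9-1}(0.05)$$

The statistic is below the critical value, so $H_0$ (the data follow a Poisson distribution) is not rejected.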
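A short sketch of the Poisson calculation: it evaluates the relative frequencies for λ = 3 and the critical value for 9 - 1 degrees of freedom. The particular pooling into cells 0 to 7 plus a tail is an assumption chosen to give k = 9; the notes do not show the categories.

```r
lambda <- 3                      # known, as stated in the notes

# Relative frequencies for x = 0..7 plus a pooled tail x >= 8 (k = 9 cells assumed)
probs <- c(dpois(0:7, lambda), ppois(7, lambda, lower.tail = FALSE))
round(probs, 4)

# Check dpois against the formula exp(-lambda) * lambda^x / x!
x <- 0:7
all.equal(dpois(x, lambda), exp(-lambda) * lambda^x / factorial(x))

# lambda is known, so only the one restriction costs a degree of freedom
df   <- length(probs) - 1
crit <- qchisq(0.95, df = df)
3.664 < crit                     # TRUE: do not reject H0
```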

Homework

• Exercise 10.3.5

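10.4 Tests of Independence

• Contingency table

[Table: progress of disease (none, some, serious)]

• Example 10.4.1 Blood type

$H_0$: the two variables are independent.

Under $H_0$, for rows $i = 1, 2, 3$ and columns $j = 1, 2, 3, 4$:

$$P_{ij} = P_{i\cdot} \times P_{\cdot j}$$

$$E_{ij} = P_{ij} \times N_{\cdot\cdot} = \frac{N_{i\cdot}}{N_{\cdot\cdot}} \times \frac{N_{\cdot j}}{N_{\cdot\cdot}} \times N_{\cdot\cdot} = \frac{N_{i\cdot} N_{\cdot j}}{N_{\cdot\cdot}}$$

$$\chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)}, \quad r = \text{number of rows},\ c = \text{number of columns}$$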
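The expected counts and the statistic for an independence test can be sketched as below. The 3 × 4 table is hypothetical (the original data table did not survive), and the cross-check against `chisq.test` is only there to confirm that the hand computation matches the built-in one.

```r
# Hypothetical 3 x 4 table of counts (rows i = 1..3, columns j = 1..4); not the data from the notes
O <- matrix(c(20, 15, 10,  5,
              18, 22, 14,  6,
              12, 13, 16,  9), nrow = 3, byrow = TRUE)

N <- sum(O)
E <- outer(rowSums(O), colSums(O)) / N        # E_ij = N_i. * N_.j / N..

x2 <- sum((O - E)^2 / E)
df <- (nrow(O) - 1) * (ncol(O) - 1)
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

# Cross-check with the built-in test
fit <- chisq.test(O)
all.equal(unname(fit$expected), E)
all.equal(unname(fit$statistic), x2)
```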

• Small expected frequencies: they are not a problem as long as fewer than 20% of the cells have an expected frequency below 5 and the minimum expected frequency is at least 1.

• 2×2 tables: do not use the χ² test if n < 20, or if 20 < n < 49 and one or more cells have an expected frequency below 5.

• Yates's continuity correction: be sure to read about it.
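The rule of thumb for small expected frequencies can be turned into a quick check. The helper `expected_counts_ok` below is hypothetical (it is not part of the notes or of base R), and it reads "fewer than 20%" loosely as "at most 20%".

```r
# Hypothetical helper (not from the notes) that applies the rule of thumb above
expected_counts_ok <- function(tab) {
  E <- outer(rowSums(tab), colSums(tab)) / sum(tab)   # expected counts under independence
  list(expected      = E,
       share_below_5 = mean(E < 5),                   # fraction of cells with E < 5
       min_expected  = min(E),
       ok            = mean(E < 5) <= 0.20 && min(E) >= 1)
}

# 2 x 2 counts taken from the 'severe' SAS example later in these notes
severe <- matrix(c(10, 2,
                    2, 4), nrow = 2, byrow = TRUE)
res <- expected_counts_ok(severe)
res$expected        # 8, 4, 4, 2
res$share_below_5   # 0.75
res$ok              # FALSE
```

For this table 75% of the cells have an expected count below 5, which agrees with the warning printed in the SAS output for the same data.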

Homework

• Exercise 10.4.6

10.5 Tests of Homogeneity

• Independence test vs. homogeneity test

• Homogeneity test: samples are drawn independently from each of several populations. Do the samples have the same (homogeneous) distribution?

• Independence test: a single sample is drawn from one population. The row and column totals are not fixed in advance; they are determined by chance.

• Example 10.5.1

• $H_0$: the distribution of drug usage is the same (homogeneous) across the 4 groups (freshman, sophomore, junior, senior).

[Table: class (freshman, sophomore, junior, senior) by drug usage (experimental, casual, moderate to heavy)]

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = 19.4 > \chi^2_{(4-1)(3-1)} = 12.592 \quad \therefore \text{Reject } H_0$$
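Computationally, a homogeneity test is run the same way as an independence test. The sketch below uses a made-up 4 × 3 table, since the counts for Example 10.5.1 are not recoverable here; only the degrees of freedom, the statistic 19.4, and the critical value 12.592 come from the notes. Treating 12.592 as the 0.05-level critical value is an assumption that the `qchisq` call is meant to check.

```r
# Hypothetical 4 x 3 counts (rows: classes, columns: usage levels); not the data from the notes
usage <- matrix(c(30, 20, 10,
                  25, 25, 10,
                  20, 25, 15,
                  15, 25, 20), nrow = 4, byrow = TRUE,
                dimnames = list(class = c("Freshman", "Sophomore", "Junior", "Senior"),
                                usage = c("experimental", "casual", "moderate to heavy")))

fit <- chisq.test(usage)
fit$parameter                 # df = (4 - 1) * (3 - 1) = 6
crit <- qchisq(0.95, df = 6)  # about 12.592, the critical value quoted in the notes
fit$statistic > crit          # reject H0 of homogeneity when TRUE

# The statistic reported in the notes, 19.4, exceeds the critical value
19.4 > crit
```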

• 2×2 table

① χ² test

$$\chi^2 = \frac{n(ad - bc)^2}{(a+c)(b+d)(a+b)(c+d)} = \frac{220\,(60 \cdot 72 - 40 \cdot 48)^2}{108 \cdot 112 \cdot 100 \cdot 120} = 8.7302 > 3.841 \quad \therefore \text{Reject } H_0$$

∴ The probabilities of having the disease in the two groups are significantly different.

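② Comparing two probabilities

$$H_0: p_1 = p_2 \qquad H_a: p_1 \neq p_2$$

$$Z = \frac{(\hat p_1 - \hat p_2) - (p_1 - p_2)}{\sqrt{\dfrac{\hat p (1 - \hat p)}{n_1} + \dfrac{\hat p (1 - \hat p)}{n_2}}}$$

e.g. $n_1 = 100$, $\hat p_1 = 0.60$, $n_2 = 120$, $\hat p_2 = 0.40$:

$$\hat p = \frac{0.60 \times 100 + 0.40 \times 120}{100 + 120} = 0.4909$$

$$Z = \frac{0.60 - 0.40}{\sqrt{\dfrac{0.4909 \times 0.5091}{100} + \dfrac{0.4909 \times 0.5091}{120}}} = 2.95469 > 1.96 \quad \therefore \text{significant}$$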
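The two approaches for a 2×2 table give the same answer, since Z² equals the χ² statistic. The sketch below recomputes both from the counts implied by the example (60 of 100 and 48 of 120); the variable names are illustrative only.

```r
# 2 x 2 counts implied by the example: 60 of 100 and 48 of 120
a <- 60; b <- 40     # group 1: with / without the disease
c2 <- 48; d <- 72    # group 2 (named c2 to avoid masking base::c)
n <- a + b + c2 + d

# (1) shortcut chi-square formula for a 2 x 2 table
x2 <- n * (a * d - b * c2)^2 / ((a + c2) * (b + d) * (a + b) * (c2 + d))

# (2) two-sample z statistic with the pooled proportion
n1 <- a + b; n2 <- c2 + d
p1 <- a / n1; p2 <- c2 / n2
p_pool <- (a + c2) / n
z <- (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

x2        # about 8.7302
z         # about 2.9547
z^2       # equals x2: the two tests are equivalent for a 2 x 2 table

# Same statistic from the built-in test without continuity correction
chisq.test(matrix(c(a, b, c2, d), nrow = 2, byrow = TRUE), correct = FALSE)$statistic
```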

Homework

• Exercise 10.5.4

• debatea.sas

* File : debatea.sas ;
options ls=70 ps=55 nodate nonumber ;

data one;
  input id school gender compare
        argue research reason speak ;
  if school in (3,5,6,8) ;   * keep only schools 3, 5, 6 and 8 ;
  label id='Survey Number'
        school='High School'
        compare='How Debate Compares to Other Classes'
        argue='Argumentation'
        research='Research'
        reason='Reasoning'
        speak='Speaking' ;
cards;
1 6 1 1 1 1 1 1
108 7 1 1 1 1 1 2
56 3 1 1 1 1 1 1
... (rows omitted)
70 6 1 1 1 1 1 1
69 6 2 1 1 1 1 1
;
run;

* Pearson chi-square test with expected cell counts ;
proc freq data=one;
  tables school*compare / chisq expected ;
  title 'Comparing Schools in the Debate Survey';
run;

* Exact test for the same table ;
proc freq data=one;
  tables school*compare / exact ;
  title 'Comparing Schools in the Debate Survey';
run;

data respire;
  input treat $ outcome $ count ;
cards;
test f 40
test u 20
placebo f 16
placebo u 48
;
proc freq;
  weight count;
  tables treat*outcome / chisq;
run;

The SAS System
The FREQ Procedure

Table of treat by outcome

treat      outcome
Frequency |
Percent   |
Row Pct   |
Col Pct   |f       |u       |  Total
----------+--------+--------+
placebo   |     16 |     48 |     64
          |  12.90 |  38.71 |  51.61
          |  25.00 |  75.00 |
          |  28.57 |  70.59 |
----------+--------+--------+
test      |     40 |     20 |     60
          |  32.26 |  16.13 |  48.39
          |  66.67 |  33.33 |
          |  71.43 |  29.41 |
----------+--------+--------+
Total           56       68      124
             45.16    54.84   100.00

Statistics for Table of treat by outcome

Statistic                      DF      Value     Prob
------------------------------------------------------
Chi-Square                      1    21.7087   <.0001
Likelihood Ratio Chi-Square     1    22.3768   <.0001
Continuity Adj. Chi-Square      1    20.0589   <.0001
Mantel-Haenszel Chi-Square      1    21.5336   <.0001
Phi Coefficient                      -0.4184
Contingency Coefficient               0.3860
Cramer's V                           -0.4184

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)          16
Left-sided Pr <= F         2.838E-06
Right-sided Pr >= F           1.0000
Table Probability (P)      2.397E-06
Two-sided Pr <= P          4.754E-06

Sample Size = 124

data severe;
  input treat $ outcome $ count ;
cards;
Test f 10
Test u 2
Control f 2
Control u 4
;
proc freq order=data;
  tables treat*outcome / chisq nocol;
  weight count;
run;

The SAS System
The FREQ Procedure

Table of treat by outcome

treat      outcome
Frequency |
Percent   |
Row Pct   |f       |u       |  Total
----------+--------+--------+
Test      |     10 |      2 |     12
          |  55.56 |  11.11 |  66.67
          |  83.33 |  16.67 |
----------+--------+--------+
Control   |      2 |      4 |      6
          |  11.11 |  22.22 |  33.33
          |  33.33 |  66.67 |
----------+--------+--------+
Total           12        6       18
             66.67    33.33   100.00

Statistics for Table of treat by outcome

Statistic                      DF      Value     Prob
------------------------------------------------------
Chi-Square                      1     4.5000   0.0339
Likelihood Ratio Chi-Square     1     4.4629   0.0346
Continuity Adj. Chi-Square      1     2.5313   0.1116
Mantel-Haenszel Chi-Square      1     4.2500   0.0393
Phi Coefficient                       0.5000
Contingency Coefficient               0.4472
Cramer's V                            0.5000

WARNING: 75% of the cells have expected counts less than 5.
         Chi-Square may not be a valid test.

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)          10
Left-sided Pr <= F            0.9961
Right-sided Pr >= F           0.0573
Table Probability (P)         0.0533
Two-sided Pr <= P             0.1070

Sample Size = 18
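The "Chi-Square" and "Continuity Adj. Chi-Square" rows of the two SAS listings can be cross-checked in R, where `chisq.test` applies the Yates correction to 2×2 tables when `correct = TRUE`. This is a sketch for comparison only; the helper names `pearson` and `yates` are made up here.

```r
# Counts copied from the two SAS tables above (rows x columns as printed)
respire <- matrix(c(16, 48,
                    40, 20), nrow = 2, byrow = TRUE,
                  dimnames = list(treat = c("placebo", "test"), outcome = c("f", "u")))
severe  <- matrix(c(10, 2,
                     2, 4), nrow = 2, byrow = TRUE,
                  dimnames = list(treat = c("Test", "Control"), outcome = c("f", "u")))

# Pearson chi-square (correct = FALSE) and Yates-corrected version (correct = TRUE);
# warnings about small expected counts are silenced because they are expected here
pearson <- function(tab) unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)
yates   <- function(tab) unname(suppressWarnings(chisq.test(tab, correct = TRUE))$statistic)

round(c(pearson(respire), yates(respire)), 4)   # compare with 21.7087 and 20.0589
round(c(pearson(severe),  yates(severe)),  4)   # compare with 4.5000 and 2.5313
```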

Exact Test: Table Probabilities

Table cell
(1,1)  (1,2)  (2,1)  (2,2)   Probability
 12      0      0      6       .0001
 11      1      1      5       .0039
 10      2      2      4       .0533
  9      3      3      3       .2370
  8      4      4      2       .4000
  7      5      5      1       .2560
  6      6      6      0       .0498

• One-tailed p-value: $p = 0.0533 + 0.0039 + 0.0001 = 0.0573$

• Two-tailed p-value: $p = 0.0533 + 0.0039 + 0.0001 + 0.0498 = 0.1071$
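The table probabilities above are hypergeometric probabilities with the margins held fixed, so they can be reproduced with `dhyper`. The sketch below assumes the margins of the `severe` table (row totals 12 and 6, first column total 12); the small tolerance factor in the two-sided sum mirrors what `fisher.test` does to guard against floating-point ties.

```r
# Margins of the 'severe' table: row totals 12 and 6, first column total 12
x <- 12:6                                   # possible (1,1) counts, as listed above
prob <- dhyper(x, m = 12, n = 6, k = 12)    # hypergeometric table probabilities
round(data.frame(cell11 = x, prob = prob), 4)

obs <- 10                                   # observed (1,1) count
p_one <- sum(prob[x >= obs])                # tables at least as extreme in one direction
p_two <- sum(prob[prob <= dhyper(obs, 12, 6, 12) * (1 + 1e-7)])  # tables no more likely than observed

p_one   # about 0.0573
p_two   # about 0.1070

# Built-in version for comparison
fisher.test(matrix(c(10, 2, 2, 4), nrow = 2, byrow = TRUE))$p.value
```

The hand-added two-tailed value 0.1071 differs from 0.1070 in the last digit only because the individual probabilities were rounded before summing.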

McNemar Test: Matched pairs

data one;
  input hus_resp $ wif_resp $ no ;
datalines;
yes yes 20
yes no 5
no yes 10
no no 10
;
run;

proc freq ;
  tables hus_resp*wif_resp / agree ;   * AGREE requests McNemar's test and kappa ;
  weight no ;
run;

We do not reject "$H_0$: the approval rates of husbands and wives are the same".

Kappa = 1: perfect agreement; Kappa > 0.8: excellent agreement; Kappa > 0.4: moderate agreement.

The confidence interval does not include 0, so we reject the null hypothesis K = 0 at the 95% confidence level.
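Since the SAS output for this example is not reproduced here, the following sketch recomputes McNemar's statistic and the simple kappa coefficient in R from the same counts. It uses the uncorrected McNemar statistic, and the formulas are the standard textbook ones rather than anything quoted from the notes.

```r
# Husband (rows) by wife (columns) counts from the SAS data step above
tab <- matrix(c(20,  5,
                10, 10), nrow = 2, byrow = TRUE,
              dimnames = list(husband = c("yes", "no"), wife = c("yes", "no")))

# McNemar's test uses only the discordant pairs: (b - c)^2 / (b + c)
b <- tab["yes", "no"]; c2 <- tab["no", "yes"]
mcnemar_stat <- (b - c2)^2 / (b + c2)
mcnemar_p    <- pchisq(mcnemar_stat, df = 1, lower.tail = FALSE)
mcnemar.test(tab, correct = FALSE)      # built-in version, no continuity correction

# Cohen's simple kappa: (observed agreement - chance agreement) / (1 - chance agreement)
n  <- sum(tab)
po <- sum(diag(tab)) / n
pe <- sum(rowSums(tab) * colSums(tab)) / n^2
kappa_hat <- (po - pe) / (1 - pe)

mcnemar_stat; mcnemar_p; kappa_hat
```

The McNemar p-value comes out well above 0.05, which is consistent with not rejecting equal approval rates.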

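# Chi-square tests in R (filename: chisq.r)

# 2 x 2 table, filled row by row
data <- matrix(c(25, 5, 15, 15), ncol = 2, byrow = TRUE)
data

# 3 x 2 table with some small counts
data2 <- matrix(c(16, 11, 3, 21, 8, 1), ncol = 2, byrow = TRUE)
data2

chisq.test(data)     # Pearson chi-square test (Yates correction by default for 2 x 2)
chisq.test(data2)    # may warn when expected counts are small
fisher.test(data2)   # Fisher's exact test

# Paired 2 x 2 table for McNemar's test
data <- matrix(c(6, 2, 8, 4), ncol = 2, byrow = TRUE)
data
mcnemar.test(data)

## From Agresti (2007), p. 39
M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(M) <- list(gender = c("M", "F"),
                    party  = c("Democrat", "Independent", "Republican"))
M
colSums(M)
rowSums(M)
cbind(M, rowSums(M))              # append row totals
rbind(M, colSums(M))              # append column totals
prop.table(M, margin = 2) * 100   # column percentages

(Xsq <- chisq.test(M))  # Prints test summary
Xsq$observed            # observed counts (same as M)
Xsq$expected            # expected counts under the null
Xsq$residuals           # Pearson residuals
sum(Xsq$residuals^2)    # sum of squared residuals = chi-square statistic
pchisq(sum(Xsq$residuals^2),
       df = (ncol(M) - 1) * (nrow(M) - 1), lower.tail = FALSE)   # p-value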

Statistical tests by variable type

Dependent variable | Independent variable | Statistical test
Continuous (blood pressure) | Categorical (2 levels) | t test, paired t test
Continuous (blood pressure) | Categorical (3 or more levels) | Analysis of variance (ANOVA)
Categorical (disease status) | Categorical (treatments: A, B, C, etc.) | Chi-square test (one independent variable); logistic regression (more than one independent variable)
Continuous (infant weight) | Continuous (gestation) | Regression analysis
Continuous (birth weight) | Continuous + categorical (gestation + smoking status) | Analysis of covariance (ANCOVA)
Survival time (continuous, > 0) | Continuous + categorical (age + smoking status) | Survival analysis

Parametric and non-parametric tests by characteristics of the data

• Categorical dependent variable: chi-square test, Fisher's exact test, McNemar test, Cochran's Q

• Continuous dependent variable, two independent groups: t test (parametric); Wilcoxon rank sum test, Mann-Whitney median test (non-parametric)

• Paired observations: paired t test (parametric); Wilcoxon signed rank test (non-parametric)

• More than 2 groups: ANOVA (parametric); Kruskal-Wallis test (non-parametric)

• Adjusting for other variables: 2-way ANOVA (parametric); Friedman's 2-way ANOVA (non-parametric)

• Correlation: Pearson correlation (parametric); Spearman's correlation, Kendall's tau, Stuart's tau (non-parametric)
