Chapter 11 Nonparametric and Distribution-Free Statistics - Seoul National University...

Chapter 11 Nonparametric and Distribution-Free Statistics

(nonparametric analysis)

2014/6/1

11.1 Introduction

• Parametric approach

– Assumes a distribution for the population

– The distribution is a function of its parameters

– Knowing the parameters determines the distribution completely; the characteristics of the population are determined by the parameters

– Estimation and testing of the parameters are the main problems

→ If the distributional assumption about the population is wrong, all results of the analysis are questionable.

• Nonparametric approach

– Does not assume a distribution for the population (distribution-free method)

– Uses the order (ranks) of the data

– If the parametric assumptions are valid, the parametric method is much more efficient (smaller variance, smaller p-value)

data             mean     median
1,2,3,4,5        3        3
1,2,3,4,5,100    19.17    3.5

The median is robust to outliers, whereas the mean is sensitive to them.

The median stays the same if 100 is replaced by 10000000.

Nonparametric methods typically use the order of the data, not the values themselves.
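A small check of the mean/median comparison above, written as a Python sketch (Python is used only for illustration; the course examples use SAS). It uses the two data sets from the table plus the variant in which 100 is replaced by 10000000.

```python
from statistics import mean, median

base = [1, 2, 3, 4, 5]
with_outlier = base + [100]
with_huge_outlier = base + [10_000_000]

for data in (base, with_outlier, with_huge_outlier):
    # The mean follows the outlier; the median depends only on the order.
    print(data, "mean =", round(mean(data), 2), "median =", median(data))
```

The median of the last two data sets is identical because only the position of the extreme value in the ordering matters, not its size.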

Parametric vs. nonparametric methods

• Nonparametric methods do not depend on a parametric distributional assumption (e.g. normality).

• They typically use ranks rather than the mean and variance.

• If the distributional assumptions (e.g. normality) are valid, nonparametric methods are less efficient (larger variance).

• They give robust results (not sensitive to outliers).

11.2 Measurement scale

• Nominal scale: male, female; Seoul, Busan (NY, LA)

• Ordinal scale: high, medium, low

• Interval scale: the order is meaningful, and absolute differences are also meaningful

• Ratio scale: ratios are also meaningful

11.3 Sign Test

• Example 11.3.1

• Hypotheses: H0: median = 5, HA: median ≠ 5

• Decision rule

HA: P(+) > P(-): a large enough number of +'s -> reject H0

HA: P(+) < P(-): a large enough number of -'s -> reject H0

HA: P(+) ≠ P(-): a large enough number of +'s or -'s -> reject H0

In Example 11.3.1, H0: median = 5 is tested against HA: P(+) > P(-). One of the 10 observations has difference 0 and is excluded, so under H0 the number of +'s out of the remaining 9 follows Bin(9, 1/2).

Scores above (+) or below (-) the hypothesized median (5)

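• Test statistic: k = number of the less frequent sign (here the minus signs) among the n = 9 nonzero differences

$$P(k \le x \mid n, p) = \sum_{k=0}^{x} \binom{n}{k} p^{k} q^{\,n-k}, \qquad p = q = \tfrac{1}{2}$$

$$P(k = 0) = \binom{9}{0}\left(\tfrac{1}{2}\right)^{0}\left(\tfrac{1}{2}\right)^{9} = 0.00195$$

$$P(k = 1) = \binom{9}{1}\left(\tfrac{1}{2}\right)^{1}\left(\tfrac{1}{2}\right)^{8} = 0.01758$$

$$P(k \le 1 \mid 9, \tfrac{1}{2}) = 0.00195 + 0.01758 = 0.0195 < 0.05$$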

Under the null hypothesis, a very rare event (one that statistically should not happen) has occurred, so something is wrong with H0.

∴ H0 is wrong. Reject H0.

The median is larger than 5 (one-sided, P = 0.0195), or the median is not 5 (two-sided, P = 2 × 0.0195 = 0.0391).

data sign;
  input score @@;
datalines;
4 5 8 8 9 6 10 7 6 6
;
run;

/* Test H0: median = 5; the sign test appears among the tests for location */
proc univariate data=sign mu0=5;
  var score;
run;

Output: 2-sided p = .0391; 1-sided p = .0391/2 = .0195
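The one-sided and two-sided sign-test p-values for Example 11.3.1 can be reproduced directly from the binomial distribution. The following is a minimal Python sketch (illustration only; the course uses SAS), using the ten scores from the data step and the hypothesized median 5. The helper name `binom_cdf` is introduced here for the example.

```python
from math import comb

scores = [4, 5, 8, 8, 9, 6, 10, 7, 6, 6]
median0 = 5  # hypothesized median

plus = sum(x > median0 for x in scores)
minus = sum(x < median0 for x in scores)
n = plus + minus  # observations equal to median0 are dropped

def binom_cdf(x, n, p=0.5):
    """P(K <= x) for K ~ Bin(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))

p_one_sided = binom_cdf(minus, n)        # P(K <= 1 | 9, 1/2)
p_two_sided = min(1.0, 2 * p_one_sided)  # doubled tail for HA: median != 5
print(plus, minus, n, round(p_one_sided, 4), round(p_two_sided, 4))
```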

• Example 11.3.2 (comparison of paired groups, paired data)

• Hypotheses

H0: the median of the differences is 0, P(+) = P(-)

HA: the median of the differences is negative, P(+) < P(-)

Hygiene scores: instructed vs. not-instructed

• Test statistic: k = # of (+)

$$P(k \le 2 \mid 11, 0.5) = \sum_{k=0}^{2} \binom{11}{k}\left(\tfrac{1}{2}\right)^{k}\left(\tfrac{1}{2}\right)^{11-k} = 0.0327$$

or, equivalently, $P(k \ge 9 \mid 11, 0.5)$ (Table A)

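data pair;
  input edu noedu;
  diff = noedu - edu;   /* not-instructed minus instructed */
datalines;
1.5 2.0
2.0 2.0
3.5 4.0
3.0 2.5
3.5 4.0
2.5 3.0
2.0 3.5
1.5 3.0
1.5 2.5
2.0 2.5
3.0 2.5
2.0 2.5
;
run;

/* Sign test on the paired differences (default mu0=0) */
proc univariate data=pair;
  var diff;
run;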

Output: 2-sided p = .0654; 1-sided p = .0654/2 = .0327
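For the paired example, the signs can be counted from the twelve pairs and the tail probability computed exactly. This is a Python sketch for illustration (not the SAS procedure itself). Differences are taken here as instructed minus not-instructed, so the sign is the opposite of `diff` in the SAS data step; the two tail probabilities coincide.

```python
from math import comb

# (instructed, not instructed) pairs copied from the data step
pairs = [(1.5, 2.0), (2.0, 2.0), (3.5, 4.0), (3.0, 2.5), (3.5, 4.0), (2.5, 3.0),
         (2.0, 3.5), (1.5, 3.0), (1.5, 2.5), (2.0, 2.5), (3.0, 2.5), (2.0, 2.5)]

diffs = [edu - noedu for edu, noedu in pairs]
n_plus = sum(d > 0 for d in diffs)
n_minus = sum(d < 0 for d in diffs)
n = n_plus + n_minus  # zero differences are dropped

# Under H0 the number of +'s is Bin(n, 1/2)
p_lower = sum(comb(n, k) for k in range(n_plus + 1)) / 2**n      # P(K <= 2)
p_upper = sum(comb(n, k) for k in range(n_minus, n + 1)) / 2**n  # P(K >= 9)
print(n_plus, n_minus, n, round(p_lower, 4), round(p_upper, 4))
```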

Homework

• Exercise 11.3.3

11.4 Median Test

• H0: Median(rural) = Median(urban)

Mental health scores of urban and rural subjects, classified as # >= median or # < median (2×2 table)

• Under H0, the rows and columns of the 2×2 table are independent.

$$\chi^2 = \frac{n(ad-bc)^2}{(a+c)(b+d)(a+b)(c+d)} = \frac{28\,(6 \times 4 - 8 \times 10)^2}{16 \cdot 12 \cdot 14 \cdot 14} = 2.33 < 3.841$$

where 3.841 is the critical value of $\chi^2_1$ at $\alpha = 0.05$. Since $2.33 < 2.706$, $p > 0.10$.

• ∴ Do not reject H0. The medians of the two groups are not significantly different.
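The median-test statistic for a 2×2 table, χ² = n(ad − bc)² / ((a+c)(b+d)(a+b)(c+d)), is simple enough to check by hand or with a few lines of Python (illustration only). The cell counts a = 6, b = 8, c = 10, d = 4 are the ones appearing in the arithmetic of this example; which cell belongs to which group is not assumed here.

```python
def median_test_chi2(a, b, c, d):
    """Chi-square statistic (1 df, no continuity correction) for a 2x2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

chi2 = median_test_chi2(6, 8, 10, 4)
critical_05 = 3.841  # chi-square(1) critical value at alpha = 0.05, as quoted in the text
reject = chi2 > critical_05
print(round(chi2, 2), reject)
```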

11.5 Mann-Whitney test

• Assumptions: the two samples have sizes n and m, respectively.

① They are drawn independently and randomly.

② The measurement scale is at least ordinal.

③ The two populations have the same distribution and differ only in their medians; the shapes are exactly the same.

• Example 11.5.1

Table columns: Exposed, rank; Unexposed, rank; rank sum of X

• Hypothesis ① one-sided

$$H_0: M_X \ge M_Y, \qquad H_A: M_X < M_Y$$

• Decision rule: if the rank sum of X, S, is small (equivalently, T is small), then H0 is rejected.

$$T = S - \frac{n(n+1)}{2} = 145 - \frac{15 \cdot 16}{2} = 25$$

S = rank sum of X, n = sample size of X, m = sample size of Y

T = 25 < 45 -> reject H0 (Table J: n = 15, m = 10, alpha = 0.05)

• Hypothesis ② two-sided

$$H_0: M_X = M_Y, \qquad H_A: M_X \ne M_Y$$

• Rejection region: the rank sum of X is too big or too small

$$T < W_{\alpha/2} \quad \text{or} \quad T > nm - W_{\alpha/2}, \qquad W_{\alpha/2} = W_{0.05/2} = 40$$

• Example 11.5.1: reject H0 if T < 40 or T > 110. T = 25 < 40 -> H0 of the two-sided test is rejected.

(Hemoglobin levels are different.)
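The arithmetic of Example 11.5.1 can be laid out in a short Python sketch (illustration only). It uses the values given in the text: S = 145, n = 15, m = 10, and the Table J critical values 45 (one-sided) and 40 (two-sided). The function name is introduced here for the example.

```python
def mann_whitney_T(S, n):
    """T = S - n(n+1)/2, where S is the rank sum of the X sample of size n."""
    return S - n * (n + 1) / 2

n, m = 15, 10
S = 145
T = mann_whitney_T(S, n)

one_sided_reject = T < 45                   # HA: M_X < M_Y
w = 40                                      # W_{alpha/2} from Table J
upper_bound = n * m - w                     # 110
two_sided_reject = T < w or T > upper_bound
print(T, one_sided_reject, upper_bound, two_sided_reject)
```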

• n, m > 20 -> no table like Table J -> standardize T and use the normal approximation.

$$Z = \frac{T - nm/2}{\sqrt{nm(n+m+1)/12}}$$

• Hypothesis ① two-sided: $H_0: M_X = M_Y$, $H_A: M_X \ne M_Y$

If $|Z| > Z_{1-\alpha/2}$, then reject $H_0$.

② one-sided:

$H_0: M_X \ge M_Y$, $H_A: M_X < M_Y$. Reject H0 if the rank sum of X is small enough: if $Z < -Z_{1-\alpha}$, then reject $H_0$.

$H_0: M_X \le M_Y$, $H_A: M_X > M_Y$. Reject H0 if the rank sum of X is big enough: if $Z > Z_{1-\alpha}$, then reject $H_0$.

SAS proc npar1way with the wilcoxon option performs the Mann-Whitney test on two-sample data. This test is often called the Wilcoxon-Mann-Whitney test.
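The normal approximation Z = (T − nm/2) / sqrt(nm(n+m+1)/12) can be sketched in Python as follows (illustration only, not the SAS implementation). The inputs T = 25, n = 15, m = 10 are borrowed from Example 11.5.1 purely to show the formula at work; those sample sizes are below 20, so in practice the table would be used instead.

```python
from math import sqrt
from statistics import NormalDist

def mann_whitney_z(T, n, m):
    """Standardized Mann-Whitney statistic (no tie or continuity correction)."""
    return (T - n * m / 2) / sqrt(n * m * (n + m + 1) / 12)

z = mann_whitney_z(25, 15, 10)
p_lower = NormalDist().cdf(z)                # for HA: M_X < M_Y
p_two_sided = 2 * min(p_lower, 1 - p_lower)  # for HA: M_X != M_Y
print(round(z, 3), round(p_lower, 4), round(p_two_sided, 4))
```

SAS may apply tie and continuity corrections, so its reported Z can differ slightly from this plain formula.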

11.6 Kolmogorov-Smirnov (K-S) goodness-of-fit test

• Are the cumulative distributions the same? ⇔ Are the distributions of the two populations the same?

• Test statistic

$\hat F_S(x)$: sample cumulative distribution function (the proportion of sample values $\le x$)

$F_T(x) = \Pr(X \le x)$: hypothesized population cumulative distribution function

$$H_0: F_S(x) = F_T(x), \qquad H_A: F_S(x) \ne F_T(x)$$

$$D = \sup_x \left| \hat F_S(x) - F_T(x) \right|$$

• Calculation, Example 11.6.1: does the fasting blood glucose level follow a normal distribution?

• Assumptions of the K-S test: ① The sample is a random sample. ② The hypothesized distribution $F_T(x)$ is continuous.

• Example 11.6.1

• H0: the data came from N(80, 6²). D = 0.1547 < 0.174 (table, p. 582), so p > 0.20. ∴ Not significant.
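The statistic D = sup |F_S(x) − F_T(x)| is attained at a sample point, just before or just after a jump of the sample CDF, so both sides of each jump must be checked. The Python sketch below illustrates this for the hypothesized N(80, 6²). The glucose values are hypothetical, made up for illustration, and are NOT the data of Example 11.6.1; `ks_statistic` is a name introduced here.

```python
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov D against a continuous hypothesized CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The sample CDF jumps from (i-1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

hypothesized = NormalDist(mu=80, sigma=6)
glucose = [72.0, 75.5, 78.0, 80.0, 81.5, 84.0, 86.5, 91.0]  # hypothetical values
D = ks_statistic(glucose, hypothesized.cdf)
print(round(D, 4))
```

D is then compared with the tabulated critical value for the sample size, as in the example.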

Homework

• Exercise 11.6.2

11.7 Kruskal-Wallis One-way ANOVA

• Hypotheses

H0: the k samples come from the same distribution.

HA: at least one sample comes from a distribution with a larger or smaller location parameter.

• Example 11.7.1

$$H = \frac{12}{n(n+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(n+1) \sim \chi^2_{k-1}$$

$$H = \frac{12}{13(13+1)}\left(\frac{55^2}{5} + \frac{26^2}{4} + \frac{10^2}{4}\right) - 3(13+1) = 10.68 > 7.76 \ (\text{Table L}) \ \Rightarrow\ p < 0.009$$
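The Kruskal-Wallis statistic H = 12/(n(n+1)) · Σ R_j²/n_j − 3(n+1) for Example 11.7.1 can be verified with a short Python sketch (illustration only). The rank sums 55, 26, 10 and group sizes 5, 4, 4 are those shown in the example; the function name is introduced here.

```python
def kruskal_wallis_H(rank_sums, sizes):
    """Kruskal-Wallis H from group rank sums R_j and group sizes n_j (no tie correction)."""
    n = sum(sizes)
    return 12 / (n * (n + 1)) * sum(r * r / nj for r, nj in zip(rank_sums, sizes)) - 3 * (n + 1)

rank_sums = [55, 26, 10]
sizes = [5, 4, 4]
n_total = sum(sizes)

# Sanity check: the ranks 1..n must add up to n(n+1)/2.
ranks_consistent = sum(rank_sums) == n_total * (n_total + 1) // 2

H = kruskal_wallis_H(rank_sums, sizes)
print(ranks_consistent, round(H, 2))
```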

• Under H0, the rank sums $R_1, R_2, \dots, R_k$ of the groups are similar.

• H is originally of the form $\sum (R_i - \bar{R})^2$. If the $R_i$ are similar, then $\sum (R_i - \bar{R})^2$ is small -> H is small, and we cannot reject H0.

• Example 11.7.2

• Net book value of equipment per bed by hospital type (five types)

$$H = 36.39 \sim \chi^2_{4}, \qquad n_1 = 10,\ n_2 = 8,\ n_3 = 9,\ n_4 = 7,\ n_5 = 7 \qquad \therefore \text{reject } H_0$$

For data with more than two groups (levels), SAS proc npar1way performs the Kruskal-Wallis test.

Homework

• Exercise 11.7.2

11.8 Friedman's 2-way ANOVA

• Example 11.8.1

Physical therapists' ranks of three low-volt electrical stimulators

$$\chi_r^2 = \frac{12}{nk(k+1)} \sum_{j=1}^{k} R_j^2 - 3n(k+1) = 8.222, \qquad p = 0.016 \ (\text{Table Ma}) \qquad \therefore \text{reject } H_0$$

Please correct your textbook!!
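The Friedman statistic χ²_r = 12/(nk(k+1)) · Σ R_j² − 3n(k+1) is computed from within-block ranks, where n is the number of blocks (raters) and k the number of treatments. The Python sketch below is for illustration only. The rank table is hypothetical and is NOT the data of Example 11.8.1, whose individual ranks are not reproduced in these notes.

```python
def friedman_chi2(ranks):
    """ranks: one row per block (rater), one column per treatment, each row ranked 1..k."""
    n = len(ranks)
    k = len(ranks[0])
    rank_sums = [sum(row[j] for row in ranks) for j in range(k)]
    return 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)

# Hypothetical ranks from 4 raters for 3 treatments
hypothetical_ranks = [
    [1, 2, 3],
    [1, 3, 2],
    [1, 2, 3],
    [2, 1, 3],
]
chi2_r = friedman_chi2(hypothetical_ranks)
print(chi2_r)
```

With these made-up ranks the column rank sums are 5, 8 and 11, giving χ²_r = 4.5.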

11.9.1 Spearman rank correlation coefficient

• Two-sided: H0: X and Y are independent. HA: X and Y are not independent.

• One-sided: H0: X and Y are independent. HA: X and Y have a positive (direct) association. Or: H0: X and Y are independent. HA: X and Y have a negative (inverse) association.

• Example 11.9.1

Age and EEG

$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2-1)} = -0.76 < -0.6586, \qquad p < 0.001 \qquad \therefore \text{reject } H_0$$

• Steps: ① Rank X and Y separately. ② $d_i = \text{rank}(x_i) - \text{rank}(y_i)$ ③ Calculate $\sum d_i^2$.

• With a negative association, $\sum d_i^2$ becomes large and $r_s$ becomes small.

• A small $r_s$ -> reject H0. ∴ We conclude that there is a negative association between X and Y. (Table N)
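The three steps (rank X and Y separately, take rank differences d_i, sum their squares) translate directly into code. Below is a minimal Python sketch for illustration. The five (x, y) pairs are hypothetical and are NOT the Age and EEG data of Example 11.9.1, and the simple ranking helper assumes there are no ties.

```python
def ranks_no_ties(values):
    """Rank 1 = smallest; valid only when all values are distinct."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rs(x, y):
    rx, ry = ranks_no_ties(x), ranks_no_ties(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # sum of squared rank differences
    n = len(x)
    return 1 - 6 * d2 / (n * (n * n - 1)), d2

# Hypothetical data with a mostly decreasing relationship
x = [20, 35, 42, 58, 66]
y = [98, 75, 90, 60, 54]
r_s, sum_d2 = spearman_rs(x, y)
print(sum_d2, round(r_s, 2))
```

Here the large Σd_i² = 38 gives r_s = −0.9, matching the rule that a negative association produces a large Σd_i² and a small r_s.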

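• Example 11.9.2 (when n > 30)

$$r_s = 0.75, \qquad Z = r_s \sqrt{n-1} = 4.37 > 1.96 \qquad \therefore \text{reject } H_0$$

If $|Z| > Z_{1-\alpha/2}$, then reject $H_0$.

• Since $Z = r_s \sqrt{n-1}$ has the same sign as $r_s$: when the $d_i^2$ are large (negative association), Z is very small (strongly negative); when the $d_i^2$ are small (positive association), Z is very large (strongly positive).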
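The large-sample statistic Z = r_s · sqrt(n − 1) from Example 11.9.2 can be checked in Python (illustration only). The sample size is not stated in these notes; the sketch infers it from the reported r_s = 0.75 and Z = 4.37, so n = 35 below is an assumption derived from those two numbers, not a value quoted from the example.

```python
from math import sqrt

r_s = 0.75
z_reported = 4.37

# Sample size implied by Z = r_s * sqrt(n - 1); an inference, not given in the text.
n_implied = round((z_reported / r_s) ** 2 + 1)

z = r_s * sqrt(n_implied - 1)
reject_two_sided = abs(z) > 1.96  # Z_{1 - alpha/2} for alpha = 0.05
print(n_implied, round(z, 2), reject_two_sided)
```

A positive Z of this size points to a positive association between X and Y.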

Homework

• Exercise 11.9.2

• Mod20.sas

/* File name : mod20.sas   Nonparametric One-Way ANOVA */
options pageno=1 nodate ls=130 ps=60 nocenter;
filename inbrakes 'd:\myweb\intro\taillite.dat';

data one;
  infile inbrakes;
  input id vehtype group positn speedzn resptime follotme folltmec;
  if group=1;   /* keep only the Light On group */
  label vehtype='Vehicle Type'
        group='Group - Light On=1 Light Off=2'
        positn='Light Position'
        speedzn='Speed Zone'
        resptime='Response Time'
        follotme='Following Time in Video Frames'
        folltmec='Following Time in Categories';
run;

proc sort data=one; by vehtype; run;

/* Let's do one-way ANOVA to see the effect of vehicle type */
proc anova data=one;
  class vehtype;
  model resptime=vehtype;
  title 'Parametric ANOVA analysis';
run;

/* What's wrong with this? We didn't check the normality assumption.
   Let's use proc univariate to check normality. */
proc univariate data=one normal plot;
  var resptime;
  by vehtype;
  title 'Normality Check';
run;

/* NOT NORMALLY DISTRIBUTED >> NONPARAMETRIC ANOVA */
proc npar1way data=one wilcoxon;
  class vehtype;
  var resptime;
  title 'Nonpara One-Way ANOVA for Tail Light Study';
run;

/* The other way is transformation. Let's take the log transformation
   so that we have a normal distribution. */
data t;
  set one;
  t=log(resptime);
  label t='ln (response time)';
run;

proc sort data=t; by vehtype; run;

proc univariate data=t normal plot;
  var t;
  by vehtype;
  title 'Normality Check for transformed variable';
run;

/* The transformed variable seems to be normally distributed.
   Then we can do parametric ANOVA with the normality assumption. */
proc anova data=t;
  class vehtype;
  model t=vehtype;
  title 'ANOVA for the log transformed response time';
run;

Testing the normality of data (SAS example)

data rods;   /* named rods so that the proc steps below can find it */
  input diameter @@;
  label diameter='Diameter in mm';
datalines;
5.501 5.251 5.404 5.366 5.445 5.576 5.607 5.200 5.977 5.177 ...
;
run;

/* Normality tests and a histogram with a fitted normal curve */
proc univariate data=rods normal;
  histogram diameter / normal(mu=est sigma=est) midpoints = 5 to 6.30 by 0.15;
run;

/* Normal probability plot */
proc univariate data=rods noprint;
  probplot diameter / normal(mu=est sigma=est);
run;

Skewed to the right

Box plot

What statistical analysis should I use?

Statistical analyses using SAS

• http://www.ats.ucla.edu/stat/sas/whatstat/whatstat.htm
