Data Analysis and Graphics with R (Phân tích dữ liệu và tạo biểu đồ bằng R)
Nguyễn Văn Tuấn
Nhà xuất bản Khoa học và Kỹ thuật (Science and Technics Publishing House), Ho Chi Minh City, 2006



Contents

Preface
Introduction to the R language
Data entry
Data editing
Simple calculations and matrices
Probability calculations and simulation
Hypothesis testing and the P value
Graphical analysis of data
Descriptive statistics
Linear regression analysis
Analysis of variance
Logistic regression analysis
Survival analysis
Meta-analysis
Experimental design and sample size estimation
Programming and writing functions in R
Common R commands
Glossary of terms used in the book
Afterword


1 Preface

Contrary to what many people think, statistics is a science in its own right: Statistical Science. Its analytical methods rest on mathematics and probability, but that is only the "technical" part; the more important part is designing studies and interpreting what the data mean. A statistician is therefore not merely someone who analyses data, but a scientist and a thinker about scientific research. This is why statistical science plays an indispensable role in scientific research, above all in the experimental sciences. Without statistics, today's genetic experiments, with their millions upon millions of data points, would be nothing but lifeless, meaningless numbers.

A research project, however costly or important, has no scientific value if it is not analysed with the correct methods. That is why almost every medical paper in the world's scientific journals contains a "Statistical Analysis" section, in which the authors must carefully describe how the analysis and computations were done and briefly explain why those methods were used, so as to underwrite, or add scientific weight to, the statements made in the paper. The more prestigious the medical journal, the more demanding its requirements for statistical analysis. To repeat for emphasis: without statistical analysis, a paper has no scientific meaning.

One of the most important developments in statistical science is the use of computers for statistical analysis and computation. It is no exaggeration to say that without computers, statistics would still be a dry, dull science of complicated formulas with little practical application. Computers have enabled the greatest revolution in the history of the discipline: bringing statistics into practice, solving the thorniest problems and contributing to the progress of experimental science.

More than 20 years ago, when I was a master's student in statistics in Australia, a respected professor told a story about the eminent American statistician Fred Mosteller. During the Second World War he received a research contract from the US Department of Defense to improve the accuracy of American weapons, for which he had to solve a statistical problem involving about 30 parameters. He had to hire 20 graduate students for the job: 10 did nothing but calculate by hand all day, and the other 10 checked the calculations of the first 10. The work took nearly a month.


Today, even a modest personal computer can carry out that analysis in about a second. But a computer without software is just a lifeless, useless heap of metal and silicon. One piece of software that has been, is and will be revolutionising statistics is R. It has been developed and refined over roughly the past 10 years by statistical researchers and scientists around the world, for use in learning, teaching and research. This book introduces the reader to the use of R for statistical analysis and graphics.

Why R? Software for statistical analysis has long been available and is quite widely used. Well-known packages, from old ones such as MINITAB and BMD-P to relatively newer ones such as STATISTICA, SPSS, SAS and Stata, are usually very expensive (a university licence can cost hundreds of thousands of dollars a year), beyond the means of an individual or even of some universities. R has changed this situation, because R is completely free. Contrary to common belief, free does not mean poor quality. Not only is R free, it can do all (to repeat: all) of the analyses that commercial software can do, and more. Anyone can download R to a personal computer at any time, from anywhere in the world, and it is ready to use after a few minutes of installation. For this reason the great majority of universities in the West and around the world are increasingly switching to R for learning, research and teaching.

In line with this trend, this book has the modest aim of helping readers in Vietnam keep up with developments in statistical computing and analysis elsewhere in the world. It is written mainly for university students and scientific researchers who need software for learning statistics, analysing data or drawing graphs from scientific data. It is not a textbook of statistical theory, nor is it meant to teach the reader how to do statistical analysis, but it will help the reader do statistical analysis more effectively and with more enjoyment. My main purpose is to provide basic knowledge of statistics and of how to apply R to solve problems, as a foundation from which the reader can explore or develop R further.

I believe that, as in any profession, the best way to learn statistical analysis is to do the analysis yourself. The book therefore contains many examples and real data sets. The reader can read the book and follow its instructions at the same time (by typing the commands into the computer), which makes the learning more enjoyable. Readers who already have research data of their own will learn more effectively by applying the calculations in the book to those data straight away. Students who do not yet have data can use simulation to gain a better understanding of statistics.

Statistical science is relatively new in our country, so some terms have not yet been translated in a uniform and complete way. The reader will therefore come across a few unfamiliar terms here and there; in such cases I have tried to give the original English term for reference. In addition, the end of the book lists the English and Vietnamese terms used in it.

All the data and code used in this book can be downloaded from the internet to a personal computer, or accessed directly at the website: http://www.r.ykhoanet.com.

I hope the reader will find in this book some useful information and some techniques or calculations that help with learning, teaching and research. No book is perfect or free of shortcomings, so if you find an error, please let me know by e-mail at [email protected] or [email protected]. My sincere thanks in advance.

I would like to take this opportunity to thank Dr Nguyen Hoang Dzung of the Faculty of Chemistry, Ho Chi Minh City University of Technology, who suggested printing this book in Vietnam and helped me to do so. I thank Dr Nguyen Dinh Nguyen, who read a large part of the manuscript, made many practical suggestions and designed the cover. I also thank Ms Vy Lan, editor at the Science and Technics Publishing House, who patiently read the manuscript closely, pointed out unclear passages for rewriting, and "sympathetically" kept the sentences that carry the author's own voice.

Now I invite the reader to join me on a short "statistical journey" with R.

Sydney, 31/3/2006
Nguyễn Văn Tuấn


2 Introduction to the R language

2.1 What is R?

In short, R is software for statistical analysis and graphics. In essence, R is a general-purpose computer language that can be used for many different purposes, from simple calculations and recreational mathematics to matrix computations and complex statistical analyses. Because it is a language, R can also be used to develop specialised software for a particular computational problem.

R was created by two statisticians, Ross Ihaka and Robert Gentleman. Since its birth, a great many statistical and mathematical researchers around the world have supported R and taken part in its development. The policy of R's creators is one of openness (Open Access), and partly for this reason R is completely free. Anyone, anywhere in the world, can access and download the entire source code of R to their own computer.

After less than 5 years of development, many statisticians, mathematicians and researchers in every field have switched to R for analysing scientific data. There is now a worldwide network of nearly a million R users, and the number is growing exponentially. It may well be that within 10 years we will no longer need expensive statistical software such as SAS, SPSS or Stata (which can cost up to 100,000 USD a year), because all such analyses can be carried out in R. Anyone doing scientific research should therefore learn to use R for statistical analysis and graphics. This chapter shows the reader how to use R.

2.2 Downloading and installing R

To use R, we must first install it on our computer. To do this, we go online and visit the website of the Comprehensive R Archive Network (CRAN): http://cran.R-project.org.


The name of the file to download depends on the version, but it usually begins with the letter R followed by the version number. For example, the version the author was using at the end of 2005 was 2.2.1, so the file to download is named R-2.2.1-win32.exe. The file is about 26 MB, and the exact address for downloading it is: http://cran.r-project.org/bin/windows/base/R-2.2.1-win32.exe

The same website offers a great deal of documentation on how to use R, at every level from elementary to advanced. For readers who are not yet comfortable with English, this book provides the information needed to use R without having to read the other documents.

Once R has been downloaded, the next step is to install (set up) the program. To do this, we simply click on the downloaded file and follow the installation instructions on the screen.

2.3 Packages for special analyses

R gives us a computer "language" and a number of functions for basic, simple analyses. For more complex analyses we need to download additional packages. A package is a small piece of software, developed by statisticians to solve a specific problem, that runs within the R system. For linear regression, for example, R has the function lm, but deeper and more complex analyses require packages such as lme4. Such packages must be downloaded and installed. The address for downloading them is again http://cran.r-project.org; click on "Packages" in the menu on the left of the web page. Some of the packages needed for the examples in this book are:


Package     Purpose
lattice     Drawing graphs and improving their appearance
Hmisc       Data modelling methods by F. Harrell
Design      Study design models by F. Harrell
Epi         Epidemiological analyses
epitools    Another package for epidemiological analyses
foreign     Importing data from other software such as SPSS, Stata, SAS, etc.
Rmeta       Meta-analysis
meta        Another package for meta-analysis
survival    Analysis with Cox's proportional hazards model
splines     Needed for survival to run
Zelig       Statistical analyses in the social sciences
genetics    Analysis of genetic data
gap         Analysis of genetic data
BMA         Bayesian Model Averaging
leaps       Used by BMA
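As a small sketch of how one might check which of these packages are already available, the code below tests a few names from the table. The selection of three packages is only an illustration; nothing is downloaded, and installing a missing package would be done separately with install.packages("name") while connected to the internet.

```r
# Illustrative selection of package names taken from the table above
pkgs <- c("lattice", "foreign", "survival")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Load a package only when it is actually available
if (installed[["foreign"]]) library(foreign)
```

Checking first avoids the error that library() raises when a package has not been installed.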

2.4 Starting and stopping R

Once installation is complete, an icon (labelled R 2.2.1) appears on the desktop, and R is ready to use. Clicking on this icon opens a window like the following:

[Figure: the R console window]


R is normally used in "command line" mode, which means that we type commands directly at the prompt in the window above. Commands must strictly follow the "grammar" and language of R; indeed, the whole purpose of this text is to help the reader understand and write in that language. One rule of this grammar is that R distinguishes between "Library" and "library": in other words, R is case sensitive. Another convention is that when a name consists of two words, R usually joins them with a dot instead of a space, as in data.frame, t.test, read.table, and so on. This matters a great deal, and overlooking it wastes the user's time.

If a command is grammatically correct, R returns a new prompt or prints a result (depending on the command); if it is not, R prints a short message saying that the command is wrong or not understood. For example, if we type a valid assignment such as:

> x <- 5

R accepts it and returns a new prompt. But if we type:

> R is great

R will not "agree" with this command, because it is not part of R's language, and the following message appears:

Error: syntax error
>


To leave R, we can simply click the close button (x) in the top corner of the window, or type the command q().

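2.5 The "grammar" of the R language

The general "grammar" of R is a command or function (sometimes referred to by the Vietnamese word "hàm"). A function has arguments, so the function name is followed by the arguments that we must supply. For example, in:

> reg <- lm(y ~ x)

reg is an object, lm is a function and y ~ x is its argument. Similarly, in:

> setwd("c:/works/stats")

setwd is a function and "c:/works/stats" is its argument. To find out which arguments a function takes, we use args(x) (args is short for arguments), where x is the function of interest:

> args(lm)
function (formula, data, subset, weights, na.action, method = "qr",
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
    contrasts = NULL, offset, ...)
NULL

R is an object-oriented language, which means that data in R are stored in objects. This orientation has some effect on how R is written. For example, to test whether x equals 5 we do not write x = 5, as we ordinarily would; R requires x == 5. In R, x = 5 is an assignment, equivalent to x <- 5.

Anything that follows the symbol # on a line is a comment and is ignored by R:

> # the following command simulates 10 normal values
> x <- rnorm(10)

An object name must be written without spaces: R accepts myobject but not my object. Because a long name can be hard to read, its parts can be separated with a dot, as in my.object. R is also case sensitive, so My.object.u and my.object.L are different objects:

> My.object.u + my.object.L
[1] 20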
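The difference between assignment and comparison can be tried out directly. The sketch below is only an illustration: the object names My.value and my.value and their values are chosen for the example and are not the author's.

```r
x <- 5            # assignment with the arrow
y = 5             # "=" also assigns when used at the top level
same <- (x == y)  # "==" compares and returns TRUE or FALSE
print(same)

# Names are case sensitive: these are two different objects
My.value <- 15
my.value <- 5
total <- My.value + my.value
print(total)
```

Using <- for assignment and keeping == for comparison makes it harder to confuse the two.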

A few points to bear in mind when naming objects in R:

Avoid naming a variable with an underscore (_), as in my_object, and do not use a hyphen, as in my-object, which R reads as a subtraction.

Avoid giving an object the same name as a variable in a data set. For example, if we have a data.frame (a data set) that contains a variable age, we should not create a separate object that is also called age.

To get help on a function, type ? followed by its name:

> ?lm

A window appears on the right of the screen explaining how the function is used, often with examples. The reader can simply copy and paste an example into R to see how it works.

Before using R, the reader may also wish to look through the documentation that comes with R, by choosing the Help menu and then Html help. The commands shown there can likewise be copied and pasted into R to see how R works.


Instead of using the menu, the reader can simply type:

> help.start()

and a window appears that documents the whole R system.

The function apropos is also very useful, because it lists every function in R whose name contains the string we are looking for. For example, to find out which functions in R have "lm" in their names:

> apropos("lm")

R then reports the matching functions that are available:

 [1] ".__C__anova.glm"      ".__C__anova.glm.null" ".__C__glm"
 [4] ".__C__glm.null"       ".__C__lm"             ".__C__mlm"
 [7] "anova.glm"            "anova.glmlist"        "anova.lm"
[10] "anova.lmlist"         "anova.mlm"            "anovalist.lm"
[13] "contr.helmert"        "glm"                  "glm.control"
[16] "glm.fit"              "glm.fit.null"         "hatvalues.lm"
[19] "KalmanForecast"       "KalmanLike"           "KalmanRun"
[22] "KalmanSmooth"         "lm"                   "lm.fit"
[25] "lm.fit.null"          "lm.influence"         "lm.wfit"
[28] "lm.wfit.null"         "model.frame.glm"      "model.frame.lm"
[31] "model.matrix.lm"      "nlm"                  "nlminb"
[34] "plot.lm"              "plot.mlm"             "predict.glm"
[37] "predict.lm"           "predict.mlm"          "print.glm"
[40] "print.lm"             "residuals.glm"        "residuals.lm"
[43] "rstandard.glm"        "rstandard.lm"         "rstudent.glm"
[46] "rstudent.lm"          "summary.glm"          "summary.lm"
[49] "summary.mlm"          "kappa.lm"

2.8 The working environment

Data must be kept in a directory on the computer. Before using R, it is probably best to create a directory to hold the data, for example c:\works\stats. To tell R where the data are, we use the command setwd (set working directory):

> setwd("c:/works/stats")

This tells R that the data will be kept in the directory c:\works\stats. Note that R uses the forward slash "/" and not the backslash "\" used by Windows.

Note also that R can read data directly from the internet. setwd itself accepts only a directory on the local file system, so to use a file hosted on a website such as http://www.r.ykhoanet.com/, we give the file's full URL to a reading function such as read.table instead of calling setwd.

To find out which directory R is currently working in, we type:

> getwd()
[1] "C:/Program Files/R/R-2.2.1"

R's default prompt is ">". If we would like a different prompt to suit our own taste, we can change it:

> options(prompt="R> ")
R>


Or:

> options(prompt="Tuan> ")
Tuan>

By default R prints lines 80 characters wide. If we want wider output, we type:

> options(width=100)

To make R less inclined to print numbers in scientific notation:

> options(scipen=3)

(The number of significant digits printed is controlled separately, with options(digits=...).)

All of these choices are made with the command options(). To see R's current settings, we simply type:

> options()

To find out the date:

> Sys.Date()
[1] "2006-03-31"

For readers who need more information, some documents available online (written in English) are also very useful. They can be downloaded free of charge:

R for beginners (by Emmanuel Paradis): http://cran.r-project.org/doc/contrib/rdebuts_en.pdf
Using R for data analysis and graphics (by John Maindonald): http://cran.r-project.org/doc/contrib/usingR.pdf

In addition, the author has a document in Vietnamese (114 pages long) summarising the most commonly used R commands, at the website www.r.ykhoanet.com.
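Because options() changes settings for the whole session, it can be convenient to save the previous values and restore them afterwards. The following is a minimal sketch of that pattern; the particular values 3 and 100 are arbitrary choices for illustration.

```r
# options() returns the previous values of the settings it changes
old <- options(digits = 3, width = 100)

print(pi)              # now printed with 3 significant digits
print(getOption("width"))

# Restore the earlier settings
options(old)
print(getOption("digits"))
```

Saving the returned list in old means the session can be put back exactly as it was, whatever the earlier values were.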


3 Data entry

To analyse data with R, the data must be in a form that R can understand and process: they must be held in a data.frame. There are many ways of getting data into a data.frame in R, from typing them in directly to importing them from various sources. The most common ways are described below.

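3.1 Entering data directly: c()

Example 1: we have the following data on age and insulin for 10 patients and want to enter them into R.

Age   Insulin
50    16.5
62    10.8
60    32.3
40    19.3
48    14.2
47    11.3
57    15.5
70    15.8
48    16.2
67    11.2

We can use the function named c as follows:

> age <- c(50, 62, 60, 40, 48, 47, 57, 70, 48, 67)
> insulin <- c(16.5, 10.8, 32.3, 19.3, 14.2, 11.3, 15.5, 15.8, 16.2, 11.2)

The two vectors are then combined into a data.frame named tuan, which can be saved for later use:

> tuan <- data.frame(age, insulin)
> setwd("c:/works/stats")
> save(tuan, file="tuan.rda")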

The setwd command (wd stands for working directory) tells R that we want to store the data in the directory c:\works\stats. Note that Windows normally uses the backslash (\), whereas in R we use the forward slash (/). The save command tells R to store the data held in the object tuan in a file named tuan.rda. After these two commands have been entered, a file named tuan.rda will be present in that directory.
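The saved file can later be read back with load. The sketch below repeats the example under one assumption: it writes to R's temporary directory (tempdir()) rather than to c:/works/stats, so that it runs on any machine.

```r
age <- c(50, 62, 60, 40, 48, 47, 57, 70, 48, 67)
insulin <- c(16.5, 10.8, 32.3, 19.3, 14.2, 11.3, 15.5, 15.8, 16.2, 11.2)
tuan <- data.frame(age, insulin)

# Temporary location used only for this illustration
path <- file.path(tempdir(), "tuan.rda")
save(tuan, file = path)

rm(tuan)      # remove the object from the workspace
load(path)    # restores the object under its original name, tuan
print(dim(tuan))
```

Note that load recreates the object with the name it had when it was saved, so no assignment is needed.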

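3.2 Entering data directly: edit(data.frame())

Example 1 (continued): we can also enter the age and insulin data for the 10 patients with a very useful function, edit(data.frame()). With this function R opens a new window containing rows and columns, much like Excel, in which we can type the data. For example:

> ins <- edit(data.frame())

Data stored in a text file can be read with read.table. Suppose the data are in a text file named chol.txt in the directory c:\works\stats and the first line of the file holds the column names:

> setwd("c:/works/stats")
> chol <- read.table("chol.txt", header=TRUE)

To look at the data we can type:

> chol

or:

> names(chol)

R then reports the columns in the data set (names asks which columns the data set contains and what they are called):

[1] "id"  "sex" "age" "bmi" "hdl" "ldl" "tc"  "tg"

We can now save the data in R format for later use:

> save(chol, file="chol.rda")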
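To try read.table without the book's data file, one can write a small text file first. In the sketch below the file name demo.txt and the three rows of values are invented for illustration; only the column names echo those of chol.

```r
# Write a tiny, made-up text file to a temporary location
path <- file.path(tempdir(), "demo.txt")
writeLines(c("id sex age",
             "1 Nam 57",
             "2 Nu 64",
             "3 Nu 60"), path)

# header = TRUE tells R that the first line holds the column names
demo <- read.table(path, header = TRUE)
print(names(demo))
print(dim(demo))
```

Without header = TRUE the first line would be read as data and the columns would be named V1, V2 and V3.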

3.4 Importing data from Excel: read.csv

To import data from Excel, we need two steps:

Step 1: use "Save as" in Excel to save the data in csv format;
Step 2: use R (the command read.csv) to read the csv file.

Example 3: a data set with the following columns is stored in Excel, and we want to move it into R for analysis. The file is named excel.xls.

ID  Age  Sex  Ethnicity  IGFI    IGFBP3  ALS     PINP    ICTP  P3NP
1   18   1    1          148.27  5.14    316.00  61.84   5.81  4.21
2   28   1    1          114.50  5.23    296.42  98.64   4.96  5.33
3   20   1    1          109.82  4.33    269.82  93.26   7.74  4.56
4   21   1    1          112.13  4.38    247.96  101.59  6.66  4.61
5   28   1    1          102.86  4.04    240.04  58.77   4.62  4.95
6   23   1    4          129.59  4.16    266.95  48.93   5.32  3.82
7   20   1    1          142.50  3.85    300.86  135.62  8.78  6.75
8   20   1    1          118.69  3.44    277.46  79.51   7.19  5.11
9   20   1    1          197.69  4.12    335.23  57.25   6.21  4.44
10  20   1    1          163.69  3.96    306.83  74.03   4.95  4.84
11  22   1    1          144.81  3.63    295.46  68.26   4.54  3.70
12  27   0    2          141.60  3.48    231.20  56.78   4.47  4.07
13  26   1    1          161.80  4.10    244.80  75.75   6.27  5.26
14  33   1    1          89.20   2.82    177.20  48.57   3.58  3.68
15  34   1    3          161.80  3.80    243.60  50.68   3.52  3.35
16  32   1    1          148.50  3.72    234.80  83.98   4.85  3.80
17  28   1    1          157.70  3.98    224.80  60.42   4.89  4.09
18  18   0    2          222.90  3.98    281.40  74.17   6.43  5.84
19  26   0    2          186.70  4.64    340.80  38.05   5.12  5.77
20  27   1    2          167.56  3.56    321.12  30.18   4.78  6.12

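The first thing to do, as mentioned above, is to save the data from Excel in csv format:

In Excel, choose File, then Save as.
Under Save as type, choose "CSV (Comma delimited)".

When this is done, we have a file named excel.csv in the directory c:\works\stats. The second step is to go to R and enter the following commands:

> setwd("c:/works/stats")
> gh <- read.csv("excel.csv", header=TRUE)

The read.csv command reads the data from excel.csv, using the first line as column names, and stores them in the object gh. We can then save gh in R format:

> save(gh, file="gh.rda")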

3.5 Importing data from SPSS: read.spss

The statistical package SPSS stores data in "sav" format. Suppose we have a data set named testo.sav in the directory c:\works\stats and want to convert it into a form that R can understand. For this we use the command read.spss in the package named foreign.

First we load foreign with the command library:

> library(foreign)

Then we use read.spss, asking for the result as a data.frame, and save it in R format:

> setwd("c:/works/stats")
> testo <- read.spss("testo.sav", to.data.frame=TRUE)
> save(testo, file="testo.rda")

3.6 Basic information about the data

Suppose we have read data into a data.frame named chol, as in the earlier example. To find out what the data set contains, we can proceed as follows.

Tell R that we want to work with chol by using attach(arg), where arg is the name of the data set:

> attach(chol)

Check whether chol is a data.frame with is.data.frame(arg), where arg is the name of the data set:

> is.data.frame(chol)
[1] TRUE

R confirms that chol is indeed a data.frame.

How many columns (variables) and rows (observations) does the data set have? We use dim(arg), where arg is the name of the data set (dim is short for dimension). R prints the result immediately after the command:

> dim(chol)
[1] 50 8

So we have 50 rows and 8 columns (variables). What are the variables called? We use names(arg), where arg is the name of the data set. For example:


> names(chol)
[1] "id"  "sex" "age" "bmi" "hdl" "ldl" "tc"  "tg"

How many men and women are there in the variable sex? To answer this question we can use table(arg), where arg is the name of the variable. For example:

> table(sex)
sex
nam Nam  Nu
  1  21  28

The result shows 21 men (Nam) and 28 women (Nu), plus one record coded nam in lower case, which R counts as a separate category because it is case sensitive.

These are a few of the ways of getting data into R. In practice R can read data from many widely used programs, including statistical packages such as SPSS (seen above), SAS, Stata and so on. To read data from these programs, the reader must install the package foreign, which is available from the official R website, and load it into R.
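A stray category such as nam can be merged into Nam before any analysis. The sketch below shows one way of doing this on a short, made-up character vector; it is an illustration of the idea, not the author's procedure for chol.

```r
# Made-up values with one inconsistently coded entry
sex <- c("Nam", "nam", "Nu", "Nu", "Nam")
print(table(sex))

# Recode the lower-case entry so that it joins the "Nam" category
sex[sex == "nam"] <- "Nam"
print(table(sex))
```

After recoding, table reports only the two intended categories.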


4 Data editing

Editing data here does not mean changing the original data (that would be a serious offence, an unacceptable act of scientific fraud); it only means organising the data so that R can analyse them efficiently. In statistical analysis we often need to combine data into one group, split them into several groups, or convert characters to numbers for ease of calculation. This chapter goes through some basic commands for editing data.

We return to the data set chol. To make the story easier to follow, recall that we read the data into an R data set named chol from a text file named chol.txt:

> setwd("c:/works/stats")
> chol <- read.table("chol.txt", header=TRUE)
> attach(chol)

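4.1 Checking for missing values

In research, for many reasons, data cannot always be collected for every subject, nor every variable measured for a given subject. Such empty entries are called missing values, and R represents them as NA. Some statistical tests require missing values to be removed before the analysis, because they cannot be used in calculations. R has a very useful command for this, na.omit, which is used as follows:

> chol.new <- na.omit(chol)

A data.frame can also be split into smaller ones with subset. For example, we can create separate data.frames for men (nam) and women (nu), or a data.frame old containing the patients aged 60 or over:

> nam <- subset(chol, sex=="Nam")
> nu <- subset(chol, sex=="Nu")
> old <- subset(chol, age>=60)
> dim(old)
[1] 25  8

Or a new data.frame containing the male patients aged 60 or over:

> n60 <- subset(chol, age>=60 & sex=="Nam")
> dim(n60)
[1] 9 8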
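The two steps, dropping incomplete rows and then selecting a subgroup, can be tried on a small data set. The data.frame d below is invented for illustration and is not part of chol.

```r
# Made-up data with one missing age
d <- data.frame(id  = 1:5,
                sex = c("Nam", "Nu", "Nam", "Nu", "Nam"),
                age = c(57, 64, NA, 71, 66))

d.complete <- na.omit(d)    # drops the row whose age is NA
print(dim(d.complete))

# Men aged 60 or over among the complete rows
old.men <- subset(d.complete, age >= 60 & sex == "Nam")
print(old.men)
```

Here only the patient with id 5 satisfies both conditions, since the patient with id 1 is younger than 60 and the patient with id 3 has no recorded age.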

4.3 Extracting data from a data.frame

chol contains 8 variables. We can extract part of chol and keep only the variables we need, such as the identifier (id), age (age) and total cholesterol (tc). Note from names(chol) that id is column 1, age is column 3 and tc is column 7. We can use the following command:

> data2 <- chol[, c(1,3,7)]

Rows can be selected in the same way. For example, data3 holds the first 10 rows of the columns id, sex and tc:

> data3 <- chol[1:10, c(1,2,7)]
> print(data3)
   id sex  tc
1   1 Nam 4.0
2   2  Nu 3.5
3   3  Nu 4.7
4   4 Nam 7.7
5   5 Nam 5.0
6   6  Nu 4.2
7   7 Nam 5.9
8   8 Nam 6.1
9   9 Nam 5.9
10 10  Nu 4.0

Note that print(arg) simply lists all the data in the data.frame arg. In fact, typing data3 on its own gives exactly the same result as print(data3).

4.4 Combining two data.frames into one: merge

Suppose our data are held in two data.frames. The first, d1, has 3 columns, id, sex and tc:

id  sex  tc
1   Nam  4.0
2   Nu   3.5
3   Nu   4.7
4   Nam  7.7
5   Nam  5.0
6   Nu   4.2
7   Nam  5.9
8   Nam  6.1
9   Nam  5.9
10  Nu   4.0

The second, d2, has 3 columns, id, sex and tg:

id  sex  tg
1   Nam  1.1
2   Nu   2.1
3   Nu   0.8
4   Nam  1.1
5   Nam  2.1
6   Nu   1.5
7   Nam  2.6
8   Nam  1.5
9   Nam  5.4
10  Nu   1.9
11  Nu   1.7

The two data sets have the variables id and sex in common, but d1 has 10 rows and d2 has 11. We can combine them into a single data.frame with the command merge:

> d <- merge(d1, d2, by="id", all=TRUE)
> d
   id sex.x  tc sex.y  tg
1   1   Nam 4.0   Nam 1.1
2   2    Nu 3.5    Nu 2.1
3   3    Nu 4.7    Nu 0.8
4   4   Nam 7.7   Nam 1.1
5   5   Nam 5.0   Nam 2.1
6   6    Nu 4.2    Nu 1.5
7   7   Nam 5.9   Nam 2.6
8   8   Nam 6.1   Nam 1.5
9   9   Nam 5.9   Nam 5.4
10 10    Nu 4.0    Nu 1.9
11 11  <NA>  NA    Nu 1.7

In the merge command we ask R to combine d1 and d2 into a new data.frame named d, using the variable id as the key. Note that patient 11 has no value for tc, so R reports it as NA ("not available").
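When both data sets share more than one variable, merging on all of them avoids duplicated columns such as sex.x and sex.y. The sketch below uses only the first few rows of d1 and d2 and compares the default merge with all = TRUE; it is an illustration of the options rather than the command used in the text.

```r
d1 <- data.frame(id = 1:3, sex = c("Nam", "Nu", "Nu"),
                 tc = c(4.0, 3.5, 4.7))
d2 <- data.frame(id = 1:4, sex = c("Nam", "Nu", "Nu", "Nam"),
                 tg = c(1.1, 2.1, 0.8, 1.1))

# Default: keep only rows that appear in both data sets
inner <- merge(d1, d2, by = c("id", "sex"))

# all = TRUE: keep every row, filling the gaps with NA
outer <- merge(d1, d2, by = c("id", "sex"), all = TRUE)

print(inner)
print(outer)
```

With by = c("id", "sex") the result has a single sex column, and the fourth patient appears only in the all = TRUE version, with tc set to NA.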

4.5 Data coding

In processing epidemiological data, we often need to convert a continuous variable into a categorical one. In the diagnosis of osteoporosis, for example, women whose bone mineral density (BMD) T-score is equal to or below -2.5 are classified as having osteoporosis, those with BMD between -2.5 and -1.0 as having osteopenia, and those above -1.0 as normal. Suppose we have BMD data from 10 patients:

-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11

To enter these data into R we can use the function c, and then code the diagnosis as 1 (osteoporosis), 2 (osteopenia) or 3 (normal):

> bmd <- c(-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11)
> diagnosis <- bmd
> diagnosis[bmd <= -2.5] <- 1
> diagnosis[bmd > -2.5 & bmd <= -1.0] <- 2
> diagnosis[bmd > -1.0] <- 3
> data <- data.frame(bmd, diagnosis)
> data
     bmd diagnosis
1  -0.92         3
2   0.21         3
3   0.17         3
4  -3.21         1
5  -1.80         2
6  -2.60         1
7  -2.00         2
8   1.71         3
9   2.12         3
10 -2.11         2
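The same three-way coding can also be obtained in one step with cut and explicit break points. This is an alternative sketch, not the method used above; the English labels are chosen here for readability.

```r
bmd <- c(-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11)

# Intervals are closed on the right by default:
# (-Inf, -2.5], (-2.5, -1.0], (-1.0, Inf)
diagnosis <- cut(bmd,
                 breaks = c(-Inf, -2.5, -1.0, Inf),
                 labels = c("osteoporosis", "osteopenia", "normal"))

print(data.frame(bmd, diagnosis))
print(table(diagnosis))
```

Because the intervals are closed on the right, a value of exactly -2.5 falls in the osteoporosis group, matching the rule stated in the text.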

4.5.1 Recoding data with replace

Another way to recode data is to use replace, although it is somewhat more complicated. Continuing the example above, we convert bmd to diagnosis as follows:

> diagnosis <- bmd
> diagnosis <- replace(diagnosis, bmd <= -2.5, 1)
> diagnosis <- replace(diagnosis, bmd > -2.5 & bmd <= -1.0, 2)
> diagnosis <- replace(diagnosis, bmd > -1.0, 3)

A continuous variable can also be divided into groups with the function cut. For example, with a vector age holding the ages of 15 subjects:

> cut(age, 2)
 [1] (7.96,29.5] (7.96,29.5] (7.96,29.5] (29.5,51]   (7.96,29.5] (7.96,29.5] (7.96,29.5] (7.96,29.5]
 [9] (7.96,29.5] (29.5,51]   (7.96,29.5] (7.96,29.5] (7.96,29.5] (29.5,51]   (29.5,51]
Levels: (7.96,29.5] (29.5,51]

cut divides age into 2 groups: group 1 with ages from 7.96 to 29.5, and group 2 from 29.5 to 51. We can count the subjects in each age group with the function table:

> table(cut(age, 2))

(7.96,29.5]   (29.5,51]
         11           4

In the following command we divide age into 3 groups and name them low, medium and high:

> ageg <- cut(age, 3, labels=c("low", "medium", "high"))
> table(ageg)
ageg
   low medium   high
    10      2      3

Of course, we can also divide age into 4 groups (quartiles) by passing the probabilities 0, 0.25, 0.50, 0.75 and 1 to quantile:

> cut(age, breaks=quantile(age, c(0, 0.25, 0.50, 0.75, 1)), labels=c("q1", "q2", "q3", "q4"), include.lowest=TRUE)

4.7. Grouping data with cut2 (Hmisc)

The function cut above divides a variable according to its values, not according to the number of observations, so the groups do not contain equal numbers of observations. In statistical analysis, however, we sometimes need to divide a continuous variable into groups of equal or similar size, based on its distribution. For the variable bmd, for example, we can "cut" the values into 2 groups of equal size with the function cut2 (in the package Hmisc):

> # load the package Hmisc so that cut2 can be used
> library(Hmisc)
> bmd <- c(-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11)
> group <- cut2(bmd, g=2)
> table(group)
group
[-3.21,-0.92) [-0.92, 2.12]
            5             5

As the example shows, g=2 means dividing into 2 groups (g stands for group). R automatically forms group 1 from the bmd values from -3.21 up to -0.92, and group 2 from -0.92 to 2.12. Each group contains 5 values.

Of course, we can also divide the values into 3 groups:

> group <- cut2(bmd, g=3)
> table(group)
group
[-3.21,-1.80) [-1.80, 0.21) [ 0.21, 2.12]
            4             3             3


5 Using R for simple calculations and matrices

One of R's advantages is that it can be used like a pocket calculator. Beyond that, R can be used for matrix calculations and for programming. This chapter presents only a few simple calculations that students can try immediately while reading.

5.1 Simple calculations

Adding two or more numbers:

> 15+2997
[1] 3012

Addition and subtraction:

> 15+2997-9768
[1] -6756

Multiplication and division:

> -27*12/21
[1] -15.42857

Powers, $(25 - 5)^3$:

> (25 - 5)^3
[1] 8000

Square root, $\sqrt{10}$:

> sqrt(10)
[1] 3.162278

The number pi ($\pi$):

> pi
[1] 3.141593
> 2+3*pi
[1] 11.42478

Natural logarithm, $\log_e$:

> log(10)
[1] 2.302585

Base-10 logarithm, $\log_{10}$:

> log10(100)
[1] 2
> log10(2+3*pi)
[1] 1.057848

Exponential, $e^{2.7689}$:

> exp(2.7689)
[1] 15.94109

Trigonometric functions:

> cos(pi)
[1] -1

Vectors:

> x <- c(2,3,1,5,4,6,7,6,8)
> x
[1] 2 3 1 5 4 6 7 6 8
> sum(x)
[1] 42
> x*2
[1]  4  6  2 10  8 12 14 12 16
> exp(x/10)
[1] 1.221403 1.349859 1.105171 1.648721 1.491825 1.822119 2.013753 1.822119
[9] 2.225541
> exp(cos(x/10))
[1] 2.664634 2.599545 2.704736 2.405079 2.511954 2.282647 2.148655 2.282647
[9] 2.007132


Sum of squares, $1^2 + 2^2 + 3^2 + 4^2 + 5^2$:

> x <- c(1,2,3,4,5)
> sum(x^2)
[1] 55

Adjusted sum of squares, $\sum_{i=1}^{n} (x_i - \bar{x})^2$:

> x <- c(1,2,3,4,5)
> sum((x-mean(x))^2)
[1] 10

Here mean(x) is the mean of the vector x.

Mean square, $\sum_{i=1}^{n} (x_i - \bar{x})^2 / n$:

> x <- c(1,2,3,4,5)
> sum((x-mean(x))^2)/length(x)
[1] 2

Here length(x) is the number of elements in the vector x.

Variance and standard deviation. The variance is $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)$:

> x <- c(1,2,3,4,5)
> var(x)
[1] 2.5

The standard deviation is $\sqrt{s^2}$:

> sd(x)
[1] 1.581139
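The relation between these quantities can be checked numerically. The sketch below recomputes them from the adjusted sum of squares for the same vector x; the variable names ss and n are introduced here only for the illustration.

```r
x <- c(1, 2, 3, 4, 5)
n <- length(x)
ss <- sum((x - mean(x))^2)   # adjusted sum of squares

mean.square <- ss / n        # divides by n
variance <- ss / (n - 1)     # divides by n - 1, as var() does

print(c(ss, mean.square, variance, sqrt(variance)))
print(all.equal(variance, var(x)))
print(all.equal(sqrt(variance), sd(x)))
```

The only difference between the mean square and the variance computed by var is the divisor, n versus n - 1.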

5.2 Dates

In statistical analysis, dates can be a troublesome problem, because there are so many ways of writing them. The date 01/02/2003, for example, is sometimes written 1/2/2003, 01/02/03, 01FEB2003, 2003-02-01, and so on. There is in fact a standard way of writing dates, ISO 8601 (though very few people follow it!). According to this standard, we write:

2003-02-01

The reasoning behind this notation is that we write the largest unit first and work down to the smallest. With the number "123", for example, we know at once that it means "one hundred and twenty-three": hundreds first, then tens, and so on. This is also R's standard way of writing dates.

Dates stored with as.Date can be subtracted to give the number of days between them. Here date1 and date2 are two such dates, 28 days apart:

> days <- date2-date1
> days
Time difference of 28 days

We can also generate a sequence of dates:

> seq(as.Date("2005-01-01"), as.Date("2005-12-31"), by="month")
 [1] "2005-01-01" "2005-02-01" "2005-03-01" "2005-04-01" "2005-05-01"
 [6] "2005-06-01" "2005-07-01" "2005-08-01" "2005-09-01" "2005-10-01"
[11] "2005-11-01" "2005-12-01"
> seq(as.Date("2005-01-01"), as.Date("2005-12-31"), by="2 weeks")
 [1] "2005-01-01" "2005-01-15" "2005-01-29" "2005-02-12" "2005-02-26"
 [6] "2005-03-12" "2005-03-26" "2005-04-09" "2005-04-23" "2005-05-07"
[11] "2005-05-21" "2005-06-04" "2005-06-18" "2005-07-02" "2005-07-16"
[16] "2005-07-30" "2005-08-13" "2005-08-27" "2005-09-10" "2005-09-24"
[21] "2005-10-08" "2005-10-22" "2005-11-05" "2005-11-19" "2005-12-03"
[26] "2005-12-17" "2005-12-31"
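Dates written in other layouts can be read by telling as.Date which format to expect. In the sketch below the two dates are hypothetical, chosen so that they lie 28 days apart; they are not necessarily the dates used in the book.

```r
# Hypothetical dates: one in day/month/year layout, one in ISO 8601
date1 <- as.Date("01/02/2006", format = "%d/%m/%Y")
date2 <- as.Date("2006-03-01")

days <- date2 - date1
print(date1)              # stored and printed in ISO 8601 form
print(days)
print(as.numeric(days))   # the difference as a plain number
```

The format string describes how the text is laid out: %d is the day, %m the month and %Y the four-digit year.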

5.3 Generating sequences with seq, rep and gl

R can also generate sequences of numbers, which is very convenient for simulation and experimental design. The usual functions are seq (sequence), rep (repetition) and gl (generating levels).

Using seq

Create a vector of the numbers from 1 to 12:

> x <- (1:12)
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
> seq(12)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12

Create a vector of the numbers from 12 down to 5:

> x <- (12:5)
> x
[1] 12 11 10  9  8  7  6  5
> seq(12,7)
[1] 12 11 10  9  8  7

The general form of seq is seq(from, to, by= ) or seq(from, to, length.out= ). Its use is illustrated by the following examples.

Create a vector of numbers from 4 to 6 in steps of 0.25:

> seq(4, 6, 0.25)
[1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00

Create a vector of 10 numbers, the smallest being 2 and the largest 15:

> seq(length=10, from=2, to=15)
 [1]  2.000000  3.444444  4.888889  6.333333  7.777778  9.222222 10.666667
 [8] 12.111111 13.555556 15.000000

Using rep

The form of rep is rep(x, times, ...), where x is a value or vector and times is the number of repetitions. For example:

Repeat the number 10, 3 times:

> rep(10, 3)
[1] 10 10 10

Repeat the numbers 1 to 4, 3 times:

> rep(c(1:4), 3)
 [1] 1 2 3 4 1 2 3 4 1 2 3 4

Repeat the numbers 1.2, 2.7, 4.8, 5 times:

> rep(c(1.2, 2.7, 4.8), 5)
 [1] 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8


Using gl

gl is used to create a categorical variable (a factor), that is, a variable that is counted rather than used in calculations. The general form of gl is gl(n, k, length = n*k, labels = 1:n, ordered = FALSE), and its use is illustrated by the following examples.

Create a variable with levels 1 and 2, each repeated 8 times:

> gl(2, 8)
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
Levels: 1 2

Or a variable with levels 1, 2 and 3, each repeated 5 times:

> gl(3, 5)
 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3

Create a variable with levels 1 and 2, each repeated 10 times (because length=20):

> gl(2, 10, length=20)
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
Levels: 1 2

Or:

> gl(2, 2, length=20)
 [1] 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
Levels: 1 2

Adding labels:

> gl(2, 5, labels=c("C", "T"))
 [1] C C C C C T T T T T
Levels: C T

Create a variable with the 4 values 1, 2, 3, 4, each repeated 2 times:

> rep(1:4, c(2,2,2,2))
[1] 1 1 2 2 3 3 4 4

This is equivalent to:

> rep(1:4, each = 2)
[1] 1 1 2 2 3 3 4 4

With dates and times:

> x <- .leap.seconds[1:3]
> rep(x, 2)
[1] "1972-06-30 17:00:00 Pacific Standard Time" "1972-12-31 16:00:00 Pacific Standard Time"
[3] "1973-12-31 16:00:00 Pacific Standard Time" "1972-06-30 17:00:00 Pacific Standard Time"
[5] "1972-12-31 16:00:00 Pacific Standard Time" "1973-12-31 16:00:00 Pacific Standard Time"
> rep(as.POSIXlt(x), rep(2, 3))
[1] "1972-06-30 17:00:00 Pacific Standard Time" "1972-06-30 17:00:00 Pacific Standard Time"
[3] "1972-12-31 16:00:00 Pacific Standard Time" "1972-12-31 16:00:00 Pacific Standard Time"
[5] "1973-12-31 16:00:00 Pacific Standard Time" "1973-12-31 16:00:00 Pacific Standard Time"
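As an illustration of why gl is handy for experimental design, the sketch below lays out a hypothetical experiment with two treatments (C and T) applied in each of five blocks. The design itself is invented for the example.

```r
# Two treatments, five units each
treatment <- gl(2, 5, labels = c("C", "T"))

# Blocks 1 to 5, cycled so that each treatment meets each block once
block <- gl(5, 1, length = 10)

design <- data.frame(treatment, block)
print(design)

# Every treatment-by-block combination should occur exactly once
counts <- table(design$treatment, design$block)
print(counts)
```

The cross-tabulation is a quick check that the layout is balanced before any data are collected.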

5.4 Using R for matrix calculations

As we know, a matrix consists, simply put, of rows and columns. When we write A[m, n], we mean that the matrix A has m rows and n columns. The same can be expressed in R. Suppose, for example, that we want to create a square matrix A with 3 rows and 3 columns, whose elements are 1, 2, 3, 4, 5, 6, 7, 8, 9:

$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}$

In R:

> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

But if we type:

> A <- matrix(y, nrow=3, byrow=TRUE)
> A

the result is:

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

that is, the transposed matrix. Another way of creating a transposed matrix is to use t(). For example:

> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

and B = A' can be written in R as:

> B <- t(A)
> B
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

An identity matrix (a scalar matrix whose diagonal elements are all 1) is a square matrix, that is, one with as many rows as columns, in which all off-diagonal elements are 0 and all diagonal elements are 1. We can create such a matrix in R as follows:

> # create a 3 x 3 matrix with all elements equal to 0
> A <- matrix(0, 3, 3)
> # set the diagonal elements to 1
> diag(A) <- 1
> diag(A)
[1] 1 1 1
> # the matrix A is now:
> A
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

5.4.1 Extracting elements from a matrix

> y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
> A <- matrix(y, nrow=3)
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

> # column 1 of matrix A
> A[,1]
[1] 1 2 3

> # column 3 of matrix A
> A[,3]
[1] 7 8 9

> # row 1 of matrix A
> A[1,]
[1] 1 4 7

> # row 2, column 3 of matrix A
> A[2,3]
[1] 8

> # all rows of matrix A except row 2
> A[-2,]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    3    6    9

> # all columns of matrix A except column 1
> A[,-1]
     [,1] [,2]
[1,]    4    7
[2,]    5    8
[3,]    6    9

> # which elements are greater than 3
> A>3
      [,1] [,2] [,3]
[1,] FALSE TRUE TRUE
[2,] FALSE TRUE TRUE
[3,] FALSE TRUE TRUE
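The logical matrix produced by A>3 can itself be used as an index. The following sketch (an illustration added here, not taken from the book) shows how to pull out the matching values and their positions for the same 3 x 3 matrix.

```r
A <- matrix(1:9, nrow = 3)       # same matrix as in the text, filled column by column

big <- A[A > 3]                  # values greater than 3, in column-major order
pos <- which(A > 3, arr.ind = TRUE)  # their (row, col) positions

big
pos
sum(A > 3)                       # how many elements exceed 3
```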

5.4.2 Matrix arithmetic

Adding and subtracting two matrices. Consider two matrices A and B:

> A <- matrix(1:12, 3, 4)
> A
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> B <- -A
> B
     [,1] [,2] [,3] [,4]
[1,]   -1   -4   -7  -10
[2,]   -2   -5   -8  -11
[3,]   -3   -6   -9  -12

We can compute A+B:

> C <- A + B
> C
     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    0    0    0    0
[3,]    0    0    0    0

Or A-B:

> D <- A - B
> D
     [,1] [,2] [,3] [,4]
[1,]    2    8   14   20
[2,]    4   10   16   22
[3,]    6   12   18   24

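Multiplying two matrices. Consider two matrices:

$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$$

We want to compute AB, which can be done in R with the %*% operator:

> y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
> A <- matrix(y, nrow=3)
> B <- t(A)
> AB <- A %*% B
> AB
     [,1] [,2] [,3]
[1,]   66   78   90
[2,]   78   93  108
[3,]   90  108  126

Matrix inverse and determinant. A square matrix can be inverted only if its determinant is not 0. For example, the matrix E below cannot be inverted because its determinant is 0 (depending on the R version, det(E) may print as a tiny number such as 6.66e-16 instead of 0 because of floating-point rounding):

> E <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow=3)
> E
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> det(E)
[1] 0

But the following matrix F can be inverted:

> F <- matrix(c(1, 4, 9, 16, 25, 36, 49, 64, 81), nrow=3)
> F
     [,1] [,2] [,3]
[1,]    1   16   49
[2,]    4   25   64
[3,]    9   36   81
> det(F)
[1] -216

The inverse of F, $F^{-1}$, can be computed with the function solve() as follows:

> solve(F)
          [,1]      [,2]       [,3]
[1,]  1.291667 -2.166667  0.9305556
[2,] -1.166667  1.666667 -0.6111111
[3,]  0.375000 -0.500000  0.1805556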
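One way to convince oneself that solve() really returns the inverse is to multiply the result back by the original matrix. The sketch below does this; the name Fm is a hypothetical stand-in for the book's F (chosen only to avoid masking R's shorthand F for FALSE).

```r
# Fm: the squares of 1..9 arranged column by column, as matrix F in the text
Fm <- matrix((1:9)^2, nrow = 3)

Finv  <- solve(Fm)        # inverse of Fm
check <- Fm %*% Finv      # should be the 3 x 3 identity, up to rounding error

round(check, 10)
all.equal(check, diag(3)) # TRUE when the product matches the identity matrix
```

Comparing with all.equal() rather than == allows for the small floating-point errors that matrix inversion always introduces.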

Beyond these simple operations, R can also be used for more complex computations. A notable advantage of R is that it lets users write their own procedures suited to each specific problem. We will return to this topic in more detail in later chapters. R also has a package, Matrix, designed specifically for matrix computations. Readers can download the package, install it, and use it if needed. The download address is: http://cran.au.r-project.org/bin/windows/contrib/r-release/Matrix_0.995-8.zip together with the user manual (about 80 pages): http://cran.au.r-project.org/doc/packages/Matrix.pdf

6 Probability calculations and simulation

Probability is the foundation of statistical analysis. All methods of data analysis and statistical inference rest on probability theory. Probability theory is concerned with describing the distribution of a random variable. In practice, "describing" often simply means counting the cases or possibilities that can occur for one or more variables. For example, if we randomly select 2 subjects, and these subjects can be classified by two characteristics such as sex and preference, the question is how many combinations of the two characteristics there are in total. For a continuous variable such as blood pressure, describing means computing summary statistics such as the mean, median, variance, standard deviation, and so on. From these descriptive statistics, probability theory provides models for setting up distribution functions for the variables. This chapter covers two main areas: counting methods and distribution functions.

6.1 Counting methods

6.1.1 Permutations

By definition, a permutation of n elements is an arrangement of the n elements in a specific order. The definition is rather abstract, so a concrete example will make it clearer. Imagine an emergency centre with 3 doctors (x, y and z) and 3 patients (a, b and c) waiting to be seen. Any of the three doctors can see any of the patients a, b or c. The question is: how many doctor-patient arrangements are there? To answer, consider the following:

Doctor x has 3 choices: patient a, b or c;
Once doctor x has chosen a patient, doctor y has the two remaining choices;
Finally, once the other 2 doctors have chosen, doctor z has only 1 choice.

In total there are 3 x 2 x 1 = 6 arrangements.

As another example, at a dinner party with 6 guests, how many ways are there to seat them at a table with 6 chairs? By the same reasoning, the answer is 6.5.4.3.2.1 = 720 ways. (Note that the dot "." denotes multiplication.) This is the counting of permutations.

We know that 3! = 3.2.1 = 6, and that 0! = 1. In general, the factorial of a number n is $n! = n(n-1)(n-2)(n-3) \cdots 1$. In R this is easily computed with the prod() function, as follows:

Find 3!

> prod(3:1)
[1] 6

Find 10!

> prod(10:1)
[1] 3628800

Find 10.9.8.7.6.5.4

> prod(10:4)
[1] 604800

Find (10.9.8.7.6.5.4) / (40.39.38.37.36)

> prod(10:4) / prod(40:36)
[1] 0.007659481
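Base R also provides factorial(), which gives the same results as the prod() calls above. The short sketch below (an added illustration) checks the two against each other and writes the partial product 10.9.8.7.6.5.4 as a ratio of factorials.

```r
n <- 10

via_prod      <- prod(n:1)       # factorial computed as in the text
via_factorial <- factorial(n)    # base R equivalent

# Partial product 10 * 9 * ... * 4 written as 10! / 3!
partial <- factorial(10) / factorial(3)

c(via_prod, via_factorial, partial)
factorial(0)                     # 0! is defined as 1
```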

6.1.2 Combinations

A combination of k elements chosen from n is any subset of k elements of a set of n elements. A concrete example helps here: 3 people (call them A, B and C) stand for 2 positions, chairman and vice-chairman. How many ways are there to fill the 2 positions from the 3 people? We can picture 2 chairs to be filled from 3 people:

Choice          1  2  3  4  5  6
Chairman        A  B  A  C  B  C
Vice-chairman   B  A  C  A  C  B

So there are 6 ordered choices. Note, however, that choices 1 and 2 involve the same pair of people, so we count them as 1 (not 2). Likewise, 3 and 4, and 5 and 6, each count as 1 pair. In total, there are 3 ways to choose 2 people out of 3 for the 2 positions. This number is called a combination. It can be computed with the following formula:

$$\binom{3}{2} = \frac{3!}{2!\,(3-2)!} = \frac{6}{2} = 3.$$

In general, the number of ways to choose k people from n people is:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

This is sometimes written $C_n^k$ instead of $\binom{n}{k}$. In R the computation is done with the function choose(n, k). A few examples:

Find $\binom{5}{2}$:

> choose(5, 2)
[1] 10

Find the probability that the pair A and B, out of 5 people, is elected to the two positions:

> 1/choose(5, 2)
[1] 0.1
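The formula for combinations can be checked directly against choose(). The sketch below (an added illustration) also uses combn(), a base R function that lists the subsets themselves, to enumerate the three pairs from the chairman example.

```r
n <- 5; k <- 2

by_choose  <- choose(n, k)                                       # built-in
by_formula <- factorial(n) / (factorial(k) * factorial(n - k))   # n! / (k!(n-k)!)

# Enumerate all unordered pairs from the candidates A, B, C
pairs <- combn(c("A", "B", "C"), 2)

c(by_choose, by_formula)
pairs          # one column per pair
ncol(pairs)    # number of pairs
```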

6.2 Random variables and distribution functions

Most statistical analysis relies on probability distributions for inference. If we randomly select 10 students in a class and record their height and sex, we may obtain data such as the following:

ID   Sex      Height (cm)
1    Female   156
2    Female   160
3    Male     175
4    Female   145
5    Female   165
6    Female   158
7    Male     170
8    Male     167
9    Female   178
10   Male     155

Pooling the data, we have 6 girls and 4 boys. In percentages, 60% are female and 40% are male. In the language of probability, the probability of female is 0.6 and of male is 0.4.

As for height, the mean is 162.9 cm, with a minimum of 145 cm and a maximum of 178 cm.

Distribution: density; cumulative; quantile; simulation
Normal: dnorm(x, mean, sd); pnorm(q, mean, sd); qnorm(p, mean, sd); rnorm(n, mean, sd)
Binomial: dbinom(x, size, prob); pbinom(q, size, prob); qbinom(p, size, prob); rbinom(n, size, prob)
Poisson: dpois(x, lambda); ppois(q, lambda); qpois(p, lambda); rpois(n, lambda)
Uniform: dunif(x, min, max); punif(q, min, max); qunif(p, min, max); runif(n, min, max)
Negative binomial: dnbinom(x, size, prob); pnbinom(q, size, prob); qnbinom(p, size, prob); rnbinom(n, size, prob)
Beta: dbeta(x, shape1, shape2); pbeta(q, shape1, shape2); qbeta(p, shape1, shape2); rbeta(n, shape1, shape2)
Gamma: dgamma(x, shape, rate, scale); pgamma(q, shape, rate, scale); qgamma(p, shape, rate, scale); rgamma(n, shape, rate, scale)
Geometric: dgeom(x, prob); pgeom(q, prob); qgeom(p, prob); rgeom(n, prob)

Distribution: density; cumulative; quantile; simulation
Exponential: dexp(x, rate); pexp(q, rate); qexp(p, rate); rexp(n, rate)
Weibull: dweibull(x, shape, scale); pweibull(q, shape, scale); qweibull(p, shape, scale); rweibull(n, shape, scale)
Cauchy: dcauchy(x, location, scale); pcauchy(q, location, scale); qcauchy(p, location, scale); rcauchy(n, location, scale)
F: df(x, df1, df2); pf(q, df1, df2); qf(p, df1, df2); rf(n, df1, df2)
t: dt(x, df); pt(q, df); qt(p, df); rt(n, df)
Chi-squared: dchisq(x, df); pchisq(q, df); qchisq(p, df); rchisq(n, df)

Note: in the tables above, df = degrees of freedom; prob = probability; n = sample size (the number of values to generate). For the binomial functions, size is the number of trials (written n in the formulas of this chapter) and prob is the probability of success (p); for the negative binomial, size is the target number of successes. The other parameters can be looked up for each distribution. The F, t and chi-squared distributions also take one further parameter, the non-centrality parameter (ncp), which defaults to 0. Users can supply another suitable value if needed.

In the language of probability and statistics, sex and height are two random variables. They are random because we cannot predict their values exactly; we can only anticipate their central value, their mean, and their variability. The variable sex has only two values (male or female) and is called a discrete variable, or a categorical variable. Height, on the other hand, can take any value from low to high, and is therefore called a continuous variable. A distribution refers to the values that a variable can take. Distribution functions are functions that describe variables systematically, where "systematically" means according to a specific mathematical model with given parameters. There are many distribution functions in probability and statistics; here we look at some of the most important and most commonly used: the binomial, Poisson and normal distributions. For each distribution there are 4 kinds of function that we need to know:

the probability density function;
the cumulative probability distribution function;
the quantile function; and
the simulation (random number generation) function.

R has built-in functions for all of these. Each function name consists of a prefix indicating the kind of function, followed by an abbreviation of the distribution name. The prefixes are d (density), p (cumulative probability), q (quantile), and r (random numbers). The abbreviations are norm (normal distribution), binom (binomial distribution), pois (Poisson distribution), and so on. The two tables above summarize the functions and the parameters of each.
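To make the naming scheme concrete, the sketch below calls all four prefixes for the normal distribution. The parameter values (mean 156, standard deviation 4.6) are borrowed from the height example used later in this chapter; the seed is arbitrary and only makes the random draws reproducible.

```r
mu <- 156; sigma <- 4.6            # illustrative parameter values

d160  <- dnorm(160, mu, sigma)     # d: density at x = 160
p160  <- pnorm(160, mu, sigma)     # p: cumulative probability P(X <= 160)
q50   <- qnorm(0.5, mu, sigma)     # q: the 50% quantile (the median)
set.seed(1)                        # arbitrary seed, for reproducibility only
draws <- rnorm(3, mu, sigma)       # r: three random values

# q is the inverse of p: feeding p160 back into qnorm recovers 160
back <- qnorm(p160, mu, sigma)

c(d160, p160, q50, back)
draws
```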

6.3 Probability distribution functions

6.3.1 Binomial distribution

As its name suggests, a binomial variable has only two values: male / female, alive / dead, yes / no, and so on. The binomial distribution is stated by the following theorem: if an experiment is carried out n times, each time resulting in either success or failure, with a known probability of success p, then the probability that k of the experiments succeed is:

$$P(k \mid n, p) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, 2, \ldots, n.$$

To understand the theorem better, we will go through a few examples.

Example 1: Binomial probability function. In the example above, the class has 10 students, 6 of whom are female. If 3 students are selected at random, what is the probability that 2 of them are female? We can answer this question more or less by hand by listing all the cases that can occur, treating the three selections as independent, each with probability 0.6 of picking a female. Each selection has 2 possible outcomes (male or female), and with 3 selections we have $2^3 = 8$ cases, as follows.

Student 1   Student 2   Student 3   Probability
Male        Male        Male        (0.4)(0.4)(0.4) = 0.064
Male        Male        Female      (0.4)(0.4)(0.6) = 0.096
Male        Female      Male        (0.4)(0.6)(0.4) = 0.096
Male        Female      Female      (0.4)(0.6)(0.6) = 0.144
Female      Male        Male        (0.6)(0.4)(0.4) = 0.096
Female      Male        Female      (0.6)(0.4)(0.6) = 0.144
Female      Female      Male        (0.6)(0.6)(0.4) = 0.144
Female      Female      Female      (0.6)(0.6)(0.6) = 0.216
All cases                           1.000

We know that 6 of the 10 students are female, so the probability of female is 0.60. (In other words, the probability of selecting a male is 0.4.) Hence the probability that all 3 selected students are male is 0.4 x 0.4 x 0.4 = 0.064. In the table above, there are 3 cases with exactly 2 females: Male-Female-Female, Female-Female-Male and Female-Male-Female, each with probability 0.144. Therefore the probability of selecting exactly 2 females among the 3 students is 3 x 0.144 = 0.432. In R, the function dbinom(k, n, p) evaluates the formula

$$P(k \mid n, p) = \binom{n}{k} p^k (1-p)^{n-k}$$

quickly. In the case above, we simply enter:

> dbinom(2, 3, 0.60)
[1] 0.432
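The hand enumeration in Example 1 can be reproduced in a few lines. The sketch below (an added illustration; the names cases, females and prob are hypothetical) builds all 8 sequences, computes the probability of each, and compares the total for "exactly 2 females" with dbinom().

```r
p <- 0.6                                   # probability of selecting a female

# All 2^3 sequences: 1 = female, 0 = male
cases   <- expand.grid(s1 = 0:1, s2 = 0:1, s3 = 0:1)
females <- rowSums(cases)                  # number of females in each sequence
prob    <- p^females * (1 - p)^(3 - females)

total       <- sum(prob)                   # probabilities of all cases add to 1
two_females <- sum(prob[females == 2])     # three sequences of 0.144 each

c(total, two_females, dbinom(2, 3, p))
```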

Example 2: Cumulative binomial probability distribution. The probability that an osteoporosis drug is effective is about 70% (that is, p = 0.70). If we treat 10 patients, what is the probability that at least 8 patients have a positive outcome? In other words, if X is the number of patients treated successfully, we need to find P(X ≥ 8). To answer this question we use the function pbinom(k, n, p). Recall that pbinom(k, n, p) gives P(X ≤ k). Hence P(X ≥ 8) = 1 − P(X ≤ 7), and the answer in R is:

> 1-pbinom(7, 10, 0.70)
[1] 0.3827828

Example 3: Simulating the binomial distribution. Suppose that about 20% of people in a population have hypertension. If we draw 1000 samples, each of 20 people selected at random from the population, what will the distribution of the number of hypertensive patients look like? To answer this question we can use the function rbinom(n, k, p) in R with the following parameters:

> b <- rbinom(1000, 20, 0.20)
> table(b)
b
  0   1   2   3   4   5   6   7   8   9  10
  6  45 147 192 229 169 105  68  23  13   3

The first row (0, 1, 2, ..., 10) is the number of hypertensive patients among the 20 people selected. The second row gives the number of samples, out of 1000, in which that count occurred. Thus 6 samples contained no hypertensive patients, 45 samples contained only 1 hypertensive patient, and so on. Perhaps an easier way to read these frequencies is to plot them with hist:

> hist(b, main="Number of hypertensive patients")

In this command b is the variable holding the number of hypertensive patients. The result is a histogram of the number of hypertensive patients (see Figure 1). From the chart we see that 4 hypertensive patients (in each sample of 20 people) is the most likely outcome (22.9%). This is understandable: because the prevalence of hypertension is 20%, we expect on average 4 of the 20 selected people to be hypertensive. However, the important point shown by the chart is that we sometimes observe as many as 10 hypertensive patients, even though the probability of such a sample is very low (only 3/1000).

[Figure: histogram titled "Number of hypertensive patients"; x-axis b (0 to 10), y-axis Frequency (0 to 200).]

Figure 1. Distribution of the number of hypertensive patients among 20 people selected at random from a population with 20% hypertensive patients, with the sampling repeated 1000 times.

Example 4: Applying the binomial distribution. Twenty customers were invited to taste two beers, A and B, and were asked which one they preferred. The result showed that 16 people preferred beer A. The question is whether this result allows us to conclude that beer A is preferred by more people than beer B, or whether it is merely due to chance. We start by assuming that, if there is no difference, the probability of preferring beer A is p=0.50 and of preferring beer B is q=0.5. If this assumption is true, what is the probability of observing 16 or more of the 20 people preferring beer A? This is easily computed in R:

> 1 - pbinom(15, 20, 0.5)
[1] 0.005908966

The answer is a probability of about 0.006, or 0.6%. In other words, if the two beers really were alike, the probability that 16 or more of the 20 people prefer beer A would be only about 0.6%. That is, we have evidence that beer A really is preferred by more people than beer B, rather than the result being due to chance. Note that we use 15 (instead of 16) because P(X ≥ 16) = 1 − P(X ≤ 15), and in the present case P(X ≤ 15) = pbinom(15, 20, 0.5).
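The tail probability in Example 4 can be obtained in several equivalent ways, which is a useful check on the "15 rather than 16" reasoning. The sketch below (an added illustration; the variable names are hypothetical) compares three of them.

```r
n <- 20; p0 <- 0.5                         # 20 tasters, no-preference hypothesis

upper_a <- 1 - pbinom(15, n, p0)                 # as in the text
upper_b <- sum(dbinom(16:n, n, p0))              # add up P(X = 16), ..., P(X = 20)
upper_c <- pbinom(15, n, p0, lower.tail = FALSE) # upper tail requested directly

c(upper_a, upper_b, upper_c)
```

The lower.tail = FALSE form avoids the subtraction from 1, which matters numerically only when the tail probability is extremely small.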

6.3.2 Poisson distribution

The Poisson distribution is, in general, very similar to the binomial distribution, except that the parameter p is usually very small and n is usually very large. For this reason the Poisson distribution is often used to describe events that occur very rarely (such as the number of people with cancer in a population). It is also widely and successfully applied in engineering and market research, for example to the number of customers arriving at a restaurant each hour.

Example 5: Poisson probability function. From several months of observation, the typing error rate of a typist is known: on average, the typist makes 1 error in about 2,000 words. What is the probability that the typist makes 2 errors, or more than 2 errors? Because the frequency is quite low, we can assume that the number of typing errors (call it X) is a random variable following the Poisson distribution. Here the mean number of errors is 1 (λ = 1). The Poisson distribution states that the probability that X = k, given a mean rate λ, is:

$$P(X = k \mid \lambda) = \frac{e^{-\lambda} \lambda^k}{k!}$$

Hence the answer to the question above is:

$$P(X = 2 \mid \lambda = 1) = \frac{e^{-1}\, 1^2}{2!} = 0.1839.$$

This answer can be computed more quickly in R with the function dpois:

> dpois(2, 1)
[1] 0.1839397

We can also compute the probability of 1 error:

> dpois(1, 1)
[1] 0.3678794

And the probability of no errors:

> dpois(0, 1)
[1] 0.3678794

Note that in the function above we simply supply the parameters k = 2 and λ = 1. This gives the probability that the typist makes exactly 2 errors. The probability that the typist makes more than 2 errors (that is, 3, 4, 5, ... errors) can be computed as:

$$P(X > 2) = P(X = 3) + P(X = 4) + P(X = 5) + \cdots = 1 - P(X \le 2) = 1 - 0.3678 - 0.3678 - 0.1839 = 0.08$$

In R, we can compute this as follows:

> # P(X <= 2)
> ppois(2, 1)
[1] 0.9196986

> # 1 - P(X <= 2)
> 1-ppois(2, 1)
[1] 0.0803014
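The remark that the Poisson distribution resembles a binomial with small p and large n can be checked numerically on the typing example. The sketch below is an added illustration that assumes each of the 2,000 words is an independent trial with error probability 1/2000; that modelling assumption is not stated in the original text.

```r
n <- 2000; p <- 1 / 2000        # assumed: 2,000 independent words, 1 error expected
lambda <- n * p                 # Poisson mean, equal to 1

k <- 0:5
comparison <- cbind(k,
                    binomial = dbinom(k, n, p),
                    poisson  = dpois(k, lambda))
comparison                      # the two columns agree to about 4 decimal places

more_than_2 <- ppois(2, lambda, lower.tail = FALSE)   # P(X > 2)
more_than_2
```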

6.3.3 Normal distribution

The two distributions we have just examined belong to the group of discrete distributions, in which the variable takes countable or categorical values. For continuous variables there are several other suitable distributions, the most important of which is the normal distribution. The normal distribution is the most important foundation of statistical analysis; it is fair to say that most statistical theory is built on it. The normal density function has two parameters: the mean μ and the variance σ² (or the standard deviation σ). Let X be a variable (such as height); the normal density function states that the density at X = x is:

$$P(X = x \mid \mu, \sigma^2) = f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$$

Example 6: Normal probability density function. The current mean height of Vietnamese women is 156 cm, with a standard deviation of 4.6 cm. It is also known that height follows a normal distribution. With the two parameters μ=156 and σ=4.6, we can construct a distribution of height for the whole population of Vietnamese women, and this function has the following shape:

[Figure: curve titled "Probability distribution of height in Vietnamese women"; x-axis Height (130 to 200), y-axis f(height) (0.00 to 0.08).]

Figure 2. Distribution of height in Vietnamese women with mean 156 cm and standard deviation 4.6 cm. The horizontal axis is height and the vertical axis is the probability density at each height.

The chart above was drawn with the following two commands. The first command creates a variable height with values 130, 131, 132, ..., 200 cm. The second command draws the chart for a mean of 156 cm and a standard deviation of 4.6 cm.

> height <- seq(130, 200, 1)
> plot(height, dnorm(height, 156, 4.6), type="l", ylab="f(height)", xlab="Height", main="Probability distribution of height in Vietnamese women")

With the two parameters above (and the chart), we can estimate the probability density for any height. For example, for a Vietnamese woman with a height of 160 cm:

$$P(X = 160 \mid \mu = 156, \sigma = 4.6) = \frac{1}{4.6\sqrt{2 \times 3.1416}} \exp\left[-\frac{(160-156)^2}{2(4.6)^2}\right] = 0.0594$$

The function dnorm(x, mean, sd) in R computes this value for us concisely:

> dnorm(160, mean=156, sd=4.6)
[1] 0.05942343

Cumulative normal probability function. Because height is a continuous variable, in practice we rarely want the probability of one specific value x; we usually want the probability of a range of values from a to b. For example, we may want to know the probability that height is between 150 and 160 cm, that is P(150 ≤ X ≤ 160), or the probability that height is below 145 cm, that is P(X < 145). To answer such questions we need the cumulative normal probability function, defined as follows:

$$P(a \le X \le b) = \int_a^b f(x)\,dx$$

Thus P(150 ≤ X ≤ 160) is the area under the curve of Figure 2 between 150 and 160 on the horizontal axis. R has the function pnorm(x, mean, sd), which computes the cumulative probability of a normal distribution and is very useful:

$$\text{pnorm}(a, \text{mean}, \text{sd}) = \int_{-\infty}^{a} f(x)\,dx = P(X \le a \mid \text{mean}, \text{sd})$$

For example, the probability that a Vietnamese woman's height is 150 cm or less is 9.6%:

> pnorm(150, 156, 4.6)
[1] 0.0960575

Or the probability that a Vietnamese woman's height is 164 cm or more:

> 1-pnorm(164, 156, 4.6)
[1] 0.04100591
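The interval probability P(150 ≤ X ≤ 160) posed above is the difference of two cumulative probabilities. The sketch below (an added illustration with hypothetical variable names) computes it with pnorm() and confirms it by numerically integrating the density with base R's integrate().

```r
mu <- 156; sigma <- 4.6

# P(150 <= X <= 160) as a difference of cumulative probabilities
p_interval <- pnorm(160, mu, sigma) - pnorm(150, mu, sigma)

# The same area obtained by integrating the density between 150 and 160
p_numeric <- integrate(dnorm, lower = 150, upper = 160,
                       mean = mu, sd = sigma)$value

# P(X < 145), the other question raised in the text
p_below_145 <- pnorm(145, mu, sigma)

c(p_interval, p_numeric, p_below_145)
```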

In other words, only about 4.1% of Vietnamese women are 164 cm or taller.

Example 7: Applying the normal distribution. In a population, we know that the mean blood pressure is 100 mmHg and the standard deviation is 13 mmHg. What proportion of people in this population have a blood pressure of 120 mmHg or higher? The answer in R is:

> 1-pnorm(120, mean=100, sd=13)
[1] 0.0619679

That is, about 6.2% of people in this population have a blood pressure of 120 mmHg or higher.

6.3.4 Standardized normal distribution

A variable X following a normal distribution with mean μ and variance σ² is usually abbreviated as X ~ N(μ, σ²). Here μ and σ² depend on the unit of measurement of the variable. For example, height is measured in cm (or m), blood pressure in mmHg, age in years, and so on, so describing a variable in its original unit sometimes makes comparison difficult. A simpler approach is to standardize X so that the mean is 0 and the variance is 1. After a little algebra, it can be shown that the transformation of X satisfying this condition is:

$$Z = \frac{X - \mu}{\sigma}$$

In mathematical terms: if X ~ N(μ, σ²), then (X − μ)/σ ~ N(0, 1). By this formula, Z is in essence the difference between a value and the mean, expressed in numbers of standard deviations. If Z = 0, X equals the mean μ. If Z = -1, X is exactly 1 standard deviation below the mean. Likewise, if Z = 2.5, X is exactly 2.5 standard deviations above the mean, and so on. The distribution of height in Vietnamese women can be described in this new unit, the z score, as follows:

[Figure: curve titled "Probability distribution of height in Vietnamese women"; x-axis z (-4 to 4), y-axis f(z) (0.0 to 0.4).]

Figure 3. Standardized distribution of height in Vietnamese women.

Figure 3 was drawn with the following two commands:

> height <- seq(-4, 4, 0.1)
> plot(height, dnorm(height, 0, 1), type="l", ylab="f(z)", xlab="z", main="Probability distribution of height in Vietnamese women")

The standardized normal distribution has the convenience that it can be used to describe and compare the distribution of any variable, because everything is converted to a z score. In the chart above, the vertical axis is the density of z and the horizontal axis is the variable z. We can use R to compute the probability that z is less than some constant. For example, we want to find P(z ≤ -1.96) for a distribution with mean 0 and standard deviation 1:

> pnorm(-1.96, mean=0, sd=1)
[1] 0.02499790

Or P(z ≤ 1.96):

> pnorm(1.96, mean=0, sd=1)
[1] 0.9750021

Hence P(-1.96 < z < 1.96) is:

> pnorm(1.96) - pnorm(-1.96)
[1] 0.9500042

In other words, there is a 95% probability that z lies between -1.96 and 1.96. (Note that in the command above we did not supply mean=0, sd=1, because the default value of pnorm's mean parameter is 0 and of sd is 1.)

Example 6 (continued). Recall that the mean height of Vietnamese women is 156 cm and the standard deviation is 4.6 cm. A woman who is 170 cm tall is therefore z = (170 − 156) / 4.6 = 3.04 standard deviations above the mean, and the proportion of Vietnamese women taller than 170 cm is very low, only about 0.1%.

> 1-pnorm(3.04)
[1] 0.001182891

Finding a quantile of a normal distribution. Sometimes we need to do the calculation in reverse. For example, we may want to know: if the probability that Z is less than some constant z equals a given p, what is z? In probability notation, we want to find z such that P(Z < z) = p. To answer this question we use the function qnorm(p, mean=, sd=).

Example 8: Given that Z ~ N(0, 1) and P(Z < z) = 0.95, we want to find z.

> qnorm(0.95, mean=0, sd=1)
[1] 1.644854

Or P(Z < z) = 0.975 for a normal distribution with mean 0 and standard deviation 1:

> qnorm(0.975, mean=0, sd=1)
[1] 1.959964
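Working on the z scale and working on the original scale give the same answers, because pnorm() and qnorm() standardize internally. The sketch below (an added illustration using the height parameters from Example 6) shows both routes side by side.

```r
mu <- 156; sigma <- 4.6

# Upper-tail probability for a height of 170 cm, two equivalent routes
z <- (170 - mu) / sigma
tail_z    <- 1 - pnorm(z)                # on the standardized scale
tail_orig <- 1 - pnorm(170, mu, sigma)   # on the original scale

# 95th percentile of height, two equivalent routes
q_orig <- qnorm(0.95, mu, sigma)         # directly in cm
q_z    <- mu + qnorm(0.95) * sigma       # z quantile converted back to cm

c(tail_z, tail_orig)
c(q_orig, q_z)
```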

6.3.5 The t, F and χ² distributions

The t, F and χ² distributions are, in practice, functions of the normal distribution. The relationships and the ways of computing them are described in the following notes.

Chi-squared distribution (χ²). The χ² distribution arises from the sum of squares of standard normal variables. If $x_i \sim N(0, 1)$ and we let $u = \sum_{i=1}^{n} x_i^2$, then u follows a chi-squared distribution with n degrees of freedom (usually abbreviated df). In mathematical notation, $u \sim \chi^2_n$.

Example 9: Computing a probability for a chi-squared variable therefore requires only two parameters, u and n. For example, to find the density at u = 21 with df = 13, we simply use the function dchisq as follows:

> dchisq(21, 13)
[1] 0.01977879

Find the probability that a variable u is less than or equal to 21 with 13 degrees of freedom, that is P(u ≤ 21 | df=13):

> pchisq(21, 13)
[1] 0.92707142

We can also say that the result shows $P(\chi^2_{13} < 21) = 0.927$.

Find the quantile of u corresponding to 95% of a χ² distribution with 15 degrees of freedom:

> qchisq(0.95, 15)
[1] 24.995792

In other words, $P(\chi^2_{15} < 24.99) = 0.95$.

Non-centrality. Note that in the definition above, the χ² distribution arises from the sum of squares of normal variables with mean 0 and variance 1. If the normal variables have a mean other than 0, we obtain a non-central chi-squared distribution. If $x_i \sim N(\mu_i, 1)$ and we let $u = \sum_{i=1}^{n} x_i^2$, then u follows a non-central chi-squared distribution with n degrees of freedom and non-centrality parameter λ given by:

$$\lambda = \sum_{i=1}^{n} \mu_i^2$$

This is denoted $u \sim \chi^2_{n,\lambda}$. The mean of u is n+λ, and the variance of u is 2(n+2λ).

Find the probability that u is less than or equal to 21, given 13 degrees of freedom and a non-centrality parameter of 5.4:

> pchisq(21, 13, 5.4)
[1] 0.68376492

That is, $P(\chi^2_{13,5.4} < 21) = 0.684$.

Find the quantile corresponding to 50% of a χ² distribution with 7 degrees of freedom and a non-centrality parameter of 3:

> qchisq(0.5, 7, 3)
[1] 9.1801482

Hence $P(\chi^2_{7,3} < 9.180148) = 0.50$.

t distribution. We have just seen that if X ~ N(μ, σ²) then (X − μ)/σ ~ N(0, 1). But this statement is exact only when we know the variance σ². In practice we rarely know the variance exactly; we only estimate it from experimental data. When the variance is estimated from study data, and we call this estimate s², we can state that (X − μ)/s ~ t(0, v), where v is the degrees of freedom.

Example 10. Find the probability that x is greater than 1.1, where the variable follows a t distribution with 6 degrees of freedom:

> 1-pt(1.1, 6)
[1] 0.1567481

That is, $P(t_6 > 1.1) = 1 - P(t_6 < 1.1) = 0.157$. Find the quantile corresponding to 95% of a t distribution with 15 degrees of freedom:

> qt(0.95, 15)
[1] 1.753050

In other words, $P(t_{15} < 1.753050) = 0.95$.

F distribution. The ratio of two independent chi-squared variables, each divided by its degrees of freedom, can be shown to follow an F distribution. In other words, if $u \sim \chi^2_n$ and $v \sim \chi^2_m$, then $(u/n)/(v/m) \sim F_{n,m}$, where n is the numerator degrees of freedom and m is the denominator degrees of freedom.

Example 11: Find the probability that an F value is greater than 3.24, given that the variable follows an F distribution with 3 and 15 degrees of freedom and a non-centrality parameter of 5:

> 1-pf(3.24, 3, 15, 5)
[1] 0.3558721

Hence $P(F_{3,15,5} > 3.24) = 1 - P(F_{3,15,5} \le 3.24) = 0.3558721$.

With 3 and 15 degrees of freedom, find C such that $P(F_{3,15} > C) = 0.05$. The solution in R is:

> qf(1-0.05, 3, 15)
[1] 3.287382

In other words, $P(F_{3,15} > 3.287382) = 1 - P(F_{3,15} \le 3.287382) = 1 - 0.95 = 0.05$.
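The p and q functions of these distributions are inverses of each other, and the distributions are linked: the square of a t variable with v degrees of freedom follows an F distribution with 1 and v degrees of freedom. The sketch below is an added illustration of both facts; the variable names are hypothetical.

```r
# Round trips: the cumulative probability of a quantile returns the probability
rt_chisq <- pchisq(qchisq(0.95, 15), 15)
rt_t     <- pt(qt(0.95, 15), 15)
rt_f     <- pf(qf(0.95, 3, 15), 3, 15)

# Link between t and F: the two-sided 95% t cutoff, squared,
# equals the one-sided 95% cutoff of F with 1 and 15 degrees of freedom
t_sq <- qt(0.975, 15)^2
f_q  <- qf(0.95, 1, 15)

c(rt_chisq, rt_t, rt_f)
c(t_sq, f_q)
```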

6.4 Simulation

In statistical analysis, a limited sample size sometimes makes it difficult to estimate parameters accurately, and in such uncertain situations we need simulation to learn about the variability of one or more parameters. Simulation is usually based on probability distributions. This is a fairly complex field that lies outside the scope of this chapter. Here we only go through a few illustrative simulation models that readers can build on.

Example 11: A simulation demonstrating that the variance of the mean equals the variance divided by n ($\mathrm{var}(\bar{X}) = \sigma^2 / n$). Consider a discrete variable taking the values 1, 3 and 5 with the following probabilities:

x      1     3     5
P(x)   0.60  0.30  0.10

From these data we know that the mean is (1x0.60)+(3x0.30)+(5x0.10) = 2.0 and the variance (which readers can verify) is 1.8. We now use these two parameters to simulate 500 draws. The first command creates the 3 values of x. The second command enters the probability of each value of x. The sample command asks R to generate 500 random numbers and store them in the object draws.

> x <- c(1, 3, 5)
> px <- c(0.6, 0.3, 0.1)
> draws <- sample(x, size=500, replace=TRUE, prob=px)

Next, we draw 4 values 500 times and compute the mean of each set of 4:

> draws <- sample(x, size=4*500, replace=TRUE, prob=px)
> draws <- matrix(draws, 4)
> drawmeans = apply(draws, 2, mean)

The first and second commands create an object named draws with 4 rows, each row holding 500 values from the distribution above. In other words, we have 4*500 = 2000 numbers. The 500 values per row correspond to 500 columns, 1 to 500, so each column holds 4 numbers. The third command computes the mean of each column, producing 500 means stored in the object drawmeans. The following chart shows the distribution of the 500 means:

> hist(drawmeans, breaks=seq(1,5,by=0.25), main="500 means of 4 draws")

[Figure: histogram titled "500 means of 4 draws"; x-axis drawmeans (1 to 5), y-axis Frequency.]

We see that the variance of this distribution is smaller. In fact, the variance of these 500 means is 0.45.

> var(drawmeans)
[1] 0.4501112

This is equivalent to the value 0.45 that we expect from the formula

$$\mathrm{var}(\bar{X}) = \sigma^2 / 4 = 1.8 / 4 = 0.45.$$
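The simulation in this example can be made reproducible by fixing the random seed. The sketch below is a self-contained variant, not the book's exact code: the seed, the name px for the probability vector, and the larger number of replicates (20,000 instead of 500, to reduce simulation noise) are all assumptions made for illustration.

```r
set.seed(2006)                      # arbitrary seed, for reproducibility only
x  <- c(1, 3, 5)                    # possible values
px <- c(0.6, 0.3, 0.1)              # their probabilities

mu     <- sum(x * px)               # theoretical mean
sigma2 <- sum((x - mu)^2 * px)      # theoretical variance

n <- 4; reps <- 20000               # sample size 4, many replicates
draws <- matrix(sample(x, n * reps, replace = TRUE, prob = px), nrow = n)
drawmeans <- colMeans(draws)        # one mean per column (per replicate)

c(theory = sigma2 / n, simulated = var(drawmeans))
```

colMeans(draws) gives the same result as apply(draws, 2, mean) and is simply faster for large matrices.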

6.4.1 Simulating the binomial distribution

Example 12: Simulating samples from a population following a binomial distribution. Suppose we know that 20% of people in a population have diabetes (probability p=0.2). We want to draw samples from this population, each sample consisting of 20 subjects, with the sampling repeated 100 times:

> bin <- rbinom(100, 20, 0.2)
> bin
  [1] 4 4 5 3 2 2 3 2 5 4 3 6 7 3 4 4 1 5 3 5 3 4 4 5 1 4 4 4 4 3 2 4 2 2 5 4 5
 [38] 7 3 5 3 3 4 3 2 4 5 2 4 5 5 4 2 2 2 8 5 5 5 3 4 5 7 4 3 6 4 6 6 8 8 3 3 1
 [75] 1 4 4 2 3 9 7 4 4 0 0 8 6 9 3 1 4 5 6 4 5 3 2 4 3 2

The result says that the first sample contains 4 people with the disease, the second sample also 4, the third sample 5, and so on. These results can be summarized in a chart as follows:

> hist(bin, xlab="Number of diabetic patients", ylab="Number of samples", main="Distribution of the number of diabetic patients")

[Figure: histogram titled "Distribution of the number of diabetic patients"; x-axis Number of diabetic patients (0 to 8), y-axis Number of samples (0 to 25).]

> mean(bin)
[1] 3.97

As expected: because each sample has 20 subjects and the probability is 20%, we predict an average of 4 diabetic patients per sample.

6.4.2 Simulating the Poisson distribution

Example 13: Simulating samples from a population following a Poisson distribution. In the following example, we simulate 100 samples from a population following a Poisson distribution with mean λ=3:

> pois <- rpois(100, lambda=3)
> pois
  [1] 4 3 2 4 2 3 4 4 0 7 5 0 3 3 4 2 2 6 1 4 2 3 3 5 4 2 1 4 0 2 1 5 1 2 2 2 6
 [38] 1 3 6 3 3 5 4 3 2 2 5 3 3 3 1 4 7 3 4 3 2 6 1 4 1 0 5 2 2 2 3 6 8 4 4 1 4
 [75] 1 0 0 4 3 3 2 3 3 3 4 1 5 4 4 1 3 1 6 4 4 4 2 2 2 4

Plotting the distribution:

> hist(pois)

[Figure: histogram titled "Histogram of pois"; x-axis pois (0 to 8), y-axis Frequency (0 to 20).]

Poisson and exponential distributions. In the following example, we simulate the times at which patients arrive at a hospital. Patients are known to arrive at random according to a Poisson distribution, with an average of 15 patients per 150 minutes. It can be shown that the time between the arrivals of two consecutive patients follows an exponential distribution. We want to know when patients arrive at the hospital; we therefore simulate 15 inter-arrival times from an exponential distribution with a rate of 15/150 = 0.1 per minute. The following commands do this:

> # Generate the inter-arrival times
> appoint <- rexp(15, 0.1)
> times <- round(appoint, 0)
> times
 [1] 37  5  8 10 24  5  1  7  8  6 12  6  3 25 15

6.4.3 Simulating the χ², t and F distributions

The simulation approach above can also be applied to other distributions such as the negative binomial (rnbinom), gamma (rgamma), beta (rbeta), chi-squared (rchisq), exponential (rexp), t (rt), F (rf), and so on. The parameters of these simulation functions can be found in the first part of this chapter. The following commands illustrate these common distributions.

Chi-squared distribution with several degrees of freedom:

> curve(dchisq(x, 1), xlim=c(0,10), ylim=c(0,0.6), col="red", lwd=3)
> curve(dchisq(x, 2), add=T, col="green", lwd=3)
> curve(dchisq(x, 3), add=T, col="blue", lwd=3)
> curve(dchisq(x, 5), add=T, col="orange", lwd=3)
> abline(h=0, lty=3)
> abline(v=0, lty=3)
> legend(par("usr")[2], par("usr")[4], xjust=1, c("df=1", "df=2", "df=3", "df=5"), lwd=3, lty=1, col=c("red", "green", "blue", "orange"))

[Figure: four chi-squared density curves; x-axis x (0 to 10), y-axis dchisq(x, 1) (0.0 to 0.6); legend df=1, df=2, df=3, df=5.]

Figure 4. Chi-squared distribution with degrees of freedom = 1, 2, 3, 5.

t distribution:

> curve(dt(x, 1), xlim=c(-3,3), ylim=c(0,0.4), col="red", lwd=3)
> curve(dt(x, 2), add=T, col="blue", lwd=3)
> curve(dt(x, 5), add=T, col="green", lwd=3)
> curve(dt(x, 10), add=T, col="orange", lwd=3)
> curve(dnorm(x), add=T, lwd=4, lty=3)
> title(main="Student T distributions")
> legend(par("usr")[2], par("usr")[4], xjust=1, c("df=1", "df=2", "df=5", "df=10", "Normal distribution"), lwd=c(2,2,2,2,2), lty=c(1,1,1,1,3), col=c("red", "blue", "green", "orange", par("fg")))

[Figure: curves titled "Student T distributions"; x-axis x (-3 to 3), y-axis dt(x, 1) (0.0 to 0.4); legend df=1, df=2, df=5, df=10, Normal distribution.]

Figure 5. t distribution with degrees of freedom = 1, 2, 5, 10 compared with the normal distribution.

F distribution:

> curve(df(x,1,1), xlim=c(0,2), ylim=c(0,0.8), lty=2)
> curve(df(x,3,1), add=T)
> curve(df(x,6,1), add=T, lwd=3)
> curve(df(x,3,3), add=T, col="red")
> curve(df(x,6,3), add=T, col="red", lwd=3)
> curve(df(x,3,6), add=T, col="blue")
> curve(df(x,6,6), add=T, col="blue", lwd=3)
> title(main="Fisher F distributions")
> legend(par("usr")[2], par("usr")[4], xjust=1, c("df=1,1", "df=3,1", "df=6,1", "df=3,3", "df=6,3", "df=3,6", "df=6,6"), lwd=c(1,1,3,1,3,1,3), lty=c(2,1,1,1,1,1,1), col=c(par("fg"), par("fg"), par("fg"), "red", "red", "blue", "blue"))

[Figure: curves titled "Fisher F distributions"; x-axis x (0.0 to 2.0), y-axis df(x, 1, 1) (0.0 to 0.8); legend df=1,1, df=3,1, df=6,1, df=3,3, df=6,3, df=3,6, df=6,6.]

Figure 6. F distribution with various degrees of freedom.

Gamma distribution:

> curve( dgamma(x,1,1), xlim=c(0,5) )
> curve( dgamma(x,2,1), add=T, col='red' )
> curve( dgamma(x,3,1), add=T, col='green' )
> curve( dgamma(x,4,1), add=T, col='blue' )
> curve( dgamma(x,5,1), add=T, col='orange' )
> title(main="Gamma probability distribution function")
> legend(par('usr')[2], par('usr')[4], xjust=1, c('k=1 (Exponential distribution)', 'k=2', 'k=3', 'k=4', 'k=5'), lwd=1, lty=1, col=c(par('fg'), 'red', 'green', 'blue', 'orange') )

[Figure: curves titled "Gamma probability distribution function"; x-axis x (0 to 5), y-axis dgamma(x, 1, 1) (0.0 to 1.0); legend k=1 (Exponential distribution), k=2, k=3, k=4, k=5.]

Figure 7. Gamma distribution with various shapes.

> > > > > > > > > > > >

Phn phi beta:curve( dbeta(x,1,1), xlim=c(0,1), ylim=c(0,4) ) curve( dbeta(x,2,1), add=T, col='red' ) curve( dbeta(x,3,1), add=T, col='green' ) curve( dbeta(x,4,1), add=T, col='blue' ) curve( dbeta(x,2,2), add=T, lty=2, lwd=2, col='red' ) curve( dbeta(x,3,2), add=T, lty=2, lwd=2, col='green' ) curve( dbeta(x,4,2), add=T, lty=2, lwd=2, col='blue' ) curve( dbeta(x,2,3), add=T, lty=3, lwd=3, col='red' ) curve( dbeta(x,3,3), add=T, lty=3, lwd=3, col='green' ) curve( dbeta(x,4,3), add=T, lty=3, lwd=3, col='blue' ) title(main="Beta distribution") legend(par('usr')[1], par('usr')[4], xjust=0, c('(1,1)', '(2,1)', '(3,1)', '(4,1)', '(2,2)', '(3,2)', '(4,2)', '(2,3)', '(3,3)', '(4,3)' ), lwd=1, #c(1,1,1,1, 2,2,2, 3,3,3), lty=c(1,1,1,1, 2,2,2, 3,3,3), col=c(par('fg'), 'red', 'green', 'blue', 'red', 'green', 'blue', 'red', 'green', 'blue' ))


Figure 8. Beta distributions with various shapes.

Weibull distribution:

> curve(dexp(x), xlim=c(0,3), ylim=c(0,2))
> curve(dweibull(x,1), lty=3, lwd=3, add=T)
> curve(dweibull(x,2), col='red', add=T)
> curve(dweibull(x,.8), col='blue', add=T)
> title(main="Weibull Probability Distribution Function")
> legend(par('usr')[2], par('usr')[4], xjust=1,
    c('Exponential', 'Weibull, shape=1', 'Weibull, shape=2', 'Weibull, shape=.8'),
    lwd=c(1,3,1,1),
    lty=c(1,3,1,1),
    col=c(par("fg"), par("fg"), 'red', 'blue'))


Figure 9. Weibull distributions.

Cauchy distribution:

> curve(dcauchy(x), xlim=c(-5,5), ylim=c(0,.5), lwd=3)
> curve(dnorm(x), add=T, col='red', lty=2)
> legend(par('usr')[2], par('usr')[4], xjust=1,
    c('Cauchy distribution', 'Gaussian distribution'),
    lwd=c(3,1),
    lty=c(1,2),
    col=c(par("fg"), 'red'))

Figure 10. The Cauchy distribution compared with the normal distribution.
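Two features of the last two figures can be checked numerically. The sketch below, which uses only base R density and distribution functions, first confirms that the Weibull curve with shape 1 lies exactly on the exponential curve. It then compares how much probability the Cauchy and normal distributions place beyond 3. The cut-off 3 is an arbitrary choice for illustration.

```r
x <- seq(0.1, 3, by = 0.1)

# Weibull with shape = 1 (and the default scale = 1) coincides with dexp(x),
# which is why two of the four curves in the Weibull figure overlap
weibull_is_exp <- isTRUE(all.equal(dweibull(x, shape = 1), dexp(x)))

# Upper-tail probabilities beyond 3 for the standard Cauchy and normal laws
tail_cauchy <- pcauchy(3, lower.tail = FALSE)
tail_normal <- pnorm(3, lower.tail = FALSE)

print(weibull_is_exp)
print(c(cauchy = tail_cauchy, normal = tail_normal))
```

The much larger Cauchy tail probability is the numerical counterpart of the heavier tails visible in the figure.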


6.5 Random sampling

In probability and statistics, random sampling is very important because it underpins the validity of the methods of statistical analysis and inference. In R, we can draw a random sample with the function sample. Example: we have a population of 40 people (coded 1, 2, 3, ..., 40). If we want to select 5 subjects from this population, who will be chosen? We can use sample() to answer the question as follows:

> sample(1:40, 5)
[1] 32 26 6 18 9

The result shows that subjects 32, 26, 6, 18 and 9 were selected. Each time the command is issued, R selects a different sample, not one identical to the sample above. For example:

> sample(1:40, 5)
[1] 5 22 35 19 4
> sample(1:40, 5)
[1] 24 26 12 6 22
> sample(1:40, 5)
[1] 22 38 11 6 18

and so on. The command above performs random sampling without replacement: once a subject has been selected, it is not returned to the population. Sometimes, however, we want to sample with replacement, meaning that each selected subject is put back into the population before the next draw. For example, to select 10 people from a population of 50 by random sampling with replacement, we only need to add the argument replace=TRUE:

> sample(1:50, 10, replace=T)
[1] 31 44 6 8 47 50 10 16 29 23
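Because sample() returns a different draw on every call, results cannot be reproduced unless the random number generator is seeded. The sketch below uses set.seed() for that purpose (the seed value 2006 is arbitrary) and also shows the practical difference between sampling without and with replacement.

```r
# Fixing the seed makes the "random" draw repeatable
set.seed(2006)               # arbitrary seed chosen for this illustration
first <- sample(1:40, 5)
set.seed(2006)
second <- sample(1:40, 5)
identical(first, second)     # the two draws coincide

# Without replacement no subject can appear twice ...
anyDuplicated(first) == 0

# ... and asking for more subjects than the population holds is an error
too_many <- try(sample(1:40, 50), silent = TRUE)
inherits(too_many, "try-error")

# With replacement the same request is allowed
with_repl <- sample(1:40, 50, replace = TRUE)
length(with_repl)
```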


Or consider tossing a coin 10 times. Each toss has 2 possible outcomes, H and T, and the result of 10 tosses might be:

> sample(c("H", "T"), 10, replace=T)
[1] "H" "T" "H" "H" "H" "T" "H" "H" "T" "T"

We can also imagine a bag holding 5 blue balls (coded X) and 5 red balls (coded D). We draw 1 ball, note its colour and put it back in the bag; then we draw another ball, note its colour and put it back again. Continuing in this way for 20 draws, the result might be:

> sample(c("X", "D"), 20, replace=T)
[1] "X" "D" "D" "D" "D" "D" "X" "X" "X" "X" "X" "D" "X" "X" "D" "X" "X" "X" "X"
[20] "D"

In addition, we can sample with specified probabilities. In the following call we select 10 subjects from the numbers 1 to 5, but with unequal probabilities:

> sample(5, 10, prob=c(0.3, 0.4, 0.1, 0.1, 0.1), replace=T)
[1] 3 1 3 2 2 2 2 2 5 1

Subject 1 was selected 2 times, subject 2 was selected 5 times, subject 3 was selected 2 times, and so on. Because the sample is small, the observed frequencies do not match the supplied probabilities 0.3, 0.4, 0.1 exactly, but they are not far from what is expected.
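The remark about small samples can be illustrated by repeating the draw with a much larger sample size. In the sketch below the helper name `observed_props`, the seed and the sample size 100000 are illustrative choices, not part of the original text.

```r
set.seed(10)                                   # arbitrary seed
p <- c(0.3, 0.4, 0.1, 0.1, 0.1)                # selection probabilities from the text

# Observed proportions of the values 1..5 in a sample of a given size
observed_props <- function(size) {
  draws <- sample(5, size, prob = p, replace = TRUE)
  as.numeric(table(factor(draws, levels = 1:5))) / size
}

small <- observed_props(10)       # as in the text: only 10 draws
large <- observed_props(100000)   # many draws

print(rbind(target = p, n10 = small, n100000 = large))
```

With 10 draws the proportions can only be multiples of 0.1, so some mismatch is unavoidable; with many draws they settle close to the target probabilities.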


7 Statistical hypothesis testing and the meaning of the P value

7.1 The P value

In scientific research, apart from data presented as numbers, charts and images, the figure we meet most often is the P value. In the following chapters the reader will meet the P value many times, and the great majority of statistical and scientific inferences rest on it. So before discussing the methods of statistical analysis in R, we need to understand what this value means.

The P value is a probability; the name is short for "probability value". We often see statements accompanied by such a number, for example "The analysis shows that the fracture rate in patients treated with Alendronate was 2%, lower than the rate in untreated patients (5%), and the difference was statistically significant (p = 0.01)", or a statement such as "After 3 months of treatment, blood pressure in the patient group fell by 10% (p < 0.05)."

In contexts like these, the great majority of scientists take the P value to reflect the probability that Alendronate, or some other treatment, is effective. Many read the first statement as saying that the probability that Alendronate is better than placebo is 0.99 (1 minus 0.01). That reading is completely wrong.

Indeed, a great many people, not only readers but even the authors of scientific papers themselves, do not understand the P value correctly. According to a study published in the respected journal Statistics in Medicine [1], 85% of scientific authors and research physicians either do not understand or misunderstand the meaning of the P value. The question must therefore be asked seriously: what does the P value mean? To answer it, we need to look at the concept of falsification and at the process of a scientific study.

7.2 Scientific hypotheses and falsification

A hypothesis is considered scientific if it can be falsified. According to Karl Popper, the philosopher of science, the one feature that distinguishes a genuine scientific theory from pseudoscience is that a scientific theory can always be refuted (falsified) by simple experiments. He called this "falsifiability" (some sources write "falsibility"). Falsification is a way of conducting experiments not to verify scientific theories but to criticise them, and it can be regarded as a foundation of genuine science. For example, the hypothesis "All crows are black" can be refuted if we find a single red crow.

Falsification can be seen as a way of learning from mistakes. Science advances in large part by learning from mistakes, a point no one in the scientific community disputes. Scientific research can be defined as a process of testing hypotheses, following these steps:

Step 1. The researcher defines a null hypothesis, that is, a hypothesis contrary to what the researcher believes to be true. For example, in a clinical study with two groups of patients, one treated with drug A and one treated with placebo, the researcher may state the null hypothesis that the efficacy of drug A equals the efficacy of placebo (meaning that drug A does not have the desired effect).

Step 2. The researcher defines an alternative hypothesis, that is, a hypothesis the researcher thinks is true and which needs to be "proven" by data. In the example above, the researcher may state the alternative hypothesis that drug A is more effective than placebo.

Step 3. After collecting all the relevant data, the researcher uses one or more statistical methods to examine which of the two hypotheses is plausible. The examination answers the question: if the null hypothesis is true, what is the probability that the data collected are consistent with the null hypothesis? In scientific reports this probability is usually referred to as the "P value". Note that the researcher does not test any other hypothesis; only the null hypothesis is tested.

Step 4. The researcher decides whether to accept or reject the null hypothesis on the basis of the probability from step 3. By the convention usual in medical research, for example, if the probability is below 5% the researcher is prepared to reject the null hypothesis and conclude that the efficacy of drug A differs from that of placebo. If the probability is above 5%, however, the researcher can only state that there is not yet sufficient evidence to reject the null hypothesis; this does not mean that the null hypothesis is correct or true. In other words, absence of evidence is not evidence of absence.

Step 5. If the null hypothesis is rejected, the researcher accepts the alternative hypothesis by default. But the problem begins here, because there are many possible alternative hypotheses. Compared with the original alternative hypothesis (A differs from placebo), for example, the researcher could pose many different alternatives, such as that the efficacy of drug A is 5%, 10% or in general X% higher than that of placebo. In short, once the researcher rejects the null hypothesis, an alternative hypothesis is accepted by default, but the researcher cannot determine which alternative hypothesis corresponds to the truth.
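Steps 1 to 4 can be sketched in R for the drug A versus placebo example. The counts below are hypothetical (the text gives none), and prop.test() is only one of several tests that could be used for two proportions; it is chosen here for brevity.

```r
# Hypothetical trial: number of patients who respond, out of 100 per arm
responders <- c(drugA = 50, placebo = 30)
n_per_arm  <- c(100, 100)
alpha      <- 0.05                      # conventional threshold (step 4)

# Step 3: P value computed under the null hypothesis of equal efficacy
test <- prop.test(responders, n_per_arm)
print(test$p.value)

# Step 4: the decision concerns the null hypothesis only
decision <- if (test$p.value < alpha) {
  "reject H0"
} else {
  "not enough evidence to reject H0"
}
print(decision)
```

Note that the code never evaluates a specific alternative hypothesis, which is the point made in step 5.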

7.3 The meaning of the P value through simulation

To understand the practical meaning of the P value, we take a simple example.

Example 1. An experiment was conducted to study consumers' preference between two kinds of coffee (call them coffee A and coffee B). The researchers had 50 customers taste both coffees under the same conditions and asked which one they preferred. The result: 35 people preferred coffee A and 15 preferred coffee B. The question is whether, from this result, the researchers can conclude that coffee A is more popular than coffee B, or whether the result arose merely by chance.

"Arose by chance" means: under the binomial law, how likely is such a result? Binomial probability theory applies here because each outcome takes only two values (prefers A or prefers B). In the language of falsification, the null hypothesis is that there is no difference in preference, so the probability that a customer prefers a given coffee is 0.5. If this hypothesis is true (that is, p = 0.5, where p is the probability of preferring coffee A), and if the study were repeated, say, 1000 times, each time with 50 customers, in how many repetitions would 35 customers prefer coffee A? Let X be the event that 35 (or more) of the 50 customers prefer coffee A. In the language of probability, we want to find P(X | p = 0.50). To answer this question we can use the function rbinom to simulate, because, as noted above, the problem is essentially a binomial distribution:

> bin <- rbinom(1000, 50, 0.5)
> table(bin)
bin
 14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33
  1   1   2  11  16  24  47  60  83  94 107 132 114  98  65  44  44  26  14  12
 34  35
  2   3

From the result above, we see that of the 1000 studies, only 3 had 35 customers preferring coffee A (on the condition that there is no difference between the two coffees or, more precisely, that p = 0.5). In other words:

P(X ≥ 35 | p = 0.50) = 3/1000 = 0.003

We can also display these frequencies as a histogram:

[Figure: "Histogram of bin"; x axis "bin", y axis "Frequency".]
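The simulation can be written so that the tail proportion is computed directly rather than read off the frequency table. This is a minimal sketch: the seed is arbitrary, so the counts will differ from those printed above, and the variable names other than `bin` are illustrative.

```r
set.seed(1)                          # arbitrary seed, for repeatability only
n_studies   <- 1000                  # number of simulated studies
n_customers <- 50                    # customers per study

# Number of customers preferring A in each simulated study, under p = 0.5
bin <- rbinom(n_studies, n_customers, 0.5)

# Proportion of simulated studies with 35 or more customers preferring A
p_sim <- mean(bin >= 35)
print(p_sim)

hist(bin, main = "Histogram of bin")
```

`mean(bin >= 35)` works because the logical vector is treated as 0/1, so its mean is the proportion of TRUE values.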

Of course, we can run another simulation with 100,000 repetitions (instead of 1000) and compute the probability P(X ≥ 35 | p = 0.50):

> bin <- rbinom(100000, 50, 0.5)
> table(bin)
bin
   11    12    13    14    15    16    17    18    19    20    21    22    23
    4    17    40    83   197   462   946  1592  2719  4098  5892  7937  9733
   24    25    26    27    28    29    30    31    32    33    34    35    36
10822 11191 10799  9497  7925  5904  4185  2682  1562   893   455   223    98
   37    38    39    40
   31     5     7     1

This time there are more possible outcomes (because the number of simulations is larger). For example, some studies yield as few as 11 or as many as 40 customers preferring coffee A. What we want to know, however, is the number of studies in which 35 or more customers prefer coffee A, and the result above gives that probability as:

> (223+98+31+5+7+1)/100000
[1] 0.00365

In other words, the probability P(X ≥ 35 | p = 0.50) is very low (under 0.4%), so we have evidence that the observed result may not be due to chance; that is, there is a difference in customers' preference between the two coffees. The figure P = 0.00365 is the P value. By scientific convention, all P values below 0.05 (that is, below 5%) are regarded as "significant", meaning statistically significant.

The meaning of the P value must be stressed once more. The aim of the analysis above is to answer the question: if the two coffees are equally likely to be preferred (p = 0.5, the null hypothesis), what is the probability that the observed result (35 of 50 customers preferring A) occurs? In other words, this is the method for finding the P value. The interpretation of a P value is therefore conditional, and the condition here is p = 0.50. The reader can experiment further with p = 0.6 or p = 0.7 to see how the results differ.

In practice the P value has a very large influence on the fate of a scientific paper. Many journals and scientists regard a study with a P value above 0.05 as a "negative result", and the paper may be refused publication. For this reason, to most scientists P < 0.05 has become a "passport" to publication. If the result has P < 0.05, the paper has a chance of appearing in some journal and the author may become well known; if P > 0.05, the paper and the research behind it are likely to sink into oblivion.
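The simulated tail probability can be compared with the exact binomial tail, and the suggestion to try p = 0.6 or p = 0.7 can be followed the same way. The sketch below uses pbinom() and binom.test() from base R; the variable names are illustrative.

```r
n_customers <- 50
observed    <- 35

# Exact upper-tail probability P(X >= 35) under the null hypothesis p = 0.5
p_exact <- pbinom(observed - 1, n_customers, 0.5, lower.tail = FALSE)

# The same one-sided P value from R's exact binomial test
p_test <- binom.test(observed, n_customers, p = 0.5,
                     alternative = "greater")$p.value

# How the tail probability changes under other assumed values of p
p_assumed <- c(0.5, 0.6, 0.7)
p_tail <- sapply(p_assumed, function(p) {
  pbinom(observed - 1, n_customers, p, lower.tail = FALSE)
})
names(p_tail) <- paste0("p=", p_assumed)
print(round(p_tail, 4))
```

The tail probability grows quickly as the assumed p moves towards the observed proportion 35/50 = 0.7, which illustrates why a P value is meaningful only together with the hypothesis it is conditioned on.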

7.4 The logic problem of the P value

But from the standpoint of reason and rigorous science, should we give the P value such importance? The answer is no.


The P value has many problems, and reliance on it, in the past as well as today, has been sharply criticised by many. Its greatest defect is that it lacks logic. Indeed, if we take the trouble to re-examine the example above, the process of a medical study (based on the P value) can be summarised as follows:

• State a main hypothesis (H+).
• From the main hypothesis, state a null hypothesis (H-).
• Collect the data (D).
• Analyse the data: compute the probability that D occurs if H- is true. In the language of probability, this step is the computation of the P value, or P(D | H-).

So the P value is the probability that the data D occur if (the emphasis is on "if") the null hypothesis H- is true. The P value therefore tells us nothing directly about the truth of the main hypothesis H+; it only indirectly supplies evidence for accepting the main hypothesis and rejecting the null hypothesis.

The logic behind the P value can be understood as a proof by contradiction:

Proposition 1: If the null hypothesis is true, then these data cannot occur;
Proposition 2: The data occurred;
Proposition 3 (conclusion): The null hypothesis cannot be true.

If this reasoning is hard to follow, consider a concrete example:

If Mr. Tuan has hypertension, then he cannot have the symptom of hair loss (the two biological phenomena are unrelated, at least according to current medical knowledge);
Mr. Tuan has hair loss;
Therefore, Mr. Tuan cannot have hypertension.

The P value thus only indirectly reflects the probability of proposition 3. This is an important defect of the P value, because it estimates how plausible the data are, not how plausible a hypothesis is. It makes inference based on the P value far removed from reality and from experimental science. In experimental science, what the researcher wants to know is: given the data obtained, what is the probability of the main hypothesis? The researcher does not want to know the probability of the data given that the null hypothesis is true. In other words, using the notation above, the researcher wants P(H+ | D), not P(D | H+) or P(D | H-).
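The gap between P(D | H-) and P(H+ | D) can be made concrete with Bayes' theorem. The sketch below is an illustration only: the function name `posterior_h_plus` and all the numerical inputs are hypothetical. It treats H+ and H- as the only two possibilities and treats P(D | H+) and P(D | H-) as single numbers, as the notation above does.

```r
# Posterior probability of the main hypothesis H+ given data D (Bayes' theorem).
# All inputs are hypothetical numbers chosen for illustration.
posterior_h_plus <- function(prior, p_d_given_plus, p_d_given_minus) {
  joint_plus  <- p_d_given_plus  * prior          # P(D | H+) P(H+)
  joint_minus <- p_d_given_minus * (1 - prior)    # P(D | H-) P(H-)
  joint_plus / (joint_plus + joint_minus)         # P(H+ | D)
}

# Same P(D | H-) = 0.05, but two different prior beliefs in H+
post_sceptical  <- posterior_h_plus(prior = 0.1, p_d_given_plus = 0.8,
                                    p_d_given_minus = 0.05)
post_optimistic <- posterior_h_plus(prior = 0.5, p_d_given_plus = 0.8,
                                    p_d_given_minus = 0.05)
print(c(prior_0.1 = post_sceptical, prior_0.5 = post_optimistic))
```

The same value of P(D | H-) leads to quite different values of P(H+ | D) depending on the prior, so the former cannot be read as the latter.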


7.5 The problem of multiple tests of hypothesis

As noted above, medical research is a process of testing hypotheses. In a study we rarely test just one hypothesis; we usually test many at once. For example, in a study of the relationship between vitamin D and the risk of hip fracture, the researchers may analyse the correlation between vitamin D and bone mineral density, or between vitamin D and fracture risk by sex, by age group, or by the patients' clinical characteristics, and so on (see the example below). Each such analysis can be regarded as a test of a hypothesis. Here we face the problem of multiple hypotheses (multiple tests of hypothesis, also called multiple comparisons).

Table 2. Analysis of the effect of vitamin D and calcium by patient characteristic

Patient characteristic   Calcium and         Placebo       Relative risk and
                         vitamin D group ¹   group ¹       95% confidence interval ²
Age
  50-59                  29 (0.06)           13 (0.03)     2.17 (1.13-4.18)
  60-69                  53 (0.09)           71 (0.13)     0.74 (0.52-1.06)
  70-79                  93 (0.44)           115 (0.54)    0.82 (0.62-1.08)
Body mass index
  ...                    69 (0.20)           66 (0.19)     1.05 (0.75-1.47)
  ...                    63 (0.14)           74 (0.16)     0.87 (0.62-1.22)
  ... 30                 43 (0.09)           59 (0.13)     0.73 (0.49-1.09)
Smoking
  Non-smoker             159 (0.14)          178 (0.15)    0.90 (0.71-1.11)
  Current smoker         14 (0.14)           16 (0.17)     0.85 (0.41-1.74)

Notes: ¹ The figure outside the parentheses is the number of patients with a hip fracture during follow-up (7 years); the figure in parentheses is the fracture rate in percent per year. ² The relative risk (RR, explained in a later chapter) is estimated by dividing the fracture rate in the intervention group by the rate in the placebo group. If the 95% confidence interval includes 1, the difference between the two groups is not statistically significant; if the 95% confidence interval does not include 1, the difference between the two groups is regarded as statistically significant (or p < 0.05).

[...]

> par(mfrow=c(2,2))
> N <- ...
> x <- ...
> y <- ...
> par(mfrow=c(1,2))
> N <- ...
> plot(x, y, xlab="Time", ylab="Production")
> title(main="Plot of production and x factor", sub="Figure 1")


[Figure: scatter plot titled "Plot of production and x factor"; x axis "X factor", y axis "Production", subtitle "Figure 1".]

8.1.3 Setting the limits of the vertical and horizontal axes

If the limits of the vertical and horizontal axes are not supplied, R works them out automatically from the data. However, we can also control the plot by using xlim and ylim to tell R the exact limits of the two axes:

> plot(x, y, xlab="X factor", ylab="Production", main="Plot of production and x factor", xlim=c(-5, 5), ylim=c(-3, 3))
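A self-contained version of this call is sketched below. The data are hypothetical: the values of N, x and y used in the book are not available here, so the sample size, the distributions and the seed are assumptions made only so that the code runs. After plotting, par("usr") reports the limits of the plotting region actually in force.

```r
set.seed(42)                        # arbitrary seed
N <- 200                            # assumed sample size
x <- rnorm(N, mean = 0, sd = 2)     # hypothetical "X factor"
y <- 0.4 * x + rnorm(N)             # hypothetical "Production"

range(x); range(y)                  # the data ranges R would start from by default

plot(x, y, xlab = "X factor", ylab = "Production",
     main = "Plot of production and x factor",
     xlim = c(-5, 5), ylim = c(-3, 3))

# Plot-region limits in force, in the order c(x1, x2, y1, y2)
usr <- par("usr")
print(usr)
```

With the default axis style, R pads the requested limits slightly, so `usr` extends a little beyond -5 to 5 and -3 to 3. Points that fall outside the requested limits are simply not drawn.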

8.1.4 Plot types and lines

Within one set of plots, we can ask R to draw several different types of plot and line:

> par(mfrow=c(2,2))
> plot(y, type="l"); title("lines")
> plot(y, type="b"); title("both")
> plot(y, type="o"); title("overstruck")
> plot(y, type="h"); title("high density")


Figure 3. Plot types and lines (panels: "lines", "both", "overstruck", "high density").

In addition, we can draw different kinds of line with lty, as follows:

> par(mfrow=c(2,2))
> plot(y, type="l", lty=1); title(main="Production data", sub="lty=1")
> plot(y, type="l", lty=2); title(main="Production data", sub="lty=2")
> plot(y, type="l", lty=3); title(main="Production data", sub="lty=3")
> plot(y, type="l", lty=4); title(main="Production data", sub="lty=4")


Figure 4. Effect of lty.
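The four lty panels can also be produced with a loop, and lty accepts names as well as numbers. In the sketch below the series `y` is hypothetical (random numbers standing in for the production data), and the layout is saved and restored so that later plots are not affected.

```r
set.seed(3)                 # arbitrary seed
y <- rnorm(200)             # hypothetical series standing in for the production data

# Names equivalent to lty = 1, 2, 3, 4
lty_names <- c("solid", "dashed", "dotted", "dotdash")

op <- par(mfrow = c(2, 2))  # remember the previous layout
for (i in seq_along(lty_names)) {
  plot(y, type = "l", lty = lty_names[i], main = "Production data",
       sub = paste0("lty=", i, " (", lty_names[i], ")"))
}
par(op)                     # restore the previous layout
```

Saving the value returned by par() and passing it back afterwards is the same pattern as the `op` and `par(op)` pair used later in this chapter.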

8.1.5 Colours, frames and symbols

We can control the colour of a plot with the col argument. The default value of col is 1. However, we can change colours as we wish, either by giving a number or by writing out a colour name such as red, blue, green, orange, yellow, cyan, and so on. The following example draws three lines in three colours, red, blue and green:

> plot(runif(10), ylim=c(0,1), type='l')
> for (i in c('red', 'blue', 'green')) { lines(runif(10), col=i) }
> title(main="Lines in various colours")


[Figure: "Lines in various colours"; x axis "Index", y axis "runif(10)".]
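The two ways of specifying a colour, by name and by number, can be explored with a few base R helpers. The sketch below uses colors(), col2rgb() and palette(); the vector name `named` is illustrative.

```r
named <- c("red", "blue", "green", "orange", "yellow", "cyan")

# All of these are built-in colour names
all(named %in% colors())

# A name is shorthand for an RGB triple
col2rgb("red")

# A number is an index into the current palette; entry 1 is the default colour
palette()[1]

# Names and numbers can be mixed freely in the same plot
plot(runif(10), ylim = c(0, 1), type = "l", col = 1)
lines(runif(10), col = "red")
lines(runif(10), col = 4)
```

Because numbers are looked up in the current palette, the colour shown for a given number can differ between R versions or after a call that changes the palette; names are unambiguous.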

In addition, we can vary the thickness of each line:

> plot(runif(5), ylim=c(0,1), type='n')
> for (i in 5:1) { lines( runif(5), col=i, lwd=i ) }
> title(main="Varying the line thickness")

[Figure: "Varying the line thickness"; x axis "Index", y axis "runif(5)".]

The form of a plot can also be changed with type, as follows:

> op <- par(mfrow=c(3,2))
> plot(runif(5), type = 'p', main = "plot type 'p' (points)")
> plot(runif(5), type = 'l', main = "plot type 'l' (lines)")
> plot(runif(5), type = 'b', main = "plot type 'b' (both points and lines)")
> plot(runif(5), type = 's', main = "plot type 's' (stair steps)")
> plot(runif(5), type = 'h', main = "plot type 'h' (histogram)")
> plot(runif(5), type = 'n', main = "plot type 'n' (no plot)")
> par(op)

[Figure: six panels of runif(5) against Index, titled "plot type 'p' (points)", "plot type 'l' (lines)", "plot type 'b' (both points and lines)", "plot type 's' (stair steps)", "plot type 'h' (histogram)" and "plot type 'n' (no plot)".]

The frame of a plot can be controlled with the bty argument, which takes the following values:

bty="n"   No frame around the plot
bty="o"   A frame on all 4 sides of the plot
bty="c"   A 3-sided box around the plot, shaped like the letter C
bty="l"   A 2-sided box around the plot, shaped like the letter L
bty="7"   A 2-sided box around the plot, shaped like the digit 7

The best way for the reader to become familiar with these options is to try them in R.

The plotting symbol can also be changed by supplying a number to pch (plotting character). The commonly used symbols are:


[Figure: "Available symbols", a chart of the plotting symbols numbered 1 to 25.]
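A chart like the one above can be regenerated in a few lines. This is a sketch of one possible layout, not the code used for the original figure: the symbols are placed on a 5 by 5 grid with codes 1 to 5 on the bottom row, and the names `col_pos` and `row_pos` are illustrative.

```r
pch_codes <- 1:25

# Grid position of each symbol: 5 per row, codes 1-5 on the bottom row
col_pos <- (pch_codes - 1) %% 5 + 1
row_pos <- (pch_codes - 1) %/% 5 + 1

# Empty canvas without axes, then the symbols and their codes
plot(col_pos, row_pos, type = "n", axes = FALSE, xlab = "", ylab = "",
     xlim = c(0.5, 5.5), ylim = c(0.5, 5.5), main = "Available symbols")
points(col_pos, row_pos, pch = pch_codes, cex = 2, bg = "grey")  # bg fills symbols 21-25
text(col_pos, row_pos, labels = pch_codes, pos = 4, offset = 1)  # code to the right
```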

> plot(x, y, col="red", pch=16, bty="l")

Figure 4. Effect of pch=16, col="red" and bty="l".

8.1.6 Legends


The legend function is very useful for annotating a plot and helps the reader understand it better. Its use can be illustrated with the following example:

> N <- ...
> x <- ...
> y <- ...
> plot(x,y, pch=16, main="Scatter plot of y and x")
> reg <- ...
> abline(reg)
> legend(2,-2, c("Production","Regression line"), pch=16, lty=c(0,1))

The arguments legend(2,-2) mean that the top-left corner of the legend box is placed at 2 on the horizontal axis (x-axis) and at -2 on the vertical axis (y-axis).

Figure 5. Effect of legend.
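A runnable variant of the legend example is sketched below. Everything about the data is an assumption: the definitions of N, x, y and reg are not available here, so simulated values and a least-squares fit with lm() are used as stand-ins. The sketch also positions the legend by keyword rather than by coordinates, which avoids having to know the data range in advance.

```r
set.seed(5)                        # arbitrary seed
N <- 100                           # assumed sample size
x <- rnorm(N, mean = 0, sd = 1.5)  # hypothetical predictor
y <- x + rnorm(N)                  # hypothetical response

plot(x, y, pch = 16, main = "Scatter plot of y and x")
reg <- lm(y ~ x)                   # least-squares line (assumed definition of reg)
abline(reg)

# Position by keyword instead of coordinates; NA suppresses the point
# symbol for the line entry, and lty = 0 suppresses the line for the points
legend("bottomright", legend = c("Production", "Regression line"),
       pch = c(16, NA), lty = c(0, 1))
```

Giving pch and lty one value per legend entry keeps each entry's key consistent with what is drawn: a point for the data and a line for the fit.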

8.1.7 Writing text in a plot

Most charting tools provide no means of writing text or annotations in a plot, or provide one that is very limited. R has the function mtext(), which lets us place text or explanations beside or in the plot. Starting from the bottom of the plot (side=1), we move clockwise round to side 4. The plot command in the following example does not print