TRANSCRIPT
Introduction to Statistics and R
Mayo-Illinois Computational Genomics Workshop (2018)
Ruoqing Zhu, Ph.D. <[email protected]>
https://sites.google.com/site/teazrq/teaching/STAT542
Department of Statistics, University of Illinois at Urbana-Champaign
June 18, 2018
1/49
Overview
• Basic Concepts
• Using R and RStudio
• Data
• Descriptive Statistics
• Hypothesis Testing
• The concept and procedure
• P-value and its interpretation
• Example 1: two sample t-test
• Example 2: simple linear regression
• Example 3: multiple linear regression
• Example 4: multiple testing
• Example 5: high-dimensional data
2/49
Basic Concepts
Using R and RStudio
• R is a free software environment for statistical computing and graphics.
— https://cran.r-project.org/
• RStudio is an integrated development environment (IDE) for R.
— https://www.rstudio.com/
• R packages are “add-ons” for R that offer additional datasets and functionality. They can be managed within RStudio.
• Supplementary material (including all R code) for this lecture is available.
4/49
Using R and RStudio
5/49
Basic Operations
• R comes with standard mathematical and statistical functionality, such as basic calculations and regression.
• We will walk through some basics, especially for dealing with data.
• Then we will compute some descriptive statistics for a dataset.
6/49
Basic Operations
> # Basic mathematical functions
> 1 + 3
[1] 4

> 3 * 5
[1] 15

> 3^5
[1] 243

> exp(2)
[1] 7.389056

> log(3)
[1] 1.098612

> log2(3)
[1] 1.584963

> factorial(5)
[1] 120
7/49
Basic Operations
> # creating a vector
> c(1, 2, 3, 4)
[1] 1 2 3 4

> c("a", "b", "c")
[1] "a" "b" "c"

> # creating a matrix from a vector
> matrix(c(1, 2, 3, 4), 2, 2)
     [,1] [,2]
[1,]    1    3
[2,]    2    4

> x = c(1, 1, 1, 0, 0, 0); y = c(1, 0, 1, 0, 1, 0)
> cbind(x, y)
     x y
[1,] 1 1
[2,] 1 0
[3,] 1 1
[4,] 0 0
[5,] 0 1
[6,] 0 0
8/49
Basic Operations
> # some simple operations
> x[3]
[1] 1
> x[2:5]
[1] 1 1 0 0

> # subsetting a matrix
> cbind(x, y)[1:2, ]
     x y
[1,] 1 1
[2,] 1 0

> # element-wise operations
> (x + y)^2
[1] 4 1 4 0 1 0

> # dimensions
> length(x)
[1] 6
> dim(cbind(x, y))
[1] 6 2
9/49
Basic Operations
> # A warning will be issued when R detects something wrong. Results may still be produced.
> x + c(1, 2, 3, 4)
[1] 2 3 4 4 1 2
Warning message:
In x + c(1, 2, 3, 4) :
  longer object length is not a multiple of shorter object length
> # you can get the reference of a function using a question mark in front of it
> ?mean
> ?t.test
10/49
Descriptive Statistics
• When facing a new dataset, the first thing to do is to summarize the data using descriptive statistics.
• These include, but are not limited to:
• mean, median, quantiles
• variance, standard deviation
• correlation
• Figures and plots can also help us quickly understand patterns in the data.
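These summaries are all one-liners in R. As a quick sketch, using a small made-up vector (not the workshop data):

```r
# a small made-up sample, just for illustration
x = c(2.1, 3.5, 2.8, 4.0, 3.3, 2.9)

mean(x)      # average: 3.1
median(x)    # middle value of the sorted data
quantile(x)  # minimum, quartiles, maximum
var(x)       # sample variance
sd(x)        # standard deviation (square root of the variance)
```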
11/49
Descriptive Statistics
• We will use the Prostate Cancer data ( prostate ) from the ElemStatLearn package.
• This dataset comes from a study by Stamey et al. (1989) that examined prostate specific antigen (PSA) and some clinical measures in 97 men.
• You will need to install the package first, since it is not included in base R.

install.packages("ElemStatLearn")
12/49
Descriptive Statistics
• After installing the package, we can load the data.
• head() is a function that displays the first several rows of the data.

> library(ElemStatLearn)
> data(prostate)
> # we will remove the last column since it's not used in this analysis
> prostate = prostate[, -10]
> dim(prostate)
[1] 97  9
> head(round(prostate, 3))
  lcavol lweight age   lbph svi    lcp gleason pgg45   lpsa
1 -0.580   2.769  50 -1.386   0 -1.386       6     0 -0.431
2 -0.994   3.320  58 -1.386   0 -1.386       6     0 -0.163
3 -0.511   2.691  74 -1.386   0 -1.386       7    20 -0.163
4 -1.204   3.283  58 -1.386   0 -1.386       6     0 -0.163
5  0.751   3.432  62 -1.386   0 -1.386       6     0  0.372
6 -1.050   3.229  50 -1.386   0 -1.386       6     0  0.765
13/49
Descriptive Statistics
• The dataset contains 8 clinical variables and 1 outcome (“lpsa”). Among them, “svi” and “gleason” are categorical variables.

Name     Attribute
lcavol   log cancer volume
lweight  log prostate weight
age      age in years
lbph     log of benign prostatic hyperplasia amount
svi      seminal vesicle invasion
lcp      log of capsular penetration
gleason  Gleason score
pgg45    percent of Gleason scores 4 or 5
lpsa     log of prostate specific antigen (outcome)
14/49
Descriptive Statistics
> summary(prostate)
     lcavol           lweight           age       
 Min.   :-1.3471   Min.   :2.375   Min.   :41.00  
 1st Qu.: 0.5128   1st Qu.:3.376   1st Qu.:60.00  
 Median : 1.4469   Median :3.623   Median :65.00  
 Mean   : 1.3500   Mean   :3.629   Mean   :63.87  
 3rd Qu.: 2.1270   3rd Qu.:3.876   3rd Qu.:68.00  
 Max.   : 3.8210   Max.   :4.780   Max.   :79.00  

      lbph              svi              lcp         
 Min.   :-1.3863   Min.   :0.0000   Min.   :-1.3863  
 1st Qu.:-1.3863   1st Qu.:0.0000   1st Qu.:-1.3863  
 Median : 0.3001   Median :0.0000   Median :-0.7985  
 Mean   : 0.1004   Mean   :0.2165   Mean   :-0.1794  
 3rd Qu.: 1.5581   3rd Qu.:0.0000   3rd Qu.: 1.1787  
 Max.   : 2.3263   Max.   :1.0000   Max.   : 2.9042  

    gleason          pgg45             lpsa        
 Min.   :6.000   Min.   :  0.00   Min.   :-0.4308  
 1st Qu.:6.000   1st Qu.:  0.00   1st Qu.: 1.7317  
 Median :7.000   Median : 15.00   Median : 2.5915  
 Mean   :6.753   Mean   : 24.38   Mean   : 2.4784  
 3rd Qu.:7.000   3rd Qu.: 40.00   3rd Qu.: 3.0564  
 Max.   :9.000   Max.   :100.00   Max.   : 5.5829  
15/49
Descriptive Statistics
> # mean and sd are standard statistical functions
> mean(prostate$lcavol)
[1] 1.35001
> sd(prostate$lcavol)
[1] 1.178625
> # count frequencies for categorical variables
> table(prostate$gleason)
 6  7  8  9 
35 56  1  5 
> # we can view a single continuous variable using a histogram
> hist(prostate$lcavol)

[Figure: histogram of prostate$lpsa, frequency on the vertical axis]
16/49
Scatter Plot and Correlation
• One commonly used method for analyzing two continuous variables is Pearson’s correlation coefficient.
• The correlation of two random variables X and Y is formally defined as

Cor(X, Y) = E[(X − E(X))(Y − E(Y))] / √(Var(X) Var(Y))
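As a sanity check on this definition, the sample version can be computed by hand (covariance divided by the product of standard deviations) and compared with R's built-in cor(). The two vectors below are made up, just for illustration:

```r
# two made-up vectors
x = c(1, 2, 3, 4, 5)
y = c(2, 4, 5, 4, 6)

# sample analogue of the definition
r_manual = cov(x, y) / (sd(x) * sd(y))

r_manual    # same value as...
cor(x, y)   # ...the built-in Pearson correlation
```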
• But what does this mean? Let’s look at the scatter plot of the two variables: lcavol and lpsa.
17/49
Scatter Plot and Correlation
[Scatter plot: cancer volume (lcavol) vs. prostate specific antigen (lpsa)]

Correlation can be calculated using the cor() function:

> cor(prostate$lcavol, prostate$lpsa)
[1] 0.7344603
18/49
Scatter Plot and Correlation
• There is obviously an association between the two variables.
• As lcavol (the horizontal variable) increases, lpsa (the vertical variable) also tends to increase.
• That pattern indicates a positive correlation (denoted as ρXY) between the two random variables.
• But how strong is that relationship? That is quantified by the correlation coefficient.
• Here are some different correlation patterns.
19/49
Scatter Plot and Correlation
[Six scatter plots illustrating different correlation patterns:
 Strong Positive Correlation (0.9), Moderate Positive Correlation (0.5),
 (Nearly) Zero Correlation (two panels),
 Moderate Negative Correlation (-0.5), Strong Negative Correlation (-0.9)]
20/49
Scatter Plot and Correlation
• The previous plots represent 6 different correlation patterns.
• The most interesting ones are the top-right and bottom-left panels, both with correlations close to 0, but for different reasons.
• The bottom-left is due to independence: the two variables are not related whatsoever.
• The top-right is due to symmetry. The two variables are clearly associated, but that association cannot be detected/captured by a linear relationship.
21/49
Summary about correlations
• Pearson’s correlation coefficient describes linear associations between two variables.
• It is bounded between −1 and 1. More extreme values mean stronger associations.
• Some nonlinear relationships may not be detected by Pearson’s correlation coefficient. Solutions?
• transformations of the variables may help
• other measures may be more sensitive, e.g. distance correlation
• Correlation DOES NOT imply causation.
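A tiny made-up example of the transformation idea: a V-shaped relationship has (essentially) zero Pearson correlation, but transforming one variable makes the association linear and detectable:

```r
x = c(-2, -1, 0, 1, 2)
y = abs(x)       # y is completely determined by x, but the relation is V-shaped

cor(x, y)        # essentially 0: Pearson misses the symmetric pattern
cor(abs(x), y)   # 1: after transforming x, the association is perfectly linear
```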
22/49
All pairwise correlations
• When we have multiple variables, it is convenient to look at all pairwise correlations.
• Diagonal elements are always 1 for a correlation matrix, because any variable is perfectly correlated with itself.

> round(cor(prostate), 3)
        lcavol lweight   age   lbph    svi    lcp gleason pgg45  lpsa
lcavol   1.000   0.281 0.225  0.027  0.539  0.675   0.432 0.434 0.734
lweight  0.281   1.000 0.348  0.442  0.155  0.165   0.057 0.107 0.433
age      0.225   0.348 1.000  0.350  0.118  0.128   0.269 0.276 0.170
lbph     0.027   0.442 0.350  1.000 -0.086 -0.007   0.078 0.078 0.180
svi      0.539   0.155 0.118 -0.086  1.000  0.673   0.320 0.458 0.566
lcp      0.675   0.165 0.128 -0.007  0.673  1.000   0.515 0.632 0.549
gleason  0.432   0.057 0.269  0.078  0.320  0.515   1.000 0.752 0.369
pgg45    0.434   0.107 0.276  0.078  0.458  0.632   0.752 1.000 0.422
lpsa     0.734   0.433 0.170  0.180  0.566  0.549   0.369 0.422 1.000
23/49
All pairwise correlations
> # this is done using another package
> library(corrplot)
> corrplot(cor(prostate), type = "upper", tl.col = "black", tl.srt = 35)

[Figure: corrplot of all pairwise correlations among the prostate variables, colored on a scale from -1 to 1]
24/49
Correlation Matrix for Genomic Data
• In genomics, the features are often ordered in a meaningful way.
• Correlation plots can help us see the patterns.

Example: Linkage disequilibrium plot from Pistis et al. (2013)
25/49
Hypothesis Testing
Overview
Why do we need statistics?
— Analyze and extract information from data, and make inferences
— A rigorous mathematical framework that addresses pitfalls in the data collection process
— Model fitting, variable selection, and hypothesis testing
— Use statistical models in scientific discoveries
27/49
The Lady Tasting Tea Problem
The Lady Tasting Tea: A real story about the origin of statistics
• In 1920s Cambridge, England, a lady named Muriel Bristol claimed to be able to tell whether the tea or the milk was added to a cup first!
• The statistician Ronald A. Fisher designed a method to test whether she had that ability.
“The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century” (2001) by David Salsburg
28/49
Fisher’s exact test
• 8 cups of tea were prepared, 4 with milk added first and 4 with tea added first. The lady was asked to identify the 4 cups prepared by one method.
• There are 8!/(4!(8−4)!) = 70 possible results in total.
• IF she does not have the ability to identify them and just guesses at random, the probability of successfully identifying...
• At least 2 cups: 53/70 ≈ 75.7%
• At least 3 cups: 17/70 ≈ 24.3%
• All 4 cups: 1/70 ≈ 1.4%
• She identified all 4 cups! What conclusion can we draw? Is it “likely” that she has the ability?

[Photo: Sir Ronald A. Fisher (1890-1962)]
29/49
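The probabilities above come from a simple counting argument, which can be sketched in R with choose():

```r
# 70 equally likely ways for the lady to pick 4 cups out of 8
total = choose(8, 4)

# number of guesses with exactly k of the 4 milk-first cups correct:
# choose k correct cups from the 4 milk-first, and 4-k from the 4 tea-first
exactly = function(k) choose(4, k) * choose(4, 4 - k)

exactly(4) / total          # all 4 correct: 1/70
sum(exactly(3:4)) / total   # at least 3 correct: 17/70
sum(exactly(2:4)) / total   # at least 2 correct: 53/70
```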
The Lady Tasting Tea
Some key steps in this story (hypothesis testing):
1) Form null and alternative hypotheses:
   Null H0: Random Guessing vs. Alt H1: Not Random Guessing
2) Decide a significance level, e.g. α = 0.05.
3) Perform an experiment and observe that the lady identified all 4 cups correctly.
4) If the Null hypothesis is correct, there is only a 1.4% chance that one can guess all 4 correctly.
5) This is a “small probability event” (< α), so we make a conclusion (reject the Null).
30/49
The Lady Tasting Tea
• These steps are known as hypothesis testing.
• The 1.4% that we calculated is called the p-value.
• Intuitive definition: if the Null hypothesis is correct, what is the chance of observing what we observed (or even more extreme values)?
• However, does rejecting the Null hypothesis mean that the lady actually has the ability to tell the difference?
• No: whatever conclusion we make, it could be just due to randomness (she is lucky or unlucky).
31/49
Example 1: two sample t test
• I am interested in testing whether the binary variable svi (seminal vesicle invasion) is associated with the continuous outcome lpsa.
• Null H0: Means of lpsa are the same for different svi status
• Alt H1: They are not the same
• This can be done using a two-sample t test.
• Consider subjects with svi = 0 as one set of samples, and svi = 1 as the other set, both with lpsa measured.
• The test can be done using the function t.test()
32/49
Example 1: two sample t test
• One way to do it is to create two vectors corresponding to the samples of each group.

> g1 = prostate$lpsa[prostate$svi == 1]
> g0 = prostate$lpsa[prostate$svi == 0]
> t.test(g0, g1)

	Welch Two Sample t-test

data:  g0 and g1
t = -6.8578, df = 33.027, p-value = 7.879e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.047129 -1.110409
sample estimates:
mean of x mean of y 
 2.136592  3.715360 
33/49
Example 1: two sample t test
• In R, some alternative coding will give us exactly the same result.

> t.test(lpsa ~ svi, data = prostate)

	Welch Two Sample t-test

data:  lpsa by svi
t = -6.8578, df = 33.027, p-value = 7.879e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.047129 -1.110409
sample estimates:
mean in group 0 mean in group 1 
       2.136592        3.715360 
34/49
Example 1: two sample t test
• The p-value is 7.879e-08 (highly significant), meaning that we will reject the Null hypothesis.
• Note that this function makes some assumptions (we will not cover them) by default; in particular, unequal variances.
• Important note: rejecting the Null hypothesis DOES NOT mean that the Null hypothesis is indeed wrong, similar to the lady tasting tea problem.
• How do we interpret this p-value?
35/49
Example 1: two sample t test
• It is possible to make a wrong conclusion using hypothesis testing procedures.

            H0 is true     H0 is false
Accept H0   ✓              Type II Error
Reject H0   Type I Error   ✓

• Four situations:
• If H0 is true, we could still reject H0 with probability α (bad); we accept H0 with probability 1 − α (good).
• If H0 is false, it's difficult to know the exact probabilities unless we make some more assumptions. However, it's still possible to make mistakes regardless.
• Knowing what statistics can or cannot do is crucial.
36/49
Example 2: regression model
• Suppose that lpsa can be modeled by the svi status through the relationship

lpsa ∼ β0 + β1 svi

• This is a regression model, and also a one-way ANOVA.
• If β1 is nonzero (we do not know), then svi will affect lpsa.
• Expected lpsa when svi = 0 is β0
• Expected lpsa when svi = 1 is β0 + β1
• Null H0: β1 = 0
• Alt H1: β1 ≠ 0
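With a single binary predictor, the fitted β1 is exactly the difference in group means, which links this model to the two-sample comparison. A minimal sketch with simulated data (not the prostate data):

```r
# simulated data: binary predictor with a true group difference of 1.5
set.seed(42)
svi  = rep(c(0, 1), each = 20)
lpsa = 2 + 1.5 * svi + rnorm(40)

fit = lm(lpsa ~ svi)
coef(fit)["svi"]                              # estimated beta_1
mean(lpsa[svi == 1]) - mean(lpsa[svi == 0])   # the same number
```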
37/49
Example 2: regression model
> fit = glm(lpsa ~ svi, data = prostate, family = "gaussian")
> summary(fit)

Call:
glm(formula = lpsa ~ svi, family = "gaussian", data = prostate)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.56737  -0.64035  -0.00301   0.66979   1.89321  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.1366     0.1097  19.474  < 2e-16 ***
svi           1.5788     0.2358   6.696  1.5e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.9148091)
    Null deviance: 127.918  on 96  degrees of freedom
Residual deviance:  86.907  on 95  degrees of freedom
AIC: 270.62

Number of Fisher Scoring iterations: 2
38/49
Example 2: regression model
• lpsa ~ svi specifies the regression model. The intercept is included by default.
• data = prostate specifies the dataset that contains the variables lpsa and svi.
• family = "gaussian" tells R that the outcome variable is a continuous data type (Gaussian distribution).
• The result shows that we have significant evidence (α = 0.05) to claim that the lpsa level differs by svi status.

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.1366     0.1097  19.474  < 2e-16 ***
svi           1.5788     0.2358   6.696  1.5e-09 ***
39/49
Example 3: multiple regression model
• What if we have many variables that we want to use to model the outcome?
• Some of them may be continuous, some categorical. For example,

lpsa ∼ β0 + β1 svi + β2 lcavol + β3 age

• This allows us to analyze the effect of each variable while accounting for the effects of the others.
• Why is this not the same as running them separately? e.g. Simpson's paradox
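Simpson's paradox can be simulated directly. Below is a sketch with made-up data in which the within-group effect of x on y is negative, but the pooled slope is positive because the two groups differ in both x and y:

```r
set.seed(1)
group = rep(c(0, 1), each = 50)
x = rnorm(100) + 4 * group        # group 1 has much larger x on average
y = -x + 8 * group + rnorm(100)   # within each group, the effect of x is -1

coef(lm(y ~ x))["x"]              # pooled slope: positive (misleading)
coef(lm(y ~ x + group))["x"]      # adjusted slope: close to the true -1
```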
40/49
Example 3: multiple regression model
• The effect of a variable (x) on the outcome (y) may change if we further condition on a third variable (red/blue group).
(Figure from Wikipedia)
41/49
Example 3: multiple regression model
> fit = glm(lpsa ~ svi + lcavol + age, data = prostate, family = "gaussian")
> summary(fit)

Call:
glm(formula = lpsa ~ svi + lcavol + age, family = "gaussian", 
    data = prostate)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6112  -0.5279   0.1208   0.4890   1.6872  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.4808297  0.6725103   2.202  0.03014 *  
svi         0.6698231  0.2223268   3.013  0.00333 ** 
lcavol      0.5913356  0.0795912   7.430 5.17e-11 ***
age         0.0008492  0.0106885   0.079  0.93685    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
42/49
Example 3: multiple regression model
• lcavol seems to be a very strong predictor of the outcome lpsa, and it “took away” some of the effect explained by svi in the previous model.
• As we keep adding/removing variables, the effects (parameters) may change, even dramatically.
• However, we usually do not want to include too many variables (relative to the sample size), because that makes the model very “fragile”: very unstable parameter estimates.
43/49
Example 4: multiple testing
• The more hypothesis tests we perform, the more likely we are to see a significant p-value just by chance, even if the null is true.
• Everyone sitting in this classroom could participate in the tea tasting experiment and select the cups randomly. In that case, we would probably see a few people who identify all cups correctly.
• Performing thousands of tests on gene expressions one by one may allow us to find many significant results; however, they could all just be due to randomness.
44/49
Example 4: multiple testing
> # we generate independent x and y values so that they are not associated
> pvalues = rep(NA, 1000)
> for (i in 1:1000)
+ {
+   x = rbinom(20, 1, 0.5)
+   y = rnorm(20)
+   pvalues[i] = t.test(y ~ x)$p.value
+ }
> # however, a number of them will turn out significant
> table(pvalues <= 0.05)

FALSE  TRUE 
  946    54 
45/49
Example 5: high-dimensional data
• When fitting a linear regression model with too many variables (e.g. gene expressions), we could overfit the model badly.
• This is particularly true when the number of variables exceeds the number of observations (see an example in the supplementary file).
• Some machine learning models can handle such situations better: they tend to select only the important variables and eliminate noise variables.
• The Lasso and Random Forests are two popular models.
46/49
Example 5: high-dimensional data
> # 55 observations with 50 variables
> set.seed(1); n = 55; p = 50
> x = matrix(rnorm(n*p), n, p)
> # y is independent of x
> y = rnorm(n)
> allcoef = summary(lm(y ~ x))$coefficients
> 
> # just the variables that are estimated to be significant
> allcoef[allcoef[, 4] < 0.05, ]

      Estimate Std. Error   t value   Pr(>|t|)
x8   1.7322432  0.5750871  3.012140 0.03946712
x13 -2.4388295  0.7739399 -3.151187 0.03447465
x16 -1.7217647  0.4904563 -3.510537 0.02465994
x17  2.0557598  0.6792156  3.026667 0.03890760
x20  0.9637801  0.2222795  4.335892 0.01229258
x24 -1.2110275  0.3670152 -3.299666 0.02994279
x33  0.6050861  0.2096115  2.886703 0.04471076
x41  0.9111477  0.2967598  3.070321 0.03728131
x43  1.2065449  0.3980097  3.031446 0.03872559
x45  1.2441839  0.3081686  4.037348 0.01563824
x47  1.1829540  0.3031894  3.901700 0.01751626
47/49
Example 5: high-dimensional data
> # the lasso model is implemented in the "glmnet" package
> library(glmnet)
> # we use cross-validation to select the best model
> lasso.fit = cv.glmnet(x, y)
> # the lasso model estimates all coefficients to be 0 except the intercept term
> sum(as.numeric(coef(lasso.fit)) != 0)
[1] 1
48/49
Questions?
49/49