TRANSCRIPT
Introduction to Statistics and R
Mayo-Illinois Computational Genomics Workshop (2018)
Ruoqing Zhu, Ph.D. <[email protected]>
https://sites.google.com/site/teazrq/teaching/STAT542
Department of Statistics, University of Illinois at Urbana-Champaign
June 18, 2018
1/49
Overview
• Basic Concepts
• Using R and RStudio
• Data
• Descriptive Statistics
• Hypothesis Testing
• The concept and procedure
• P-value and its interpretation
• Example 1: two sample t-test
• Example 2: simple linear regression
• Example 3: multiple linear regression
• Example 4: multiple testing
• Example 5: high-dimensional data
2/49
Basic Concepts
Using R and RStudio
• R is a free software environment for statistical computing and graphics.
— https://cran.r-project.org/
• RStudio is an integrated development environment (IDE) for R.
— https://www.rstudio.com/
• R packages are “add-ons” for R that offer additional datasets and functionality. They can be managed within RStudio.
• Supplementary material (including all R code) for this lecture is available.
4/49
Using R and RStudio
5/49
Basic Operations
• R comes with standard mathematical and statistical functionality, such as basic calculations and regression.
• We will walk through some basics, especially for dealing with data.
• Then we will compute some descriptive statistics for a dataset.
6/49
Basic Operations
> # Basic mathematical functions
> 1 + 3
[1] 4

> 3 * 5
[1] 15

> 3^5
[1] 243

> exp(2)
[1] 7.389056

> log(3)
[1] 1.098612

> log2(3)
[1] 1.584963

> factorial(5)
[1] 120
7/49
Basic Operations
> # creating a vector
> c(1, 2, 3, 4)
[1] 1 2 3 4

> c("a", "b", "c")
[1] "a" "b" "c"

> # creating a matrix from a vector
> matrix(c(1, 2, 3, 4), 2, 2)
     [,1] [,2]
[1,]    1    3
[2,]    2    4

> x = c(1, 1, 1, 0, 0, 0); y = c(1, 0, 1, 0, 1, 0)
> cbind(x, y)
     x y
[1,] 1 1
[2,] 1 0
[3,] 1 1
[4,] 0 0
[5,] 0 1
[6,] 0 0
8/49
Basic Operations
> # some simple operations
> x[3]
[1] 1
> x[2:5]
[1] 1 1 0 0

> # subsetting a matrix
> cbind(x, y)[1:2, ]
     x y
[1,] 1 1
[2,] 1 0

> # element-wise operations
> (x + y)^2
[1] 4 1 4 0 1 0

> # dimensions
> length(x)
[1] 6
> dim(cbind(x, y))
[1] 6 2
9/49
Basic Operations
> # A warning will be issued when R detects something wrong. Results may still be produced.
> x + c(1, 2, 3, 4)
[1] 2 3 4 4 1 2
Warning message:
In x + c(1, 2, 3, 4) :
  longer object length is not a multiple of shorter object length
> # you can get the reference of a function using a question mark in front of it
> ?mean
> ?t.test
10/49
Descriptive Statistics
• When facing a new dataset, the first thing to do is to summarize the data using descriptive statistics.
• These include, but are not limited to:
• mean, median, quantiles
• variance, standard deviation
• correlation
• Figures and plots can also help us quickly understand patterns in the data.
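These summaries are all one-liners in R. As a quick sketch, using a small made-up vector (not the workshop data):

```r
# a small made-up sample, just for illustration
x = c(2.1, 3.5, 2.8, 4.0, 3.3, 2.9)

mean(x)      # average: 3.1
median(x)    # middle value of the sorted data
quantile(x)  # minimum, quartiles, maximum
var(x)       # sample variance
sd(x)        # standard deviation (square root of the variance)
```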
11/49
Descriptive Statistics
• We will use the Prostate Cancer data ( prostate ) from the ElemStatLearn package.
• This dataset comes from a study by Stamey et al. (1989) that examined prostate specific antigen (PSA) and some clinical measures in 97 men.
• You will need to install the package first, since it is not included in base R.

install.packages("ElemStatLearn")
12/49
Descriptive Statistics
• After installing the package, we can load the data.
• head() is a function that displays the first several rows of the data.

> library(ElemStatLearn)
> data(prostate)
> # we will remove the last column since it's not used in this analysis
> prostate = prostate[, -10]
> dim(prostate)
[1] 97  9
> head(round(prostate, 3))
  lcavol lweight age   lbph svi    lcp gleason pgg45   lpsa
1 -0.580   2.769  50 -1.386   0 -1.386       6     0 -0.431
2 -0.994   3.320  58 -1.386   0 -1.386       6     0 -0.163
3 -0.511   2.691  74 -1.386   0 -1.386       7    20 -0.163
4 -1.204   3.283  58 -1.386   0 -1.386       6     0 -0.163
5  0.751   3.432  62 -1.386   0 -1.386       6     0  0.372
6 -1.050   3.229  50 -1.386   0 -1.386       6     0  0.765
13/49
Descriptive Statistics
• The dataset contains 8 clinical variables and 1 outcome (“lpsa”). Among them, “svi” and “gleason” are categorical variables.

Name     Attribute
lcavol   log cancer volume
lweight  log prostate weight
age      age in years
lbph     log of benign prostatic hyperplasia amount
svi      seminal vesicle invasion
lcp      log of capsular penetration
gleason  Gleason score
pgg45    percent of Gleason scores 4 or 5
lpsa     log of prostate specific antigen (outcome)
14/49
Descriptive Statistics
> summary(prostate)
     lcavol           lweight           age       
 Min.   :-1.3471   Min.   :2.375   Min.   :41.00  
 1st Qu.: 0.5128   1st Qu.:3.376   1st Qu.:60.00  
 Median : 1.4469   Median :3.623   Median :65.00  
 Mean   : 1.3500   Mean   :3.629   Mean   :63.87  
 3rd Qu.: 2.1270   3rd Qu.:3.876   3rd Qu.:68.00  
 Max.   : 3.8210   Max.   :4.780   Max.   :79.00  

      lbph              svi              lcp         
 Min.   :-1.3863   Min.   :0.0000   Min.   :-1.3863  
 1st Qu.:-1.3863   1st Qu.:0.0000   1st Qu.:-1.3863  
 Median : 0.3001   Median :0.0000   Median :-0.7985  
 Mean   : 0.1004   Mean   :0.2165   Mean   :-0.1794  
 3rd Qu.: 1.5581   3rd Qu.:0.0000   3rd Qu.: 1.1787  
 Max.   : 2.3263   Max.   :1.0000   Max.   : 2.9042  

    gleason          pgg45             lpsa        
 Min.   :6.000   Min.   :  0.00   Min.   :-0.4308  
 1st Qu.:6.000   1st Qu.:  0.00   1st Qu.: 1.7317  
 Median :7.000   Median : 15.00   Median : 2.5915  
 Mean   :6.753   Mean   : 24.38   Mean   : 2.4784  
 3rd Qu.:7.000   3rd Qu.: 40.00   3rd Qu.: 3.0564  
 Max.   :9.000   Max.   :100.00   Max.   : 5.5829  
15/49
Descriptive Statistics
> # mean and sd are standard statistical functions
> mean(prostate$lcavol)
[1] 1.35001
> sd(prostate$lcavol)
[1] 1.178625
> # count frequencies for categorical variables
> table(prostate$gleason)
 6  7  8  9 
35 56  1  5 
> # we can view a single continuous variable using a histogram
> hist(prostate$lcavol)

[Figure: histogram of prostate$lpsa, frequency on the vertical axis]
16/49
Scatter Plot and Correlation
• One commonly used method for analyzing two continuous variables is Pearson’s correlation coefficient.
• The correlation of two random variables X and Y is formally defined as

Cor(X, Y) = E[(X − E(X))(Y − E(Y))] / √(Var(X) Var(Y))
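As a sanity check on this definition, the sample version can be computed by hand (covariance divided by the product of standard deviations) and compared with R's built-in cor(). The two vectors below are made up, just for illustration:

```r
# two made-up vectors
x = c(1, 2, 3, 4, 5)
y = c(2, 4, 5, 4, 6)

# sample analogue of the definition
r_manual = cov(x, y) / (sd(x) * sd(y))

r_manual    # same value as...
cor(x, y)   # ...the built-in Pearson correlation
```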
• But what does this mean? Let’s look at the scatter plot of the two variables: lcavol and lpsa.
17/49
Scatter Plot and Correlation
[Scatter plot: cancer volume (lcavol) vs. prostate specific antigen (lpsa)]

Correlation can be calculated using the cor() function:

> cor(prostate$lcavol, prostate$lpsa)
[1] 0.7344603
18/49
Scatter Plot and Correlation
• There is obviously an association between the two variables.
• As lcavol (the horizontal variable) increases, lpsa (the vertical variable) also tends to increase.
• That pattern indicates a positive correlation (denoted as ρXY) between the two random variables.
• But how strong is that relationship? That is quantified by the correlation coefficient.
• Here are some different correlation patterns.
19/49
Scatter Plot and Correlation
[Six scatter plots illustrating different correlation patterns:
 Strong Positive Correlation (0.9), Moderate Positive Correlation (0.5),
 (Nearly) Zero Correlation (two panels),
 Moderate Negative Correlation (-0.5), Strong Negative Correlation (-0.9)]
20/49
Scatter Plot and Correlation
• The previous plots represent 6 different correlation patterns.
• The most interesting ones are the top-right and bottom-left panels, both with correlations close to 0, but for different reasons.
• The bottom-left is due to independence: the two variables are not related whatsoever.
• The top-right is due to symmetry. The two variables are clearly associated, but that association cannot be detected/captured by a linear relationship.
21/49
Summary about correlations
• Pearson’s correlation coefficient describes linear associations between two variables.
• It is bounded between −1 and 1. More extreme values mean stronger associations.
• Some nonlinear relationships may not be detected by Pearson’s correlation coefficient. Solutions?
• transformations of the variables may help
• other measures may be more sensitive, e.g. distance correlation
• Correlation DOES NOT imply causation.
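A tiny made-up example of the transformation idea: a V-shaped relationship has (essentially) zero Pearson correlation, but transforming one variable makes the association linear and detectable:

```r
x = c(-2, -1, 0, 1, 2)
y = abs(x)       # y is completely determined by x, but the relation is V-shaped

cor(x, y)        # essentially 0: Pearson misses the symmetric pattern
cor(abs(x), y)   # 1: after transforming x, the association is perfectly linear
```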
22/49
All pairwise correlations
• When we have multiple variables, it is convenient to look at all pairwise correlations.
• Diagonal elements are always 1 for a correlation matrix, because any variable is perfectly correlated with itself.

> round(cor(prostate), 3)
        lcavol lweight   age   lbph    svi    lcp gleason pgg45  lpsa
lcavol   1.000   0.281 0.225  0.027  0.539  0.675   0.432 0.434 0.734
lweight  0.281   1.000 0.348  0.442  0.155  0.165   0.057 0.107 0.433
age      0.225   0.348 1.000  0.350  0.118  0.128   0.269 0.276 0.170
lbph     0.027   0.442 0.350  1.000 -0.086 -0.007   0.078 0.078 0.180
svi      0.539   0.155 0.118 -0.086  1.000  0.673   0.320 0.458 0.566
lcp      0.675   0.165 0.128 -0.007  0.673  1.000   0.515 0.632 0.549
gleason  0.432   0.057 0.269  0.078  0.320  0.515   1.000 0.752 0.369
pgg45    0.434   0.107 0.276  0.078  0.458  0.632   0.752 1.000 0.422
lpsa     0.734   0.433 0.170  0.180  0.566  0.549   0.369 0.422 1.000
23/49
All pairwise correlations
> # this is done using another package
> library(corrplot)
> corrplot(cor(prostate), type = "upper", tl.col = "black", tl.srt = 35)

[Figure: corrplot of all pairwise correlations among the prostate variables, colored on a scale from -1 to 1]
24/49
Correlation Matrix for Genomic Data
• In genomics, the features are often ordered in a meaningful way.
• Correlation plots can help us see the patterns.

Example: Linkage disequilibrium plot from Pistis et al. (2013)
25/49
Hypothesis Testing
Overview
Why do we need statistics?
— Analyze and extract information from data, and make inferences
— A rigorous mathematical framework that addresses pitfalls in the data collection process
— Model fitting, variable selection, and hypothesis testing
— Use statistical models in scientific discoveries
27/49
The Lady Tasting Tea Problem
The Lady Tasting Tea: A real story about the origin of statistics
• In 1920s Cambridge, England, a lady named Muriel Bristol claimed to be able to tell whether the tea or the milk was added to a cup first!
• The statistician Ronald A. Fisher designed a method to test whether she had that ability.
“The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century” (2001) by David Salsburg
28/49
Fisher’s exact test
• 8 cups of tea were prepared, 4 with milk added first and 4 with tea added first. The lady was asked to identify the 4 cups prepared by one method.
• There are 8!/(4!(8−4)!) = 70 possible results in total.
• IF she does not have the ability to identify them and just guesses at random, the probability of successfully identifying...
• At least 2 cups: 53/70 ≈ 75.7%
• At least 3 cups: 17/70 ≈ 24.3%
• All 4 cups: 1/70 ≈ 1.4%
• She identified all 4 cups! What conclusion can we draw? Is it “likely” that she has the ability?

[Photo: Sir Ronald A. Fisher (1890-1962)]
29/49
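The probabilities above come from a simple counting argument, which can be sketched in R with choose():

```r
# 70 equally likely ways for the lady to pick 4 cups out of 8
total = choose(8, 4)

# number of guesses with exactly k of the 4 milk-first cups correct:
# choose k correct cups from the 4 milk-first, and 4-k from the 4 tea-first
exactly = function(k) choose(4, k) * choose(4, 4 - k)

exactly(4) / total          # all 4 correct: 1/70
sum(exactly(3:4)) / total   # at least 3 correct: 17/70
sum(exactly(2:4)) / total   # at least 2 correct: 53/70
```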
The Lady Tasting Tea
Some key steps in this story (hypothesis testing):
1) Form null and alternative hypotheses:
   Null H0: Random Guessing vs. Alt H1: Not Random Guessing
2) Decide a significance level, e.g. α = 0.05.
3) Perform an experiment and observe that the lady identified all 4 cups correctly.
4) If the Null hypothesis is correct, there is only a 1.4% chance that one can guess all 4 correctly.
5) This is a “small probability event” (< α), so we make a conclusion (reject the Null).
30/49
The Lady Tasting Tea
• These steps are known as hypothesis testing.
• The 1.4% that we calculated is called the p-value.
• Intuitive definition: if the Null hypothesis is correct, what is the chance of observing what we observed (or even more extreme values)?
• However, does rejecting the Null hypothesis mean that the lady actually has the ability to tell the difference?
• No: whatever conclusion we make, it could be just due to randomness (she is lucky or unlucky).
31/49
Example 1: two sample t test
• I am interested in testing whether the binary variable svi (seminal vesicle invasion) is associated with the continuous outcome lpsa.
• Null H0: Means of lpsa are the same for different svi status
• Alt H1: They are not the same
• This can be done using a two-sample t test.
• Consider subjects with svi = 0 as one set of samples, and svi = 1 as the other set, both with lpsa measured.
• The test can be done using the function t.test()
32/49
Example 1: two sample t test
• One way to do it is to create two vectors corresponding to the samples of each group.

> g1 = prostate$lpsa[prostate$svi == 1]
> g0 = prostate$lpsa[prostate$svi == 0]
> t.test(g0, g1)

	Welch Two Sample t-test

data:  g0 and g1
t = -6.8578, df = 33.027, p-value = 7.879e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.047129 -1.110409
sample estimates:
mean of x mean of y 
 2.136592  3.715360 
33/49
Example 1: two sample t test
• In R, some alternative coding will give us exactly the same result.

> t.test(lpsa ~ svi, data = prostate)

	Welch Two Sample t-test

data:  lpsa by svi
t = -6.8578, df = 33.027, p-value = 7.879e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.047129 -1.110409
sample estimates:
mean in group 0 mean in group 1 
       2.136592        3.715360 
34/49
Example 1: two sample t test
• The p-value is 7.879e-08 (highly significant), meaning that we will reject the Null hypothesis.
• Note that this function makes some assumptions (we will not cover them) by default; in particular, unequal variances.
• Important note: rejecting the Null hypothesis DOES NOT mean that the Null hypothesis is indeed wrong, similar to the lady tasting tea problem.
• How do we interpret this p-value?
35/49
Example 1: two sample t test
• It is possible to make a wrong conclusion using hypothesis testing procedures.

            H0 is true     H0 is false
Accept H0   ✓              Type II Error
Reject H0   Type I Error   ✓

• Four situations:
• If H0 is true, we could still reject H0 with probability α (bad); we accept H0 with probability 1 − α (good).
• If H0 is false, it's difficult to know the exact probabilities unless we make some more assumptions. However, it's still possible to make mistakes regardless.
• Knowing what statistics can or cannot do is crucial.
36/49
Example 2: regression model
• Suppose that lpsa can be modeled by the svi status through the relationship

lpsa ∼ β0 + β1 svi

• This is a regression model, and also a one-way ANOVA.
• If β1 is nonzero (we do not know), then svi will affect lpsa.
• Expected lpsa when svi = 0 is β0
• Expected lpsa when svi = 1 is β0 + β1
• Null H0: β1 = 0
• Alt H1: β1 ≠ 0
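With a single binary predictor, the fitted β1 is exactly the difference in group means, which links this model to the two-sample comparison. A minimal sketch with simulated data (not the prostate data):

```r
# simulated data: binary predictor with a true group difference of 1.5
set.seed(42)
svi  = rep(c(0, 1), each = 20)
lpsa = 2 + 1.5 * svi + rnorm(40)

fit = lm(lpsa ~ svi)
coef(fit)["svi"]                              # estimated beta_1
mean(lpsa[svi == 1]) - mean(lpsa[svi == 0])   # the same number
```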
37/49
Example 2: regression model
> fit = glm(lpsa ~ svi, data = prostate, family = "gaussian")
> summary(fit)

Call:
glm(formula = lpsa ~ svi, family = "gaussian", data = prostate)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.56737  -0.64035  -0.00301   0.66979   1.89321  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.1366     0.1097  19.474  < 2e-16 ***
svi           1.5788     0.2358   6.696  1.5e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.9148091)
    Null deviance: 127.918  on 96  degrees of freedom
Residual deviance:  86.907  on 95  degrees of freedom
AIC: 270.62

Number of Fisher Scoring iterations: 2
38/49
Example 2: regression model
• lpsa ~ svi specifies the regression model. The intercept is included by default.
• data = prostate specifies the dataset that contains the variables lpsa and svi.
• family = "gaussian" tells R that the outcome variable is a continuous data type (Gaussian distribution).
• The result shows that we have significant evidence (α = 0.05) to claim that the lpsa level differs by svi status.

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.1366     0.1097  19.474  < 2e-16 ***
svi           1.5788     0.2358   6.696  1.5e-09 ***
39/49
Example 3: multiple regression model
• What if we have many variables that we want to use to model the outcome?
• Some of them may be continuous, some categorical. For example,

lpsa ∼ β0 + β1 svi + β2 lcavol + β3 age

• This allows us to analyze the effect of each variable while accounting for the effects of the others.
• Why is this not the same as running them separately? e.g. Simpson's paradox
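Simpson's paradox can be simulated directly. Below is a sketch with made-up data in which the within-group effect of x on y is negative, but the pooled slope is positive because the two groups differ in both x and y:

```r
set.seed(1)
group = rep(c(0, 1), each = 50)
x = rnorm(100) + 4 * group        # group 1 has much larger x on average
y = -x + 8 * group + rnorm(100)   # within each group, the effect of x is -1

coef(lm(y ~ x))["x"]              # pooled slope: positive (misleading)
coef(lm(y ~ x + group))["x"]      # adjusted slope: close to the true -1
```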
40/49
Example 3: multiple regression model
• The effect of a variable (x) on the outcome (y) may change if we further condition on a third variable (red/blue group).
(Figure from Wikipedia)
41/49
Example 3: multiple regression model
> fit = glm(lpsa ~ svi + lcavol + age, data = prostate, family = "gaussian")
> summary(fit)

Call:
glm(formula = lpsa ~ svi + lcavol + age, family = "gaussian", 
    data = prostate)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6112  -0.5279   0.1208   0.4890   1.6872  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.4808297  0.6725103   2.202  0.03014 *  
svi         0.6698231  0.2223268   3.013  0.00333 ** 
lcavol      0.5913356  0.0795912   7.430 5.17e-11 ***
age         0.0008492  0.0106885   0.079  0.93685    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
42/49
Example 3: multiple regression model
• lcavol seems to be a very strong predictor of the outcome lpsa, and it “took away” some of the effect explained by svi in the previous model.
• As we keep adding/removing variables, the effects (parameters) may change, even dramatically.
• However, we usually do not want to include too many variables (relative to the sample size), because that makes the model very “fragile”: very unstable parameter estimates.
43/49
Example 4: multiple testing
• The more hypothesis tests we perform, the more likely we are to see a significant p-value just by chance, even if the null is true.
• Everyone sitting in this classroom could participate in the tea tasting experiment and select the cups randomly. In that case, we would probably see a few people who identify all cups correctly.
• Performing thousands of tests on gene expressions one by one may allow us to find many significant results; however, they could all just be due to randomness.
44/49
Example 4: multiple testing
> # we generate independent x and y values so that they are not associated
> pvalues = rep(NA, 1000)
> for (i in 1:1000)
+ {
+   x = rbinom(20, 1, 0.5)
+   y = rnorm(20)
+   pvalues[i] = t.test(y ~ x)$p.value
+ }
> # however, a number of them will turn out significant
> table(pvalues <= 0.05)

FALSE  TRUE 
  946    54 
45/49
Example 5: high-dimensional data
• When fitting a linear regression model with too many variables (e.g. gene expressions), we could overfit the model badly.
• This is particularly true when the number of variables exceeds the number of observations (see an example in the supplementary file).
• Some machine learning models can handle such situations better: they tend to select only the important variables and eliminate noise variables.
• The Lasso and Random Forests are two popular models.
46/49
Example 5: high-dimensional data
> # 55 observations with 50 variables
> set.seed(1); n = 55; p = 50
> x = matrix(rnorm(n*p), n, p)
> # y is independent of x
> y = rnorm(n)
> allcoef = summary(lm(y ~ x))$coefficients
> 
> # just the variables that are estimated to be significant
> allcoef[allcoef[, 4] < 0.05, ]

      Estimate Std. Error   t value   Pr(>|t|)
x8   1.7322432  0.5750871  3.012140 0.03946712
x13 -2.4388295  0.7739399 -3.151187 0.03447465
x16 -1.7217647  0.4904563 -3.510537 0.02465994
x17  2.0557598  0.6792156  3.026667 0.03890760
x20  0.9637801  0.2222795  4.335892 0.01229258
x24 -1.2110275  0.3670152 -3.299666 0.02994279
x33  0.6050861  0.2096115  2.886703 0.04471076
x41  0.9111477  0.2967598  3.070321 0.03728131
x43  1.2065449  0.3980097  3.031446 0.03872559
x45  1.2441839  0.3081686  4.037348 0.01563824
x47  1.1829540  0.3031894  3.901700 0.01751626
47/49
Example 5: high-dimensional data
> # the lasso model is implemented in the "glmnet" package
> library(glmnet)
> # we use cross-validation to select the best model
> lasso.fit = cv.glmnet(x, y)
> # the lasso model estimates all coefficients to be 0 except the intercept term
> sum(as.numeric(coef(lasso.fit)) != 0)
[1] 1
48/49
Questions?
49/49