Multiple Comparisons in Microarray Data Analysis

Statistics in Biopharmaceutical Research
Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/usbr20

Multiple Comparisons in Microarray Data Analysis
Donghui Zhang & Li Liu
Published online: 01 Jan 2012.

To cite this article: Donghui Zhang & Li Liu (2010) Multiple Comparisons in Microarray Data Analysis, Statistics in Biopharmaceutical Research, 2:3, 368-382, DOI: 10.1198/sbr.2009.08086

To link to this article: http://dx.doi.org/10.1198/sbr.2009.08086

Multiple Comparisons in Microarray Data Analysis

Donghui ZHANG and Li LIU

Multiplicity is a challenging statistical issue in drug discovery, and microarray studies are a particular example. The traditional approach of controlling the family-wise error rate (FWER) is conservative when the number of tests is large. A more appropriate approach is to control the false discovery rate (FDR). Since the development of the Benjamini and Hochberg (BH) FDR procedure in 1995, many modifications have been proposed, aimed at relaxing the requirement for independent test statistics or improving the power of the BH FDR procedure. Comparisons of these procedures in the current literature are not comprehensive, and the conclusions on performance are inconsistent. The objectives of this article are threefold: (a) to perform a more comprehensive comparison of extant multiple testing procedures using two real microarray datasets and various simulated datasets; (b) to explore potential reasons for the inconsistencies in published simulation results; and (c) to identify suitable FDR procedures under different scenarios according to covariance structure, percentage of true null hypotheses among multiple tests, and sample size.

Key Words: False discovery rate; Family-wise error rate; Multiple testing.

1. Introduction

Advances in biology, chemistry, engineering, and automation provide researchers in the pharmaceutical industry with many enabling technologies. For example, microarray technology allows the monitoring of thousands of gene expression levels in a single experiment, and the fMRI technique provides a noninvasive way of studying drug effects and functional activity in the living brain.

A primary interest of microarray studies is the identification of differentially expressed genes under different treatment conditions or in different population classes, and a central question in brain imaging studies is to find areas of the brain that differ between two groups, for example, a control group and a treated group, or subjects with and without a disease. Related situations also arise in clinical research looking at occurrences of adverse events by type of event or at outcomes in patient subgroups. Statistical methods for these data are similar and can be broken down into three main steps: (1) normalization/transformation; (2) univariate modeling for computation of statistics that test differences between group means for each data item (e.g., an individual gene level in microarray data, or a volume element (voxel) in fMRI data); and (3) inference on the set of test statistics. Although challenging statistical issues are present at each of the three steps, this article focuses on the multiplicity issue encountered in Step 3.

Suppose there are $m$ univariate statistical tests of the significance of $m$ genes or voxels, where $m$ is large, and the Type I error for a single test (known as the per-comparison error) is controlled at $\alpha_{pc} = 0.05$. There can then be a substantial number of false positive findings. For example, an image with $m = 64 \times 64 \times 64 = 262{,}144$ voxels can have up to $262{,}144 \times 0.05 \approx 13{,}107$ false positives. The multiplicity issue with a large number of simultaneous univariate tests is the greatly elevated false discovery rate.

© American Statistical Association. Statistics in Biopharmaceutical Research, 2010, Vol. 2, No. 3. DOI: 10.1198/sbr.2009.08086

The traditional procedure for multiplicity adjustment is to control the probability of making one or more Type I errors among $m$ tests, known as the family-wise Type I error rate (FWER), at a prespecified significance level $\alpha_{fw}$. For $m$ independent tests, the relationship between the FWER and the per-comparison error rate is given by $\alpha_{fw} = 1 - (1 - \alpha_{pc})^m$. To control $\alpha_{fw}$ at 0.05, $\alpha_{pc}$ must be adjusted to a value much smaller than 0.05 as $m$ increases. For large $m$ (as is typically the case for microarray and fMRI data), the FWER approach becomes extremely conservative and has low sensitivity when the null hypothesis is false.
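As a small numerical illustration of the relationship between the FWER and the per-comparison error rate, the following Python sketch evaluates $\alpha_{fw} = 1 - (1 - \alpha_{pc})^m$ and its inverse. It assumes independent tests; the function names `fwer` and `per_comparison_level` are hypothetical and chosen only for this example.

```python
def fwer(alpha_pc: float, m: int) -> float:
    """FWER for m independent tests, each carried out at level alpha_pc."""
    return 1.0 - (1.0 - alpha_pc) ** m


def per_comparison_level(alpha_fw: float, m: int) -> float:
    """Per-test level that gives a FWER of alpha_fw over m independent tests."""
    return 1.0 - (1.0 - alpha_fw) ** (1.0 / m)


# The FWER grows quickly with m, and the per-test level needed to hold the
# FWER at 0.05 shrinks accordingly.
for m in (1, 10, 100, 1000):
    print(m, round(fwer(0.05, m), 4), per_comparison_level(0.05, m))
```

With 100 independent tests at $\alpha_{pc} = 0.05$, the chance of at least one false positive already exceeds 0.99, which is why the per-comparison level has to be reduced so sharply for large $m$.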

An alternative procedure for controlling the false discovery rate (FDR) was first proposed by Benjamini and Hochberg (BH) in their pioneering work (1995). Controlling the FDR at a prespecified level $q$ roughly corresponds to controlling the proportion of erroneous rejections among all rejected hypotheses at level $q$. The FDR criterion attains an appropriate balance between the liberal per-comparison error rate and the conservative FWER criteria. Since the publication of the BH FDR procedure in 1995, many other FDR procedures have been proposed, aimed at relaxing the requirement of independence among the $m$ univariate test statistics or improving the power.

Comparisons of some FWER and FDR procedures using simulated data and/or real microarray data have been reported in the literature (Dudoit et al. 2003; Qian and Huang 2005; Li et al. 2005; Benjamini et al. 2006). However, the comparisons are not comprehensive, because they include only a few procedures with a limited focus, use real data only without simulations, or are restricted to specific covariance structures or sample sizes. In addition, the simulation results are not consistent across publications for the same procedures.

The objectives of this article are: (a) to perform a more comprehensive comparison by including more procedures and using both real and simulated data; (b) to investigate the reasons for inconsistent results in the literature; and (c) to identify FDR procedures best suited to specific data settings.

The article is organized as follows. Section 2 gives basic notation for multiple testing and a short review of 5 FWER and 10 FDR procedures. In Section 3 we apply 14 multiple testing procedures (5 FWER, 9 FDR) to two real microarray datasets using estimated FDRs as evaluation criteria. We further compare the 5 FWER and 10 FDR procedures with simulated data in Section 4. In Section 5, we discuss the results and compare different approaches, with recommendations on selecting proper approaches for various scenarios. We also provide potential reasons for the inconsistency among published papers.

2. Notation and Procedures

2.1 Notation

We consider the two-sample problem and use microarray data to explain the notation. Assume that there are $m$ multiple tests corresponding to $m$ genes and a total of $n$ samples, with $n_1$ from untreated samples and $n_2$ from treated samples. Notation generally used in the multiple testing literature is as follows:

| Hypothesis | Accept | Reject | Total |
|---|---|---|---|
| Null true | $U$ | $V$ | $m_0$ |
| Null not true | $T$ | $S$ | $m_1$ |
| Total | $W$ | $R$ | $m$ |

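The family-wise error rate is defined as $\mathrm{FWER} = P(V \ge 1)$. The false discovery rate is defined as

$$\mathrm{FDR} = E(Q) = E\left[\frac{V}{R} \,\middle|\, R > 0\right] P(R > 0), \qquad Q = \begin{cases} V/R & \text{if } R > 0 \\ 0 & \text{if } R = 0 \end{cases}$$

and the positive false discovery rate as $\mathrm{pFDR} = E\left[\frac{V}{R} \,\middle|\, R > 0\right]$.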

For the $i = 1, \ldots, m$ hypothesis tests, define

$P_i$: the observed p-value from the $i$th hypothesis $H_i$.

$P_{(i)}$: the $i$th ordered observed p-value, from the smallest to the largest.

$\tilde{P}_i$, $\tilde{P}_{(i)}$: the $i$th unordered or ordered p-value after multiplicity adjustment.

$q$: the desired FDR level.

2.2 FWER Procedures to be Compared

Five FWER procedures (Table 1) are included as benchmarks for comparison with the FDR procedures. It is well recognized that FWER procedures tend to be overly conservative with low power and, consequently, are less favored than FDR procedures in settings where large-scale multiple testing is involved.

• Bonferroni procedure: $\tilde{P}_{(i)} = \min(m P_{(i)}, 1)$

• Bonferroni–Holm procedure: $\tilde{P}_{(i)} = \max(\tilde{P}_{(i-1)}, (m - i + 1) P_{(i)})$

• Hochberg procedure: $\tilde{P}_{(i)} = \min(\tilde{P}_{(i+1)}, (m - i + 1) P_{(i)})$

• Sidak procedure: $\tilde{P}_{(i)} = 1 - (1 - P_{(i)})^m$

• Sidak Step Down procedure: $\tilde{P}_{(i)} = \max(\tilde{P}_{(i-1)}, 1 - (1 - P_{(i)})^{m-i+1})$
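The five adjustments listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fwer_adjust` and the method labels are hypothetical, and the step-down values are capped at 1, a common convention that the formulas above do not state explicitly.

```python
def fwer_adjust(pvals, method):
    """Return FWER-adjusted p-values in the original order of `pvals`."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    p = [pvals[i] for i in order]  # ordered p-values P_(1) <= ... <= P_(m)

    if method == "bonferroni":
        adj = [min(m * x, 1.0) for x in p]
    elif method == "sidak":
        adj = [1.0 - (1.0 - x) ** m for x in p]
    elif method in ("holm", "sidak_sd"):
        # Step-down: running maximum from the smallest p-value upward.
        adj, running = [], 0.0
        for i, x in enumerate(p):
            k = m - i  # equals m - rank + 1 for rank = i + 1
            raw = k * x if method == "holm" else 1.0 - (1.0 - x) ** k
            running = max(running, min(raw, 1.0))  # cap at 1 (assumption)
            adj.append(running)
    elif method == "hochberg":
        # Step-up: running minimum from the largest p-value downward.
        adj, running = [0.0] * m, 1.0
        for i in range(m - 1, -1, -1):
            running = min(running, (m - i) * p[i])
            adj[i] = running
    else:
        raise ValueError(f"unknown method: {method}")

    out = [0.0] * m
    for rank, idx in enumerate(order):
        out[idx] = adj[rank]
    return out


pvals = [0.01, 0.04, 0.03, 0.005]
for name in ("bonferroni", "holm", "hochberg", "sidak", "sidak_sd"):
    print(name, fwer_adjust(pvals, name))
```

A hypothesis is rejected at FWER level $\alpha_{fw}$ when its adjusted p-value is at most $\alpha_{fw}$.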

Table 1. The description of multiple testing procedures

| Procedure abbreviation | Procedure description | Type I error | Dependence structure |
|---|---|---|---|
| Rawp | Unadjusted p-values | PCER | General |
| Bonferroni | Bonferroni | FWER | General |
| Holm | Step-down Bonferroni method of Holm (1979) | FWER | General |
| Hochberg | Hochberg's step-up Bonferroni method (1988) | FWER | Some dependence |
| Sidak | Sidak | FWER | Independence or positive orthant dependence |
| SidakSD | Step-down Sidak | FWER | Independence or positive orthant dependence |
| BH | Benjamini and Hochberg (1995) | FDR | Independence or positive regression dependence |
| BY | Benjamini and Yekutieli (2001) | FDR | General |
| DwnInd | Step-down independent (Benjamini and Liu 1999a) | FDR | Independence |
| DwnFree | Step-down distribution free (Benjamini and Liu 1999a) | FDR | General |
| Qvalue | Q-value (Storey 2002) | FDR | Independence or weak dependence |
| Adaptive1 | Adaptive linear step-up (Benjamini, Krieger, and Yekutieli 2006) | FDR | Independence |
| Adaptive2 | Two-stage adaptive linear step-up (Benjamini, Krieger, and Yekutieli 2006) | FDR | Independence or positive regression dependence if $E(m_0/\hat{m}_0) \le 1$ |
| Resamp.point | Resampling point estimate (Reiner, Yekutieli, and Benjamini 2003) | FDR | Some dependence |
| Resamp.upper | Resampling upper limit (Reiner, Yekutieli, and Benjamini 2003) | FDR | Some dependence |
| EBayes.ZT | Empirical Bayes (Zhang and Tang 2005) | FDR | Independence, and test statistics for the null hypotheses should be standard normal |

Note: $m_0$ is the number of true null hypotheses, and $\hat{m}_0$ is an estimate of $m_0$.

2.3 FDR Procedures to be Compared

Comparisons are carried out on two real microarray datasets using 9 FDR procedures. For the simulated data, an empirical Bayes and conditional FDR procedure (Zhang and Tang 2005) is also included. Table 1 gives short descriptions of the 10 FDR procedures; more detailed descriptions follow.

2.3.1 Linear Step-Up Benjamini and Hochberg (BH) FDR Procedure (1995)

This is a step-up procedure for strong control of the FDR. Assuming independent null test statistics, the BH procedure controls the FDR at $\mathrm{FDR} \le \frac{m_0}{m} q \le q$, with adjusted p-values $\tilde{P}_{(m)} = P_{(m)}$ and $\tilde{P}_{(i)} = \min\left(\frac{m}{i} P_{(i)},\ \tilde{P}_{(i+1)}\right)$ for $i = 1, 2, \ldots, m-1$. The decision rule is to reject all $H_{(j)}$, $j = 1, 2, \ldots, k$, where $k = \max\{i : P_{(i)} \le \frac{i}{m} q\}$.
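The BH adjusted p-values and the step-up decision rule can be sketched as follows. The function names `bh_adjust` and `bh_reject` and the example p-values are hypothetical and serve only as an illustration; the recursion runs from the largest p-value downward so that each $\tilde{P}_{(i+1)}$ is available when $\tilde{P}_{(i)}$ is computed.

```python
def bh_adjust(pvals):
    """BH adjusted p-values, returned in the original order of `pvals`."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):  # from the largest p-value downward
        idx = order[rank - 1]
        running = min(running, pvals[idx] * m / rank)
        adj[idx] = running
    return adj


def bh_reject(pvals, q):
    """Indices rejected by the rule k = max{i : P_(i) <= (i/m) q}."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * q / m:
            k = rank
    return sorted(order[:k])


pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(bh_reject(pvals, 0.05))
print(bh_adjust(pvals))
```

Rejecting every hypothesis whose adjusted p-value is at most $q$ gives the same set as the decision rule, which is why the two functions agree on the example.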

2.3.2 Benjamini and Yekutieli (BY) FDR Procedure (2001)

Expanding on the BH FDR method, BY make two contributions. They proved that BH controls the FDR under positive regression dependency, and they modified the BH procedure so that $\mathrm{FDR} \le \frac{m_0}{m} q \le q$ under general conditions of dependency, with a trade-off of lower power compared to the BH FDR procedure. The BY FDR adjusted p-values are $\tilde{P}_{(m)} = \min\left(P_{(m)} \sum_{l=1}^{m} \frac{1}{l},\ 1\right)$ and $\tilde{P}_{(i)} = \min\left(\frac{m}{i} P_{(i)} \sum_{l=1}^{m} \frac{1}{l},\ \tilde{P}_{(i+1)}\right)$ for $i = 1, 2, \ldots, m-1$.

The decision rule is to reject all $H_{(j)}$, $j = 1, 2, \ldots, k$, where $k = \max\left\{i : P_{(i)} \le \frac{i}{m} q \Big/ \sum_{l=1}^{m} \frac{1}{l}\right\}$.
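A minimal Python sketch of the BY adjustment follows. It differs from the BH adjustment only by the harmonic-sum factor $\sum_{l=1}^{m} 1/l$; the function name `by_adjust` and the example values are hypothetical.

```python
def by_adjust(pvals):
    """BY adjusted p-values, returned in the original order of `pvals`."""
    m = len(pvals)
    c_m = sum(1.0 / l for l in range(1, m + 1))  # harmonic sum
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    running = 1.0  # starting at 1 caps the largest adjusted value at 1
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running = min(running, pvals[idx] * c_m * m / rank)
        adj[idx] = running
    return adj


print(by_adjust([0.01, 0.02, 0.9]))
```

Because the harmonic sum grows roughly like $\log m$, the BY adjusted p-values for thousands of genes are several times larger than the BH ones, which is the power trade-off described above.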

2.3.3 Four Adaptive FDR Procedures

In the BH FDR procedure, $\mathrm{FDR} \le \frac{m_0}{m} q \le q$, where $m_0$ is the number of true null hypotheses, which is unknown. The motivation for adaptive FDR procedures is to improve the power of the BH FDR procedure by first estimating $m_0$ and then using $q' = \frac{qm}{\hat{m}_0}$ as the FDR level in the BH linear step-up procedure, so that the resulting FDR level will be equal to $q$ instead of less than $q$. We consider the following four adaptive FDR procedures, which differ in the way they estimate $m_0$.

Adaptive linear step-up procedure (Adaptive 1) (Benjamini, Krieger, and Yekutieli 2006)

1. Use the BH FDR procedure at level $q$, and stop if no hypotheses are rejected.

2. Estimate $\hat{m}_0(i)$ by $(m + 1 - i)/(1 - P_{(i)})$.

3. Start with $i = 2$ and stop at the first time $\hat{m}_0(i) > \hat{m}_0(i - 1)$.

4. Estimate $\hat{m}_0 = \min(\hat{m}_0(i), m)$, rounding up to the next integer.

5. Use the BH FDR procedure with $q' = \frac{qm}{\hat{m}_0}$.

The estimate of $\hat{m}_0(i)$ in Step 2 is based on a graphical approach proposed by Schweder and Spjøtvoll (1982). The validity of this estimate depends on the number of true null hypotheses (which should be sufficiently large) and the correlations among p-values (which should not be highly positively correlated).
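The five steps of the Adaptive 1 procedure can be sketched as below. The function names are hypothetical, and two details are assumptions of this sketch rather than statements from the text: a p-value equal to 1 is treated as giving an infinite $\hat{m}_0(i)$, and $\hat{m}_0 = m$ is used if the sequence $\hat{m}_0(i)$ never increases.

```python
import math


def bh_count(pvals, q):
    """Number of rejections of the BH step-up procedure at level q."""
    ranked = sorted(pvals)
    m = len(ranked)
    k = 0
    for i, p in enumerate(ranked, start=1):
        if p <= i * q / m:
            k = i
    return k


def estimate_m0(pvals):
    """Steps 2 to 4: first increase of (m + 1 - i) / (1 - P_(i)), from i = 2."""
    ranked = sorted(pvals)
    m = len(ranked)

    def m0_at(i):  # i is the 1-based rank
        p = ranked[i - 1]
        return math.inf if p >= 1.0 else (m + 1 - i) / (1.0 - p)

    prev = m0_at(1)
    for i in range(2, m + 1):
        cur = m0_at(i)
        if cur > prev:
            return math.ceil(min(cur, m))
        prev = cur
    return m  # assumption: fall back to m if no increase is found


def adaptive1_count(pvals, q):
    """Steps 1 and 5: BH at level q, then BH again at level q * m / m0_hat."""
    if bh_count(pvals, q) == 0:
        return 0
    m0_hat = estimate_m0(pvals)
    return bh_count(pvals, q * len(pvals) / m0_hat)


pvals = [0.001, 0.002, 0.003, 0.004, 0.027, 0.4, 0.5, 0.6, 0.8, 0.9]
print(bh_count(pvals, 0.05), estimate_m0(pvals), adaptive1_count(pvals, 0.05))
```

In this toy example the plain BH procedure rejects four hypotheses, while the adaptive level $q' = qm/\hat{m}_0$ is slightly larger than $q$ and admits a fifth.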

Two-stage adaptive linear step-up procedure (Adaptive 2) (Benjamini, Krieger, and Yekutieli 2006).

Stage 1: Use the BH linear step-up FDR procedure at level $q' = \frac{q}{1+q}$, and let $r_1$ be the resulting number of rejections.

Stage 2: Estimate $\hat{m}_0 = m - r_1$, and use the BH FDR procedure again with $q^* = \frac{q' m}{\hat{m}_0}$.

The choice of level $q' = \frac{q}{1+q}$ in Stage 1 is motivated by $m_0 \le m - (R - V)$. Since $V$ is less than $q m_0 R / m$ with the BH FDR procedure, it can be shown that $m_0 \le (m - R)(1 + q)$ with some algebra and approximations. Consequently

$$q' = \frac{qm}{m_0} \ge \frac{qm}{(m - R)(1 + q)} \ge \frac{q}{1 + q}.$$
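The two stages can be sketched in Python as follows. The function names are hypothetical, and the guard for the case in which Stage 1 rejects every hypothesis is an assumption added here only to avoid dividing by $\hat{m}_0 = 0$.

```python
def bh_count(pvals, q):
    """Number of rejections of the BH step-up procedure at level q."""
    ranked = sorted(pvals)
    m = len(ranked)
    k = 0
    for i, p in enumerate(ranked, start=1):
        if p <= i * q / m:
            k = i
    return k


def two_stage(pvals, q):
    """Return (number of rejections, m0_hat) for the two-stage procedure."""
    m = len(pvals)
    q1 = q / (1.0 + q)            # Stage 1 level q'
    r1 = bh_count(pvals, q1)
    if r1 == m:                   # assumption: everything rejected, stop here
        return m, 0
    m0_hat = m - r1               # Stage 2 estimate of m0
    q_star = q1 * m / m0_hat      # Stage 2 level q*
    return bh_count(pvals, q_star), m0_hat


pvals = [0.001, 0.002, 0.003, 0.004, 0.027, 0.4, 0.5, 0.6, 0.8, 0.9]
print(two_stage(pvals, 0.05))
```

If Stage 1 rejects nothing, then $\hat{m}_0 = m$ and $q^* = q'$, so Stage 2 also rejects nothing and no separate check is needed for that case.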

Q-value (Storey and Tibshirani 2003)

1. Estimate $m_0$, or equivalently $\pi_0 \equiv \frac{m_0}{m}$, by $\hat{\pi}_0 = \frac{\#\{P_i > \lambda,\ i = 1, \ldots, m\}}{m(1 - \lambda)}$.

2. Use the linear step-up procedure with $q^* = qm/\hat{m}_0$.

Here $\lambda$ is a tuning parameter between 0 and 1. There are two ways to set this tuning parameter. One way is to plot the density histogram of all $m$ observed p-values and find $\lambda$ on the x-axis such that the density histogram from this point forward is approximately flat. The rationale is that the p-values from true null hypotheses are uniform(0,1), so the density of the observed p-values from true null hypotheses should be flat, and the proportion of true null hypotheses $\pi_0$ corresponds to the height of the flat portion of the density histogram. Another way is to use a more automated algorithm: calculate $\hat{\pi}_0(\lambda)$ over a range of discrete values of $\lambda$, plot $\lambda$ versus $\hat{\pi}_0(\lambda)$, apply a cubic spline fit to this plot, and finally estimate $\pi_0$ as the value of the fitted curve at $\lambda = 1$.
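For a fixed value of the tuning parameter, the estimate of $\pi_0$ in Step 1 reduces to a simple count. The sketch below covers only this fixed-$\lambda$ version and not the spline-based choice; the function name `pi0_estimate`, the default $\lambda = 0.5$, and the cap of the estimate at 1 are assumptions made for illustration.

```python
def pi0_estimate(pvals, lam=0.5):
    """Estimate pi0 as #{P_i > lam} / (m * (1 - lam)) for a fixed lam."""
    m = len(pvals)
    above = sum(1 for p in pvals if p > lam)
    return min(above / (m * (1.0 - lam)), 1.0)  # cap at 1 (assumption)


pvals = [0.001, 0.002, 0.003, 0.004, 0.027, 0.4, 0.5, 0.6, 0.8, 0.9]
pi0_hat = pi0_estimate(pvals, lam=0.5)
q = 0.05
q_star = q / pi0_hat  # same as q * m / m0_hat, since m0_hat = pi0_hat * m
print(pi0_hat, q_star)
```

The resulting $q^*$ is then used as the level of the linear step-up procedure, as in Step 2.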

Empirical Bayes and Conditional FDR (Zhang and Tang 2005).

Zhang and Tang proposed a Bayes formulation for maximizing the statistical discovery rate $R$ subject to a preassigned level of the false discovery proportion $\frac{V}{R}$, conditional on the test statistics $X$. Let $(X, \theta), (X_i, \theta_i)$, $i \le m$, be iid vectors, where the $X_i$ are test statistics for $H_i$ and $1 - \theta_i = I\{\text{true } H_i\}$ are indicators for the true null hypotheses. Suppose $X \mid (\theta = j) \sim f_j$, $j = 0, 1$, and $P\{\theta = 0\} = \pi_0$ (this is $w_0$ in their notation). Then $X$ has the mixture density $f = w_0 f_0 + (1 - w_0) f_1$.

The interest is in estimating $P(\theta = 0 \mid X = x)$, the posterior probability of a null hypothesis conditional on the test statistic. According to Bayes' theorem, this is $P(\theta = 0 \mid X = x) = \pi_0 f_0(x) / f(x)$. Because the elements on the right-hand side are unknown, empirical Bayes (EB) methods are used. Zhang and Tang discussed EB methods with different ways of estimating this posterior probability, including a newly proposed consistent estimator of $\pi_0$ based on a Fourier method. Large-sample theory is also provided. The proposed estimator

has the form:

$$\hat{\pi}_0 = \frac{1}{m} \sum_{i=1}^{m} \psi(Z_i; h_m), \qquad \psi(z, h) = \int h\, \psi_0(ht)\, e^{t^2/2} \cos(zt)\, dt,$$

where $\psi_0$ is a density function with support on $[-1, 1]$, and $h_m = \{k (\log m)\}^{-1/2}$ is the bandwidth with $k \le 1$.

The posterior probability can be linked to the conditional FDR, and when the condition is restricted to a subset of the $X$ space, it is called the local FDR (Efron 2001).

Some requirements for good performance of the EB FDR are independence of the $X_i$ and large samples to ensure the consistency property.
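To make the Bayes formula $P(\theta = 0 \mid X = x) = \pi_0 f_0(x) / f(x)$ concrete, the following sketch evaluates it for a two-component normal mixture. This is not the Zhang and Tang estimator: it assumes that $\pi_0$ and the alternative density $f_1$ are known, with $f_0$ standard normal and $f_1$ normal with mean `mu1` and unit variance, whereas the empirical Bayes methods estimate these quantities from the data. All names are hypothetical.

```python
import math


def normal_pdf(x, mu=0.0, sd=1.0):
    """Density of a normal distribution with mean mu and standard deviation sd."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_null(x, pi0, mu1):
    """P(theta = 0 | X = x) under f = pi0 * f0 + (1 - pi0) * f1."""
    f0 = normal_pdf(x)             # null density (standard normal)
    f1 = normal_pdf(x, mu=mu1)     # alternative density (assumed known here)
    f = pi0 * f0 + (1.0 - pi0) * f1
    return pi0 * f0 / f


for x in (0.0, 1.5, 3.0, 4.0):
    print(x, round(posterior_null(x, pi0=0.8, mu1=3.0), 4))
```

The posterior null probability decreases as the test statistic moves toward the alternative mean, so thresholding it yields a rejection region for large values of $x$.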

2.3.4 Two Resampling-Based FDR Procedures (Yekutieli and Benjamini 1999)

Often the distribution of null p-values is unknown. Resampling techniques, such as shuffling samples from pooled groups, can be used to estimate raw and adjusted p-values while preserving the correlation structure of genes and avoiding parametric assumptions. For data containing high intercorrelations, the resampling-based multiple testing procedures use the dependency structure of the data to construct more powerful FDR controlling procedures.

The data are repeatedly resampled under the complete null to get a vector of resampling-based p-values. For each $p$, denote the number of resampling-based p-values less than $p$ by $R^*(p)$, and denote the estimated number of false null hypotheses with p-values less than $p$ by $\hat{S}(p)$. Then

$$\mathrm{FDR}(p) = E\left(\frac{R^*(p)}{R^*(p) + \hat{S}(p)}\right).$$

$\hat{S}(p)$ can be estimated in two ways:

• Resampling point estimate: $\hat{S}(p) = r(p) - mp$, where $r(p)$ is the number of discoveries.

• Resampling upper estimate: $\hat{S}(p) = r(p) - r^*_\beta(p)$, where $r^*_\beta(p)$ is the $1 - \beta$ quantile of $R^*(p)$.

In multiple testing, the general condition for the resampling-based methods to be valid is that the simulated joint distribution of p-values corresponding to the true null hypotheses, which is generated through the p-value resampling scheme, represents the real joint distribution of p-values under the null hypotheses. More specific conditions for resampling permutation tests to be valid have been discussed by Pollard and van der Laan (2004); Dudoit et al. (2003); Huang et al. (2006); and Xu and Hsu (2007). For resampling-based FDR procedures, in order to derive the theoretical FDR properties, one additional condition is that the p-values from the null hypotheses and the p-values from the alternative hypotheses are independent.

2.3.5 Two Step-Down FDR Procedures

Benjamini and Liu (1999a and 1999b) proposed two step-down FDR procedures. These procedures have good power when the number of multiple tests is small and most of the false hypotheses are far from being true.

Step-down independent FDR procedure. This method is motivated by the Sidak step-down FWER procedure described in Section 2.2. The procedure requires the test statistics to be independent.

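1. $\tilde{P}_{(1)} = 1 - (1 - P_{(1)})^m$,

2. $\tilde{P}_{(i)} = \max\left(\left[1 - (1 - P_{(i)})^{m-i+1}\right] \times \frac{m-i+1}{m},\ \tilde{P}_{(i-1)}\right)$, for $i = 2, 3, \ldots, m$.

3. Reject all $H_{(j)}$, $j = 1, 2, \ldots, k - 1$, where $k = \min\{i : P_{(i)} > c_i\}$, $c_i = 1 - \left[1 - \min\left(\frac{m}{m-i+1} q,\ 1\right)\right]^{\frac{1}{m-i+1}}$, and $q$ is the desired FDR level.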

Step-down distribution-free FDR procedure. This is an FDR procedure that is distribution free but has increased power compared to the other distribution-free FDR procedure, BY FDR (described in Section 2.3.2), when the number of hypotheses $m$ is small and most of the false hypotheses are far from being true. The adjusted p-values are:

1. $\tilde{P}_{(1)} = \min(m P_{(1)}, 1)$;

2. $\tilde{P}_{(i)} = \min\left(\max\left(\frac{(m-i+1)^2}{m} P_{(i)},\ \tilde{P}_{(i-1)}\right),\ 1\right)$, for $i = 2, 3, \ldots, m$;

3. Reject all $H_{(j)}$, $j = 1, 2, \ldots, k - 1$, where $k = \min\{i : P_{(i)} > c_i\}$; and

4. $c_i = \min\left(\frac{m}{(m-i+1)^2} q,\ 1\right)$, and $q$ is the desired FDR level.
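The adjusted p-values of the two step-down FDR procedures can be sketched together, since both take a running maximum from the smallest p-value upward. The function name `stepdown_fdr_adjust` and its `independent` flag are hypothetical. Rejecting the hypotheses whose adjusted p-value is at most $q$ reproduces the step-down rule with the critical values $c_i$, because the running maximum stays at or below $q$ only until the first $P_{(i)}$ exceeds $c_i$.

```python
def stepdown_fdr_adjust(pvals, independent):
    """Adjusted p-values for the step-down independent (independent=True)
    or step-down distribution-free (independent=False) FDR procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    running = 0.0
    for rank, idx in enumerate(order, start=1):
        k = m - rank + 1
        p = pvals[idx]
        if independent:
            raw = (1.0 - (1.0 - p) ** k) * k / m
        else:
            raw = min(k * k * p / m, 1.0)
        running = max(running, raw)  # step-down monotonicity
        adj[idx] = running
    return adj


pvals = [0.001, 0.5, 0.02, 0.8]
print(stepdown_fdr_adjust(pvals, independent=True))
print(stepdown_fdr_adjust(pvals, independent=False))
```

For the smallest p-value the two adjustments reduce to the Sidak and Bonferroni adjustments, respectively, which is consistent with these procedures behaving like FWER procedures when $m$ is large.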

The two step-down FDR procedures are applicable to situations where the number of hypotheses $m$ is small and most of the false hypotheses are far from being true. Both requirements are questionable for microarray data in practice.

3. Comparisons Using Two Real Microarray Datasets

3.1 Leukemia Study

The first microarray dataset is from a leukemia study (Golub et al. 1999). There are 38 tumor samples from two leukemia types: 27 acute lymphoblastic leukemia (ALL) samples and 11 acute myeloid leukemia (AML) samples.

Table 2. Leukemia study: Number of differentially expressed genes at cutoff levels 0.01 and 0.05; the total number of genes is 3051

| Method | Cutoff 0.01 | Cutoff 0.05 | $\widehat{\mathrm{FDR}}$ (%) at cutoff = 0.05 |
|---|---|---|---|
| Rawp | 661 | 1045 | 13.880 |
| Bonferroni | 70 | 98 | 0.075 |
| Holm | 71 | 98 | 0.075 |
| Hochberg | 71 | 98 | 0.075 |
| Sidak | 70 | 98 | 0.075 |
| SidakSD | 71 | 98 | 0.075 |
| BH | 367 | 681 | 4.698 |
| BY | 146 | 269 | 0.599 |
| DwnInd | 71 | 98 | 0.075 |
| DwnFree | 71 | 98 | 0.075 |
| Adaptive1 | 409 | 751 | 6.289 |
| Adaptive2 | 389 | 737 | 6.009 |
| Q-value | 492 | 876 | 9.351 |
| Resamp.point | 382 | 696 | NA |
| Resamp.upper | 361 | 663 | NA |

Note: $\widehat{\mathrm{FDR}}$ is the estimate of FDR based on 1000 permutations when the targeted FDR = 0.05.

The study aimed at identifying genes that were differentially expressed between ALL and AML. The original data contained 6817 genes. The data were preprocessed by Dudoit [see Golub et al. (1999) and Dudoit et al. (2002) for details] in three steps: (i) thresholding with a floor of 100 and a ceiling of 16,000; (ii) filtering by exclusion of genes with ratio max/min ≤ 5 and difference (max − min) ≤ 500, where max and min refer respectively to the maximum and minimum expression levels of a particular gene across mRNA samples; and (iii) log10 transforming. This resulted in 3051 genes for which data were available to use for comparing procedures. The data are available in the multtest package from Bioconductor (http://www.bioconductor.org/biocLite.R).

Two-sample t-tests were used to identify differentially expressed genes between ALL and AML. The unadjusted p-values and the p-values adjusted using the various FWER or FDR methods were calculated. The numbers of differentially expressed genes using p-value cutoffs of 0.01 or 0.05 are listed in Table 2.

We also estimated the actual FDR at a cutoff of 0.05 by permuting the sample labels, computing the t-statistics based on the permuted data, and applying the various FWER or FDR methods to get the adjusted p-values. For each multiple testing method, the average number of genes with adjusted p-values less than 0.05 over 1000 permuted datasets was the estimated number of falsely identified genes. The actual FDR was this average divided by the number of significant genes identified in the original data. See Table 2 for the actual FDRs for each method.

As shown, the FWER methods were all very conservative, with estimated actual FDRs of 0.075%. The two step-down FDR methods were close to the FWER methods. The FDR BY method was also quite conservative, with an estimated actual FDR of 0.599%. The FDR BH method performed well, with an estimated actual FDR slightly less than 5%. The two adaptive methods had higher power than the FDR BH method, but the estimated actual FDR was a little greater than 5%. The Q-value procedure had the highest power, but it was liberal, with an estimated actual FDR of 9.35%.

3.2 Breast Cancer Study

The second microarray dataset is from a breast cancer study (West et al. 2001; Huang et al. 2003) investigating the association between the presence of lymph node metastases, cancer recurrence, and gene expression data. We used a subset of patients with one to three positive lymph nodes and studied the recurrence three years after primary surgery. The dataset provided expression profiles for 52 cases in this lymph node category (34 nonrecurrent, 18 recurrent). We wanted to identify the differentially expressed genes between recurrent and nonrecurrent patients. The gene expression data were normalized (regression of the gene profile of each sample versus the median gene profile of the 52 samples) and log2 transformed. Genes with absent calls across all the samples were removed, leaving 9397 genes. This dataset can be downloaded from the following website: http://data.genome.duke.edu/lancet.php.

Two-sample t-tests were used to identify differentially expressed genes between recurrent and nonrecurrent patients. The unadjusted p-values and the adjusted p-values using different FWER or FDR methods were calculated. The numbers of differentially expressed genes and the actual FDR estimates are listed in Table 3.

Table 3. Breast cancer data: Number of differentially expressed genes at cutoff levels 0.01 and 0.05; the total number of genes is 9397

| Method | Cutoff 0.01 | Cutoff 0.05 | $\widehat{\mathrm{FDR}}$ (%) at cutoff = 0.05 |
|---|---|---|---|
| Rawp | 859 | 1912 | 23.950 |
| Bonferroni | 2 | 10 | 0.35 |
| Holm | 2 | 10 | 0.35 |
| Hochberg | 2 | 10 | 0.35 |
| Sidak | 2 | 11 | 0.364 |
| SidakSD | 2 | 11 | 0.364 |
| BH | 58 | 368 | 4.538 |
| BY | 0 | 15 | 0.400 |
| DwnInd | 2 | 11 | 0.364 |
| DwnFree | 2 | 10 | 0.350 |
| Adaptive1 | 61 | 396 | 4.896 |
| Adaptive2 | 59 | 378 | 4.703 |
| Q-value | 122 | 653 | 7.573 |
| Resamp.point | 88 | 408 | NA |
| Resamp.upper | 61 | 378 | NA |

Note: $\widehat{\mathrm{FDR}}$ is the estimate of FDR based on 1000 permutations when the targeted FDR = 0.05.

Results were similar to those for the leukemia data. The FWER methods were all very conservative, with actual FDRs around 0.35%. The two step-down FDR methods and the FDR BY method were also quite conservative. The FDR BH method performed well, with an estimated actual FDR slightly lower than 5%. The two adaptive methods also controlled the FDR at 5%, with higher power. The Q-value procedure had the highest power, but it could not control the FDR at the specified level, with an estimated actual FDR of 7.57%. These results were consistent with the findings in Qian and Huang (2005), but fewer methods were compared in their paper, and no simulations were studied.

4. Simulations

Simulation studies were also used to evaluate the performance of selected multiple testing procedures, including FWER methods and FDR methods, under various conditions. The methods chosen for comparison are listed in Table 1. The criteria for performance evaluation were the achieved false discovery rate (FDR) and power.

4.1 Simulation Settings

For each simulation, we constructed artificial gene expression data matrices with $m = 1000$ genes (rows) and $n/2$ samples (columns) for each of two groups: treatment and control. The gene expression profiles for each sample were generated from a multivariate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$. For each simulation, we varied the number of samples (5, 25, 100, 500), the number of significant genes (50, 200, 500), the mean vector $\mu = \mu_1$ for the treatment group, and the covariance structure $\Sigma$ (correlated or independent). The mean vector $\mu = \mu_2$ for the control group was set to 0 for all the simulations. The mean vector $\mu_1$ for the treated group varied to include scenarios with different proportions of differentially expressed genes, different directions (up or down) of regulation, and different intensity levels. Details of the 18 different simulation parameter settings can be found in Table 4. There were 500 runs for each simulation setting.
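For the independent settings only, the data generation described above can be sketched with the Python standard library. This is an illustration, not the authors' simulation code: the function name `simulate_independent`, the seed, and the use of `random.Random.gauss` are choices made for this sketch, and the parameter values correspond to simulation 2 of Table 4 (200 significant genes with mean 1, 25 samples per group, sd = 0.5). Correlated settings would require sampling from a multivariate normal distribution with the chosen covariance matrix, which is not shown.

```python
import random


def simulate_independent(mu, n_per_group, sd, rng):
    """Generate treated and control expression matrices (genes x samples)
    with independent normal entries; the control means are all 0."""
    treated = [[rng.gauss(mu_g, sd) for _ in range(n_per_group)] for mu_g in mu]
    control = [[rng.gauss(0.0, sd) for _ in range(n_per_group)] for _ in mu]
    return treated, control


# Mean vector c(rep(1,200), rep(0,800)) from Table 4, written as a Python list.
mu1 = [1.0] * 200 + [0.0] * 800

rng = random.Random(2010)  # arbitrary seed, for reproducibility only
treated, control = simulate_independent(mu1, n_per_group=25, sd=0.5, rng=rng)
print(len(treated), len(treated[0]), len(control), len(control[0]))
```

Each row then feeds one two-sample t-test, giving the $m = 1000$ p-values to which the multiple testing procedures are applied.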

For the covariance structure $\Sigma$, since some multiple testing procedures require that the test statistics be independent while others can handle correlated data, we generated independent data, highly correlated data, and data with correlation structures close to those of microarray data. For the latter, we used the sample covariance structures from the two microarray studies discussed in the previous section. We randomly chose the gene expression profiles of 1000 genes from the preprocessed data and used their covariance structures in our simulation studies.

Similar simulation scenarios for mean vectors and covariance structures can be found in Dudoit et al. (2003). Their paper focused on comparing a number of FWER multiple testing procedures, some PCER (per-comparison error rate) procedures, and two FDR procedures, the BH FDR and BY FDR methods. In this article, we focus on comparing a variety of FDR multiple testing procedures, including BH, BY, step-down, Q-value, adaptive, resampling, and empirical Bayes methods (see Table 1).

In addition to the 18 simulations in Table 4, we also performed simulations under unequal covariance structures and unequal sample sizes across the two groups.

Table 4. Simulation setup

| Sim index | Number of significant genes $m_1$ | Sample size per group | Covariance structure | Mean vector for Group 1 |
|---|---|---|---|---|
| 1 | 200 | 5 | indep, sd = 0.5 | c(rep(1,200), rep(0,800)) |
| 2 | 200 | 25 | indep, sd = 0.5 | c(rep(1,200), rep(0,800)) |
| 3 | 200 | 100 | indep, sd = 0.5 | c(rep(1,200), rep(0,800)) |
| 4 | 200 | 500 | indep, sd = 5 | c(rep(1,200), rep(0,800)) |
| 5 | 200 | 5 | Golub | c(2*(1:100)/100, -2*(1:100)/100, rep(0,800)) |
| 6 | 200 | 25 | Golub | c(2*(1:100)/100, -2*(1:100)/100, rep(0,800)) |
| 7 | 500 | 25 | Golub | c(2*(1:250)/250, -2*(1:250)/250, rep(0,500)) |
| 8 | 200 | 5 | West | c(2*(1:100)/100, -2*(1:100)/100, rep(0,800)) |
| 9 | 200 | 25 | West | c(2*(1:100)/100, -2*(1:100)/100, rep(0,800)) |
| 10 | 50 | 5 | West | c(2*(1:25)/25, -2*(1:25)/25, rep(0,950)) |
| 11 | 50 | 25 | West | c(2*(1:25)/25, -2*(1:25)/25, rep(0,950)) |
| 12 | 500 | 25 | West | c(2*(1:250)/250, -2*(1:250)/250, rep(0,500)) |
| 13 | 200 | 25 | ρ = ±0.8 for block of genes, see Table 5a, sd = 2 | c(2*(1:100)/100, -2*(1:100)/100, rep(0,800)) |
| 14 | 200 | 25 | ρ = ±0.8 for block of genes, see Table 5a, sd = 1 | c(2*(1:100)/100, -2*(1:100)/100, rep(0,800)) |
| 15 | 200 | 25 | ρ = ±0.8 for block of genes, see Table 5b, sd = 2 | c(2*(1:100)/100, -2*(1:100)/100, rep(0,800)) |
| 16 | 200 | 25 | ρ = ±0.8 for block of genes, see Table 5b, sd = 1 | c(2*(1:100)/100, -2*(1:100)/100, rep(0,800)) |
| 17 | 200 | 25 | ρ = 0.8 for all genes, sd = 2 | c(2*(1:100)/100, -2*(1:100)/100, rep(0,800)) |
| 18 | 200 | 25 | ρ = 0.8 for all genes, sd = 1 | c(2*(1:100)/100, -2*(1:100)/100, rep(0,800)) |

Note: The total number of genes is 1000, and the mean vector for Group 2 is 0 for all the simulations.

4.2 Calculation of FDR and Power

For each of the 18 simulated datasets in Table 4, two-sample t-statistics with equal variances were used to compare the expression levels in the two treatment groups for each gene. The multiple testing procedures described in Table 1 were then carried out to determine which genes were differentially expressed. The significance level was set to 0.05.

For each multiple testing procedure, two statistics were calculated: the actual FDR, defined as the proportion of false rejections among all rejections, and power, defined as the proportion of truly differentially expressed genes that were identified. The objectives were to see whether the procedure controlled the FDR, and to compare the efficiency of the procedures based on power.

The steps for computing FDR and power were as follows:

1. For each multiple testing procedure in the bth sim-ulation (b = 1, . . . , B), count the number of genes iden-tified as significant (Rb), the number of falsely identi-fied genes (Vb), and the number of truly differentially ex-pressed genes identified (Sb), and compute the FDR as

Qb ={

Vb/Rb if Rb > 00 if Rb = 0

.

2. After the Bth simulation, calculate the actual FDRand power as

FDR =∑B

b=1 QbB , Power =

∑Bb=1(Sb/m1)

B ,

where m1 is the number of truly differentially expressedgenes. Both the actual FDR and power were averagedover B = 500 repeated simulations.
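The two steps above can be sketched in a few lines. This is an illustrative sketch only: the function name `fdr_and_power` and the representation of each simulation's rejections as a set of gene indices are assumptions made for the example.

```python
def fdr_and_power(rejections, true_de):
    """Average FDR and power over B simulation runs.

    rejections: list with one set of rejected gene indices per simulation.
    true_de:    set of truly differentially expressed gene indices (size m1).
    """
    m1 = len(true_de)
    B = len(rejections)
    q_sum = 0.0
    s_sum = 0.0
    for rejected in rejections:
        R = len(rejected)               # genes declared significant
        S = len(rejected & true_de)     # correctly identified genes
        V = R - S                       # falsely identified genes
        q_sum += V / R if R > 0 else 0.0
        s_sum += S / m1
    return q_sum / B, s_sum / B


# Toy usage with B = 3 runs and m1 = 4 truly differentially expressed genes.
truth = {0, 1, 2, 3}
runs = [{0, 1, 9}, set(), {0, 1, 2, 3}]
fdr, power = fdr_and_power(runs, truth)
```

A run with no rejections contributes 0 to both sums, which matches the convention $Q_b = 0$ when $R_b = 0$.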

4.3 Simulation Results

The results of simulations 1–18 are shown in Tables 6a–6e, where actual FDRs and powers for the various multiple testing procedures are compared.



Table 5a. Correlation structures for simulations 13–14

 0.8  −0.8
−0.8   0.8

 0.8   0.8  −0.8  −0.8
 0.8   0.8  −0.8  −0.8
−0.8  −0.8   0.8   0.8
−0.8  −0.8   0.8   0.8

 0.8   0.8  −0.8  −0.8
 0.8   0.8  −0.8  −0.8
−0.8  −0.8   0.8   0.8
−0.8  −0.8   0.8   0.8

Table 5b. Correlation structures for simulations 15–16

 0.8  −0.8
−0.8   0.8

 0.8   0.8  −0.8  −0.8
 0.8   0.8  −0.8  −0.8
−0.8  −0.8   0.8   0.8
−0.8  −0.8   0.8   0.8

Note: Each cell represents a block of 100 by 100 genes in Table 5a and Table 5b.

All FWER approaches were too conservative in controlling FDRs, as seen in Tables 6a–6e, and are not recommended for microarray data. Across all the simulation studies, the step-down FDR procedures and the BY FDR procedure were conservative and had low power, which could be close to that of the FWER procedures. In the following sections, we focus on comparing the other FDR procedures.

4.3.1 Condition 1: Independent Test Statistics (Simulations 1–4)

All the FDR multiple testing procedures closely controlled FDR at the specified level of 0.05 except the empirical Bayes method (Zhang and Tang 2005), which exceeded that level for small samples, and the step-down FDR methods and the BY FDR method, which were conservative. Q-value was the most powerful method, followed by the two step-up adaptive methods, and their actual FDRs were very close to 0.05. The BH method and the resampling methods were a little less powerful. For small sample sizes (simulations 1–2), the empirical Bayes method could not control FDR at the specified level because the test statistics follow a t-distribution, whereas this method requires that the test statistics under the null be normal. However, for large sample sizes (simulations 3–4), the empirical Bayes method performed quite well. Though not tabulated here, simulations were also done varying the effect size, and the conclusions were the same. In summary, for independent test statistics, the order of the FDR methods based on power was: Q-value ≥ adaptive methods > FDR BH, resampling methods > FDR BY, FDR step-down methods.

4.3.2 Condition 2: Correlated Test Statistics with Correlation Structures Close to Real Microarray Data (Simulations 5–12)

Simulations 5–7 used the correlation structure from the Golub data, and simulations 8–12 used the correlation structure from the breast cancer data. As expected, the empirical Bayes method did not control FDR, since it was designed for large samples and independent data. The resampling-based methods, the BH method, and the adaptive methods controlled FDR at the specified level. The Q-value method tended to be a little liberal and did not consistently control FDR at the specified level. The powers of the resampling methods and the BH method were quite close. The Q-value method, followed by the adaptive methods, was more powerful, especially if the proportion of significant genes was large (see simulations 7 and 12). For correlated test statistics with correlation structures close to real microarray data, the order of the FDR methods based on power was: Q-value > adaptive methods > FDR BH, resampling methods > FDR BY, FDR step-down methods. Note that Q-value may not control FDR at the specified level.

According to the simulations in Li et al. (2005), all



Table 6a. Simulation results (independent data)

Simulation    1               2               3               4
Method        FDR     Power   FDR     Power   FDR     Power   FDR     Power
Rawp          0.2015  0.7905  0.1653  1.0000  0.1675  1.0000  0.1832  0.8848
Bonferroni    0.0090  0.0145  0.0002  0.9917  0.0002  1.0000  0.0012  0.1823
Holm          0.0090  0.0145  0.0002  0.9933  0.0003  1.0000  0.0012  0.1848
Hochberg      0.0090  0.0145  0.0002  0.9933  0.0003  1.0000  0.0012  0.1848
Sidak         0.0088  0.0148  0.0002  0.9919  0.0002  1.0000  0.0012  0.1839
SidakSD       0.0088  0.0148  0.0003  0.9935  0.0003  1.0000  0.0012  0.1864
BH            0.0403  0.2475  0.0393  1.0000  0.0401  1.0000  0.0394  0.6783
BY            0.0025  0.0055  0.0055  0.9998  0.0053  1.0000  0.0056  0.3715
DwnInd        0.0088  0.0149  0.0003  0.9947  0.0004  1.0000  0.0012  0.1888
DwnFree       0.0110  0.0148  0.0003  0.9945  0.0003  1.0000  0.0012  0.1871
Q-value       0.0508  0.3230  0.0498  1.0000  0.0507  1.0000  0.0496  0.7132
Adaptive1     0.0452  0.2862  0.0490  1.0000  0.0502  1.0000  0.0463  0.7032
Adaptive2     0.0420  0.2614  0.0495  1.0000  0.0507  1.0000  0.0456  0.7012
Resamp.point  0.0384  0.2331  0.0395  1.0000  0.0404  1.0000  0.0395  0.6780
Resamp.upper  0.0266  0.1453  0.0383  1.0000  0.0391  1.0000  0.0383  0.6732
EBayes.ZT     0.1673  0.7339  0.0629  1.0000  0.0537  1.0000  0.0511  0.7177

Table 6b. Simulation results (covariance structure based on the Golub data)

Simulation    5               6               7
Method        FDR     Power   FDR     Power   FDR     Power
Rawp          0.2168  0.6564  0.1786  0.8556  0.0545  0.8526
Bonferroni    0.0022  0.1063  0.0004  0.7013  0.0001  0.6750
Holm          0.0023  0.1074  0.0004  0.7045  0.0001  0.6832
Hochberg      0.0023  0.1074  0.0004  0.7045  0.0001  0.6832
Sidak         0.0023  0.1077  0.0004  0.7018  0.0001  0.6754
SidakSD       0.0024  0.1085  0.0004  0.7051  0.0001  0.6837
BH            0.0404  0.4332  0.0415  0.8075  0.0249  0.8244
BY            0.0054  0.2033  0.0061  0.7620  0.0030  0.7639
DwnInd        0.0023  0.1096  0.0005  0.7086  0.0001  0.6928
DwnFree       0.0024  0.1084  0.0005  0.7080  0.0001  0.6920
Q-value       0.0566  0.4605  0.0595  0.8141  0.0517  0.8467
Adaptive1     0.0458  0.4473  0.0498  0.8120  0.0426  0.8423
Adaptive2     0.0449  0.4450  0.0500  0.8123  0.0437  0.8433
Resamp.point  0.0423  0.4358  0.0455  0.8097  0.0266  0.8267
Resamp.upper  0.0377  0.4221  0.0403  0.8064  0.0241  0.8230
EBayes.ZT     0.1711  0.6171  0.0770  0.8218  0.0598  0.8527



Table 6c. Simulation results (covariance structure based on the breast cancer data)

Simulation    8               9               10              11              12
Method        FDR     Power   FDR     Power   FDR     Power   FDR     Power   FDR     Power
Rawp          0.2597  0.5049  0.1931  0.7901  0.6307  0.4970  0.4905  0.8160  0.0578  0.7759
Bonferroni    0.0068  0.0304  0.0003  0.5411  0.0192  0.0438  0.0013  0.5358  0.0001  0.5139
Holm          0.0070  0.0305  0.0004  0.5448  0.0189  0.0439  0.0014  0.5370  0.0002  0.5230
Hochberg      0.0070  0.0305  0.0004  0.5448  0.0189  0.0439  0.0014  0.5370  0.0002  0.5230
Sidak         0.0070  0.0309  0.0003  0.5419  0.0184  0.0444  0.0013  0.5369  0.0001  0.5146
SidakSD       0.0070  0.0310  0.0004  0.5457  0.0184  0.0444  0.0015  0.5379  0.0002  0.5238
BH            0.0367  0.1881  0.0405  0.7090  0.0465  0.0992  0.0440  0.6622  0.0243  0.7276
BY            0.0063  0.0303  0.0055  0.6322  0.0065  0.0221  0.0056  0.5832  0.0034  0.6386
DwnInd        0.0071  0.0312  0.0005  0.5494  0.0184  0.0446  0.0015  0.5390  0.0002  0.5339
DwnFree       0.0070  0.0307  0.0005  0.5485  0.0227  0.0455  0.0015  0.5382  0.0002  0.5328
Q-value       0.0485  0.2109  0.0545  0.7185  0.0599  0.1076  0.0545  0.6658  0.0464  0.7581
Adaptive1     0.0397  0.1961  0.0470  0.7155  0.0487  0.1003  0.0459  0.6639  0.0384  0.7507
Adaptive2     0.0380  0.1924  0.0473  0.7158  0.0469  0.0997  0.0456  0.6641  0.0390  0.7517
Resamp.point  0.0370  0.1880  0.0425  0.7110  0.0465  0.0969  0.0480  0.6653  0.0255  0.7301
Resamp.upper  0.0321  0.1591  0.0391  0.7070  0.0329  0.0726  0.0405  0.6598  0.0232  0.7253
EBayes.ZT     0.1846  0.4319  0.0717  0.7313  0.4001  0.3547  0.0883  0.6878  0.0548  0.7688

Table 6d. Simulation results (genes highly correlated within each block)

Simulation    13              14              15              16
Method        FDR     Power   FDR     Power   FDR     Power   FDR     Power
Rawp          0.2018  0.4392  0.1367  0.7155  0.3013  0.4510  0.1835  0.7230
Bonferroni    0.0015  0.0359  0.0005  0.3739  0.0123  0.0386  0.0006  0.3794
Holm          0.0015  0.0363  0.0005  0.3774  0.0123  0.0390  0.0006  0.3826
Hochberg      0.0015  0.0363  0.0005  0.3774  0.0123  0.0390  0.0006  0.3826
Sidak         0.0015  0.0363  0.0005  0.3749  0.0128  0.0390  0.0006  0.3805
SidakSD       0.0015  0.0367  0.0005  0.3784  0.0128  0.0395  0.0006  0.3838
BH            0.0150  0.1639  0.0273  0.5946  0.0301  0.1764  0.0344  0.6022
BY            0.0001  0.0620  0.0057  0.4846  0.0028  0.0704  0.0046  0.4910
DwnInd        0.0015  0.0372  0.0006  0.3818  0.0128  0.0400  0.0006  0.3875
DwnFree       0.0335  0.0392  0.0005  0.3806  0.1443  0.0412  0.0006  0.3863
Q-value       0.0927  0.2387  0.1055  0.6441  0.0597  0.2131  0.0686  0.6291
Adaptive1     0.0244  0.1751  0.0361  0.6043  0.0376  0.1872  0.0442  0.6121
Adaptive2     0.0173  0.1688  0.0322  0.6025  0.0318  0.1823  0.0405  0.6112
Resamp.point  0.0465  0.2259  0.0413  0.6232  0.0443  0.2030  0.0443  0.6164
Resamp.upper  0.0380  0.2059  0.0352  0.6091  0.0413  0.1879  0.0384  0.6091
EBayes.ZT     0.1026  0.2707  0.1053  0.6492  0.0906  0.2543  0.0975  0.6462



Table 6e. Simulation results (all the genes highly correlated)

Simulation    17              18
Method        FDR     Power   FDR     Power
Rawp          0.1426  0.4489  0.0948  0.7217
Bonferroni    0.0003  0.0409  0.0016  0.3767
Holm          0.0003  0.0413  0.0018  0.3806
Hochberg      0.0003  0.0413  0.0018  0.3807
Sidak         0.0003  0.0415  0.0017  0.3781
SidakSD       0.0003  0.0417  0.0018  0.3816
BH            0.0317  0.1529  0.0184  0.6009
BY            0.0063  0.0532  0.0032  0.4907
DwnInd        0.0003  0.0420  0.0020  0.3848
DwnFree       0.0003  0.0426  0.0020  0.3837
Q-value       0.1646  0.2936  0.1691  0.6936
Adaptive1     0.0563  0.1777  0.0426  0.6211
Adaptive2     0.0357  0.1584  0.0234  0.6099
Resamp.point  0.0534  0.2752  0.0366  0.6452
Resamp.upper  0.0421  0.2544  0.0286  0.6390
EBayes.ZT     0.1414  0.3070  0.1243  0.6724

the FDR methods they studied (FDR BH, Q-value, resampling method) could not control FDR at the specified level when the proportion of differentially expressed genes was below 10%. They recommended adjusting the prespecified level by half if the proportion of positive genes is below 10%. In our simulations, FDR BH controlled FDR at the specified level even when the proportion of positive genes was low. This inconsistency could be related to the way that Li et al. (2005) generated the data. They first removed the potential differences between the two groups not attributable to noise by subtracting the group means from the original microarray data and then adding back the overall means to get a modified dataset. The simulated data for the first group were generated by repeatedly resampling from the modified data in the first group, and the simulated data for the second group were generated by repeatedly resampling from the modified data in the second group. This procedure provides an estimated test statistic null distribution that is asymptotically correct (as the sample size goes to infinity); see Pollard and van der Laan (2004) and Huang et al. (2006). However, the simulations in Li et al. (2005) used 80 samples in each group, which might not be large enough. It is possible that, with a sufficiently large sample size, FDR BH could control FDR even when the proportion of differentially expressed genes is below 10%. For example, using the Golub data with a sample size of 500 in each group, FDR BH controlled FDR even when the proportion of differentially expressed genes was 1%. With a sample size of 10 per group, however, the FDR BH procedure inflated FDR even when the proportion of differentially expressed genes was higher than 20%.

To explore this further, we modified the data as above, but generated the simulated data by resampling from the pooled data. A subset of positive genes was then randomly selected and differences added. With these data, the FDR BH method controlled FDR at the specified level even when the proportion of differentially expressed genes was low and the sample size was not very large. If the sample size is the same for the two groups, resampling from the pooled data of the two groups produces an asymptotically correct test statistic null distribution and may be more efficient for small sample sizes; see Pollard and van der Laan (2004). However, we need to be cautious here. According to Huang et al. (2006), Xu and Hsu (2007), and Calian et al. (2008), if the sample size is not large enough, even if the sample sizes are equal for the two groups, differences in cumulants of order higher than two between the two groups can cause the permutation method to be liberal. Xu and Hsu (2007) and Calian et al. (2008) showed that the marginals-determine-the-joint (MDJ) condition must be satisfied to correctly estimate the null distribution.
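The pooled resampling scheme described here could look roughly like the sketch below. It is only one possible reading: resampling whole samples with replacement (so that gene-gene correlation is kept), adding the differences to the simulated first group, and the names `center_groups`, `simulate_from_pooled`, and `effects` are all assumptions made for illustration.

```python
import random


def center_groups(g1, g2):
    """Per gene, subtract each group's mean and add back the overall mean.

    g1, g2: lists of samples; each sample is a list of expression values.
    """
    n1, n2 = len(g1), len(g2)
    out1 = [row[:] for row in g1]
    out2 = [row[:] for row in g2]
    for k in range(len(g1[0])):
        mean1 = sum(r[k] for r in g1) / n1
        mean2 = sum(r[k] for r in g2) / n2
        overall = (mean1 * n1 + mean2 * n2) / (n1 + n2)
        for r in out1:
            r[k] += overall - mean1
        for r in out2:
            r[k] += overall - mean2
    return out1, out2


def simulate_from_pooled(g1, g2, n, effects, rng):
    """Draw n samples per group from the pooled, centered data.

    effects: dict mapping gene index -> difference added in simulated group 1
             (the randomly selected "positive" genes).
    """
    c1, c2 = center_groups(g1, g2)
    pooled = c1 + c2
    sim1 = [[v + effects.get(k, 0.0) for k, v in enumerate(rng.choice(pooled))]
            for _ in range(n)]
    sim2 = [list(rng.choice(pooled)) for _ in range(n)]
    return sim1, sim2


# Toy usage: 2 genes, 2 samples per group, a shift of 2.0 added to gene 1.
a = [[1.0, 10.0], [3.0, 10.0]]
b = [[5.0, 20.0], [7.0, 20.0]]
sim1, sim2 = simulate_from_pooled(a, b, 3, {1: 2.0}, random.Random(0))
```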

4.3.3 Condition 3: Highly Correlated Test Statistics (Simulations 13–18)

In simulations 13–16, the genes were divided into blocks, and were highly correlated within each block. See Tables 5a and 5b for the correlation structures. In simulations 17–18, all the genes were highly positively correlated with ρ = 0.8.

Generally speaking, resampling-based methods outperformed other methods for highly correlated data. The BH method, the two-stage step-up adaptive method (adaptive2), and the resampling upper limit method controlled FDR at 0.05 for all the simulations using highly correlated data (simulations 13–18). The methods differed in power, with the resampling upper limit method being the most powerful, followed by the two-stage step-up adaptive method (adaptive2) and the BH method, in decreasing order. The resampling point estimate method and the step-up adaptive method (adaptive1) controlled FDR at 0.05 for simulations 13–16, and nearly controlled FDR in simulations 17–18. However, the power estimates for the resampling-based methods were much higher than those of the other methods. As the number of highly correlated genes decreased (simulation 15), the advantage of resampling-based methods decreased. The actual FDRs from the Q-value and the empirical Bayes methods were too high. This was consistent with the findings of Benjamini, Krieger, and Yekutieli (2006), who found that the BH method and the two-stage step-up adaptive method controlled FDR in all their simulations with correlated data, while the adaptive linear step-up method might not control FDR, and the Q-value method could be liberal for highly correlated data.

For highly correlated data, the order of the FDR methods based on power was: resampling methods ≥ adaptive methods > FDR BH method > FDR BY, FDR step-down methods. Of the two resampling methods, the resampling point estimate method was more powerful than the resampling upper limit method. Of the two adaptive methods, the adaptive linear step-up method was more powerful than the two-stage adaptive linear step-up method. Note that Q-value was not included here since it is liberal for highly correlated data.

4.3.4 Unequal Variance-Covariance Matrices Between Two Groups

We also ran simulations with unequal variance-covariance matrices between the two groups, using simulated independent or correlated gene expression data. In addition, we ran simulations with both unequal variance-covariance matrices and unequal sample sizes, with small and large sample sizes.

For each dataset, we used the two-sample Welch’s t-statistic to compare the expression levels in the two treatment groups for each gene, and then applied the multiple testing procedures described in Table 2 to determine which genes were differentially expressed. The conclusions for comparing different FDR procedures based on the simulations with unequal variances were consistent with those based on the simulations with equal variances, except for the resampling-based FDR methods.
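As a reminder of what changes relative to the equal-variance case, Welch's statistic uses separate variance estimates and the Welch-Satterthwaite approximation for the degrees of freedom. The sketch below is illustrative only; the function name `welch_t` is an assumption, and the p-value lookup is again omitted.

```python
import math


def welch_t(x, y):
    """Welch's two-sample t-statistic with Welch-Satterthwaite degrees of freedom.

    x, y: expression values of one gene in groups 1 and 2 (sizes may differ).
    Returns (t, approximate degrees of freedom).
    """
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    v1 = sum((v - m1) ** 2 for v in x) / (n1 - 1)   # sample variance, group 1
    v2 = sum((v - m2) ** 2 for v in y) / (n2 - 1)   # sample variance, group 2
    a, b = v1 / n1, v2 / n2
    t = (m1 - m2) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


# Unequal sample sizes and visibly different spreads.
t_w, df_w = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
```

With equal group sizes and equal sample variances, the statistic and the degrees of freedom reduce to the pooled-variance values.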

We observed that if the sample sizes were too small (5 samples per group in our simulations) or the sample sizes were not equal for the two groups, the resampling-based FDR methods did not control FDR at the specified level when the variance-covariance structures for the two groups were not the same. If the study was balanced and the sample size was moderate, the resampling-based methods controlled FDR and had more power than other FDR procedures for highly correlated data.

These results can be explained by noting that if the variance-covariance structures are different for the two groups and the sample sizes are not equal, the permutation method will not provide the correct null distribution; see Huang et al. (2006). For unequal variance-covariance structures, the permutation method produces an asymptotically correct null distribution if the sample sizes are equal; see Pollard and van der Laan (2004). If the data are multivariate normal and the sample sizes are equal, the permutation distribution of the test statistic coincides with the distribution under the null hypothesis. In our simulations, the resampling FDR methods slightly inflated the FDR for small sample sizes (5 samples per group) for multivariate normal data with equal sample sizes, because the variances in the denominators of the t-statistics may not be estimated correctly for very small samples; see Pollard and van der Laan (2004). If the sample size is not large enough and the data are not multivariate normal, even if the sample sizes are the same for the two groups, the permutation method may not provide the correct null distribution. Xu and Hsu (2007) and Calian et al. (2008) showed that the marginals-determine-the-joint (MDJ) condition must be satisfied to correctly estimate the null distribution using the permutation method.

5. Conclusions and Discussions

We compared various FWER and FDR procedures using both real and simulated microarray datasets. All the FWER approaches were highly conservative with low power, and therefore are not well suited for typical microarray data. The step-down FDR procedures and the BY FDR procedure were also conservative, with powers close to those of the FWER procedures. According to Benjamini and Liu (1999a), the step-down FDR procedures have higher power when the number of tested hypotheses is small and many of the hypotheses are far from being true. This was not the case for our microarray studies, which might account for the poor performance.

The BH method controlled FDR in all of our simulations and real microarray examples with reasonable power. Since it is designed to control FDR at level q ∗ m0/m, it can be conservative if the proportion of differentially expressed genes is large, as seen in the results from simulations 7 and 12.
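For readers who want the mechanics, the BH step-up rule can be sketched as follows: sort the m p-values, find the largest rank k with p(k) ≤ kq/m, and reject the k smallest. The function name `bh_reject` is hypothetical and the code is a minimal sketch, not the implementation used for the tables.

```python
def bh_reject(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns the set of indices of rejected hypotheses at nominal level q.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k = rank            # keep the largest rank that meets its threshold
    return set(order[:k])


pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = bh_reject(pvals, q=0.05)
```

Because the thresholds depend on m rather than on the unknown number of true nulls m0, the level actually attained under independence is q ∗ m0/m, which is where the conservativeness noted above comes from.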

Adaptive and Q-value methods are expected to have better performance because they estimate m0 and control FDR at level q instead of q ∗ m0/m. If the test statistics were independent, both Q-value and adaptive methods controlled FDR at the specified level. However, if the test statistics were correlated and the proportion of differentially expressed genes was small, adaptive methods and Q-value had little advantage. For correlated test statistics, the actual FDR using the Q-value method can be a little higher than the specified level, and the Q-value method can be very liberal for highly correlated test statistics.

The two-stage step-up adaptive method (adaptive2) controlled FDR in all simulations. The step-up adaptive method (adaptive1) controlled FDR in most situations; however, its actual FDR was a little higher than the specified level if the test statistics were highly correlated. If the proportion of differentially expressed genes is large (e.g., if a preselection for possibly interesting genes is done), the two-stage adaptive method has an advantage.
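To make the source of that advantage concrete, the sketch below follows the two-stage linear step-up procedure as we understand it from Benjamini, Krieger, and Yekutieli (2006): run BH at level q' = q/(1 + q), estimate m0 as m minus the number of first-stage rejections, then rerun BH at level q'm/m0. Whether this matches the adaptive2 entry of Table 2 in every detail is an assumption, and the function names are hypothetical.

```python
def bh_count(pvalues, q):
    """Number of rejections made by the BH step-up procedure at level q."""
    m = len(pvalues)
    k = 0
    for rank, p in enumerate(sorted(pvalues), start=1):
        if p <= rank * q / m:
            k = rank
    return k


def two_stage_count(pvalues, q=0.05):
    """Number of rejections made by the two-stage adaptive step-up procedure."""
    m = len(pvalues)
    q1 = q / (1.0 + q)
    r1 = bh_count(pvalues, q1)          # stage 1: BH at the reduced level q'
    if r1 == 0 or r1 == m:
        return r1                       # nothing, or everything, is rejected
    m0_hat = m - r1                     # estimated number of true nulls
    return bh_count(pvalues, q1 * m / m0_hat)   # stage 2: BH at q' * m / m0_hat


pvals = [0.001, 0.002, 0.003, 0.004, 0.03, 0.5, 0.6, 0.7, 0.8, 0.9]
n_bh = bh_count(pvals, 0.05)
n_two_stage = two_stage_count(pvals, 0.05)
```

In this toy example four of ten hypotheses are rejected at the first stage, so the second stage runs at a larger level and picks up one more rejection than plain BH; when few hypotheses are rejected at the first stage, the second-stage level stays close to q' and the gain disappears.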

With regard to power, Q-value had higher power than the adaptive methods, and the adaptive methods had higher power than the BH method.

When the test statistics were not highly correlated and the required conditions for the resampling-based methods were satisfied, the performance of the resampling-based methods was quite close to that of the BH method. It is known that both the BH method and the resampling-based methods are conservative when the proportion of alternative hypotheses is large; see Yekutieli and Benjamini (1999). For highly correlated statistics, resampling FDR methods outperformed all other methods, especially when the proportion of positive hypotheses was small.

For microarray studies involving many genes, the data are likely to be correlated but not highly correlated for most genes, and the proportion of differentially expressed genes may not be high. BH is a quick, easy, and satisfactory procedure to control FDR with reasonable power. If the proportion of differentially expressed genes is expected to be high, the two-stage adaptive method can be used. The Q-value method can also be used if one wants to identify more genes and is willing to accept an FDR possibly higher than the specified level.

[Received October 2008. Revised May 2009.]

REFERENCES

Benjamini, Y., and Hochberg, Y. (1995), “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing,” Journal of the Royal Statistical Society, Series B, 57(1), 289–300.

Benjamini, Y., and Liu, W. (1999a), “A Step-Down Multiple Hypotheses Testing Procedure that Controls the False Discovery Rate Under Independence,” Journal of Statistical Planning and Inference, 82, 163–170.

Benjamini, Y., and Liu, W. (1999b), “A Distribution-Free Multiple-Test Procedure that Controls the False Discovery Rate,” Research paper 99-3, Department of Statistics and Operation Research, Tel Aviv University, available online at http://www.math.tau.ac.il/~ybenja/.

Benjamini, Y., and Yekutieli, D. (2001), “The Control of the False Discovery Rate in Multiple Testing under Dependency,” The Annals of Statistics, 29(4), 1165–1188.

Benjamini, Y., Krieger, A., and Yekutieli, D. (2006), “Adaptive Linear Step-Up Procedures that Control the False Discovery Rate,” Biometrika, 93(3), 491–507.

Calian, V., Li, D., and Hsu, J. (2008), “Partitioning to Uncover Conditions for Permutation Tests to Control Multiple Testing Error Rates,” Biometrical Journal, 50, 756–766.

Dudoit, S., Fridlyand, J., and Speed, T. P. (2002), “Comparison of Discrimination Methods for the Classification of Tumors using Gene Expression Data,” Journal of the American Statistical Association, 97, 77–87.

Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2003), “Multiple Hypothesis Testing in Microarray Experiments,” Statistical Science, 18(1), 71–103.

Efron, B., Tibshirani, R., Storey, J.D., and Tusher, V. (2001), “Empirical Bayes Analysis of a Microarray Experiment,” Journal of the American Statistical Association, 96, 1151–1160.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. (1999), “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring,” Science, 286, 531–537.

Hochberg, Y. (1988), “A Sharper Bonferroni Procedure for Multiple Significance Testing,” Biometrika, 75, 800–803.

Holm, S. (1979), “A Simple Sequentially Rejective Bonferroni Test Procedure,” Scandinavian Journal of Statistics, 6, 65–70.

Huang, E., Cheng, S.H., Dressman, H., Pittman, J., Tsou, M.H., Horng, C.F., Bild, A., Iversen, E.S., Liao, M., Chen, C.M., West, M., Nevins, J.R., and Huang, A.T. (2003), “Gene Expression Predictors of Breast Cancer Outcomes,” Lancet, 361, 1590–1596.

Huang, Y., Xu, H., Calian, V., and Hsu, J. (2006), “To Permute or not to Permute,” Bioinformatics, 22(18), 2244–2248.

Li, S.S., Bigler, J., Lampe, J.W., Potter, J.D., and Feng, Z. (2005), “FDR-Controlling Testing Procedures and Sample Size Determination for Microarrays,” Statistics in Medicine, 24(15), 2267–2280.

Pollard, K. S., and van der Laan, M. (2004), “Choice of a Null Distribution in Resampling-Based Multiple Testing,” Journal of Statistical Planning and Inference, 125, 85–100.

Qian, H., and Huang, S. (2005), “Comparison of False Discovery Rate Methods in Identifying Genes with Differential Expression,” Genomics, 86, 495–503.

Reiner, A., Yekutieli, D., and Benjamini, Y. (2003), “Identifying Differentially Expressed Genes using False Discovery Rate Controlling Procedures,” Bioinformatics, 19(3), 368–375.

Schweder, T., and Spjøtvoll, E. (1982), “Plots of p-values to Evaluate Many Tests Simultaneously,” Biometrika, 69, 493–502.

Sidak, Z. (1967), “Rectangular Confidence Regions for the Means of Multivariate Normal Distributions,” Journal of the American Statistical Association, 62, 626–633.

Storey, J. (2002), “A Direct Approach to False Discovery Rates,” Journal of the Royal Statistical Society, Series B, 64, 479–498.

Storey, J., and Tibshirani, R. (2003), “Statistical Significance for Genomewide Studies,” Proceedings of the National Academy of Sciences USA, 100, 9440–9445.

West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., Zuzan, H., Marks, J.R., and Nevins, J.R. (2001), “Predicting the Clinical Status of Human Breast Cancer Utilizing Gene Expression Profiles,” Proceedings of the National Academy of Sciences, 98, 11462–11467.

Xu, H., and Hsu, J. C. (2007), “Using the Partitioning Principle to Control the Generalized Family Error Rate,” Biometrical Journal, 49, 52–67.

Yekutieli, D., and Benjamini, Y. (1999), “Resampling-Based False Discovery Rate Controlling Multiple Test Procedures for Correlated Test Statistics,” Journal of Statistical Planning and Inference, 82, 171–196.

Zhang, C., and Tang, W. (2005), “Empirical Bayes and Conditional False Discovery Rate,” Technical Report, Department of Statistics, Rutgers University.

About the Authors

Donghui Zhang is Distinguished Statistician, and Li Liu is Senior Principal Statistician, Biostatistics and Programming, Sanofi-Aventis, Mail Stop BX2-403A, Bridgewater, NJ 08807 (E-mail: [email protected]).
