
arXiv:1310.0424v2 [q-bio.QM] 12 Dec 2013

Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible

Paul J. McMurdie¹, Susan Holmes¹*
¹ Statistics Department, Stanford University, Stanford, CA, USA
* E-mail: [email protected]

Abstract

Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefaction of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model, for instance) are already available in RNA-Seq-focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory, and use simulations and empirical data to precisely demonstrate the substantial improvements provided by a relevant mixture model approach over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but nevertheless tends toward a higher false positive rate, a key trade-off that may be critical for the goals of different investigations. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package phyloseq.

Author Summary

The term microbiome refers to the ecosystem of microbes that live in a defined environment. The decreasing cost and increasing speed of DNA sequencing technology have recently provided scientists with affordable and timely access to the genes and genomes of the microbiomes that inhabit our planet and even our own bodies. In these investigations, many microbiome samples are sequenced at the same time on the same DNA sequencing machine, but often result in total numbers of sequences per sample that are vastly different. The common procedure for addressing this difference in sequencing effort across samples (different library sizes) is either to (1) base analyses on the proportional abundance of each species in a library, or (2) rarefy: throw away sequences from the larger libraries so that all have the same, smallest size. We show that both of these normalization methods sometimes work acceptably well for the purpose of comparing entire microbiomes to one another, but that neither method works well when comparing the relative proportions of each bacterial species across microbiome samples. We show that alternative methods based on a statistical mixture model perform very well, and can be easily adapted from a separate biological sub-discipline called RNA-Seq analysis.

Introduction

Modern, massively parallel DNA sequencing technologies have changed the scope and technique of investigations across many fields of biology [1, 2]. In gene expression studies the standard measurement technique has shifted away from microarray hybridization to direct sequencing of cDNA, a technique often referred to as RNA-Seq [3]. Analogously, culture-independent [4] microbiome research has migrated away from detection of species through microarray hybridization of small subunit rRNA gene PCR amplicons [5] to direct sequencing of highly variable regions of these amplicons [6], or even direct shotgun sequencing of microbiome metagenomic DNA [7]. Even though the statistical methods available for analyzing microarray data have matured to a high level of sophistication [8], these methods are not directly applicable because DNA sequencing data consist of discrete counts of equivalent sequence reads rather than continuous values derived from the fluorescence intensity of hybridized probes. In recent-generation DNA sequencing, the total reads per sample (library size; sometimes referred to as depth of coverage) can vary by orders of magnitude within a single sequencing run. Comparison across samples with different library sizes requires more than a simple linear or logarithmic scaling adjustment because it also implies different levels of uncertainty, as measured by the sampling variance of the proportion estimate for each feature (a feature is a gene in the RNA-Seq context, and is a species or Operational Taxonomic Unit, OTU, in the context of microbiome sequencing). In this article we are primarily concerned with optimal methods for addressing differences in library sizes from microbiome sequencing data.

Variation in the read counts of features between technical replicates has been adequately modeled by Poisson random variables [9]. However, we are usually interested in understanding the variation of features among biological replicates in order to make inferences that are relevant to the corresponding population; in that case a mixture model is necessary to account for the added uncertainty [10]. Taking a hierarchical model approach with the Gamma-Poisson has provided a satisfactory fit to RNA-Seq data [11], as well as a valid regression framework that leverages the power of generalized linear models [12]. A Gamma mixture of Poisson variables gives the negative binomial (NB) distribution [10, 11], and several RNA-Seq analysis packages now model the counts, K, for gene i in sample j according to:

$K_{ij} \sim \mathrm{NB}(s_j \mu_i, \phi_i)$ (1)

where $s_j$ is a linear scaling factor for sample j that accounts for its library size, $\mu_i$ is the mean proportion for gene i, and $\phi_i$ is the dispersion parameter for gene i.

The variance is $\nu_i = s_j \mu_i + \phi_i s_j^2 \mu_i^2$, with the NB distribution becoming Poisson when $\phi = 0$. Recognizing that $\phi > 0$, and estimating its value, is important in gene-level tests: genes can test as significantly differentially expressed between experimental conditions under the assumption of a Poisson distribution, but nevertheless fail in tests that account for non-zero dispersion, so modeling the dispersion better controls the rate of false positive genes.
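The Gamma-Poisson construction behind Equation (1) can be checked empirically. The following is a minimal sketch in Python with numpy; the specific values for $s_j$, $\mu_i$, and $\phi_i$ are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

s_j, mu_i, phi_i = 2.0, 25.0, 0.5    # illustrative scaling factor, mean, dispersion
mean = s_j * mu_i                     # expected count, s_j * mu_i

# Gamma mixture of Poissons: draw a Gamma-distributed rate, then a Poisson count.
# shape = 1/phi, scale = phi * mean  =>  E[rate] = mean, Var[rate] = phi * mean^2
rates = rng.gamma(shape=1.0 / phi_i, scale=phi_i * mean, size=200_000)
counts = rng.poisson(rates)

# The counts are then NB-distributed with variance s_j*mu_i + phi_i*(s_j*mu_i)^2
expected_var = mean + phi_i * mean ** 2
print(counts.mean(), counts.var(), expected_var)
```

The sample mean should be close to $s_j \mu_i$ and the sample variance close to $s_j \mu_i + \phi_i s_j^2 \mu_i^2$, well above the Poisson variance of a distribution with the same mean.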

The uncertainty in estimating $\phi_i$ for every gene when there is a small number of samples — or a small number of biological replicates — can be mitigated by sharing information across the thousands of genes in an experiment, leveraging a systematic trend in the mean-dispersion relationship [11]. This approach substantially increases the power to detect differences in proportions (differential expression) while still adequately controlling for false positives [13]. Many R packages implementing this model of RNA-Seq data are now available, differing mainly in their approach to modeling dispersion across genes [14]. Although DNA sequencing-based microbiome investigations use the same sequencing machines and represent the processed sequence data in the same manner — a feature-by-sample contingency table where the features are OTUs instead of genes — to our knowledge the modeling and normalization methods currently used in RNA-Seq analysis have not been transferred to microbiome research [15–17].

Instead, microbiome analysis workflows often begin with an ad hoc library size normalization by random subsampling without replacement, or so-called rarefying [17–19]. There is confusion in the literature regarding terminology, and sometimes this normalization approach is conflated with a non-parametric resampling technique — called rarefaction [20], or individual-based taxon re-sampling curves [21] — that can be justified for coverage analysis or species richness estimation in some settings [21], though in other settings it can perform worse than parametric methods [22]. Here we emphasize the distinction between taxon re-sampling curves and normalization by strictly adhering to the terms rarefying or rarefied counts when referring to the normalization procedure, and respecting the original definition for rarefaction. Rarefying is most often defined by the following steps [18].

1. Select a minimum library size, $N_{L,min}$. This has been called the rarefaction level [17], though we will not use that term here.

2. Discard libraries (samples) that have fewer reads than $N_{L,min}$.

3. Subsample the remaining libraries without replacement such that they all have size $N_{L,min}$.
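The three steps above can be sketched directly. The following is a minimal, hypothetical implementation in Python with numpy, for illustration only (it is not the phyloseq or QIIME code):

```python
import numpy as np

def rarefy(counts, n_min, rng):
    """Rarefy an OTU-by-sample count matrix: drop samples whose library
    size is below n_min, then subsample each remaining library, without
    replacement, down to exactly n_min reads."""
    counts = np.asarray(counts)
    keep = counts.sum(axis=0) >= n_min           # step 2: discard small libraries
    out = []
    for col in counts[:, keep].T:
        # expand the library into individual reads, labeled by OTU index
        reads = np.repeat(np.arange(counts.shape[0]), col)
        sub = rng.choice(reads, size=n_min, replace=False)   # step 3
        out.append(np.bincount(sub, minlength=counts.shape[0]))
    return np.array(out).T

rng = np.random.default_rng(42)
otu_table = np.array([[62, 500, 3],
                      [38, 500, 2]])             # two OTUs, three samples
rarefied = rarefy(otu_table, n_min=100, rng=rng)
print(rarefied.sum(axis=0))                      # every kept library now has 100 reads
```

Note that the third sample (5 total reads) is discarded under step 2, and the first sample, already at exactly 100 reads, passes through unchanged.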

Often $N_{L,min}$ is chosen to be equal to the size of the smallest library that is not considered defective, and the process of identifying defective samples comes with a risk of subjectivity and bias. In many cases researchers have also failed to repeat the random subsampling step or to record the pseudorandom number generation seed/process — both of which are essential for reproducibility. To our knowledge, rarefying was first recommended for microbiome counts in order to moderate the sensitivity of the UniFrac distance [23] to library size, especially differences in the presence of rare OTUs [24]. In these and similar studies the principal objective is an exploratory/descriptive comparison of microbiome samples, often from different environmental/biological sources; a research task that is becoming increasingly accessible with declining sequencing costs and the ability to sequence many samples in parallel using barcoded primers [25, 26]. Rarefying is now an exceedingly common precursor to microbiome multivariate workflows that seek to relate sample covariates to sample-wise distance matrices [19, 27, 28]; for example, it is integrated as a recommended option in QIIME's [29] beta_diversity_through_plots.py workflow, in Sub.sample in the mothur software library [30], in daisychopper.pl [31], and is even supported in phyloseq's rarefy_even_depth function [32] (though not recommended in its documentation). The perception in the microbiome literature of "rarefying to even sampling depth" as a standard normalization procedure appears to explain why rarefied counts are also used in studies that attempt to detect differential abundance of OTUs between predefined classes of samples [33–37], in addition to studies that use proportions directly [38]. It should be noted that we have adopted the recently coined term differential abundance [39, 40] as a direct analogy to differential expression from RNA-Seq. Like differentially expressed genes, a species/OTU is considered differentially abundant if its mean proportion is significantly different between two or more sample classes in the experimental design.

Statistical motivation

Despite its current popularity in microbiome analyses, rarefying biological count data is statistically inadmissible because it requires the omission of available valid data. This holds even if repeated rarefying trials are compared for stability, as previously suggested [17]. In this article we demonstrate the applicability of a variance stabilization technique based on a mixture model of microbiome count data. This approach simultaneously addresses both problems of (1) DNA sequencing libraries of widely different sizes, and (2) OTU (feature) count proportions that vary more than expected under a Poisson model. We utilize the most popular implementations of this approach currently used in RNA-Seq analysis, namely edgeR [41] and DESeq [13], adapted here for microbiome data. This approach allows valid comparison across OTUs while substantially improving both power and accuracy in the detection of differential abundance. We also compare the Gamma-Poisson mixture model performance against a method that models OTU proportions using a zero-inflated Gaussian distribution, implemented in a recently released package called metagenomeSeq [40].

Table 1. A minimal example of the effect of rarefying on power.

Original Abundance
             A      B
  OTU1       62     500
  OTU2       38     500
  Total      100    1000

Rarefied Abundance
             A      B
  OTU1       62     50
  OTU2       38     50
  Total      100    100

Standard Tests for Difference (P-value)
             χ²       Prop     Fisher
  Original   0.0290   0.0290   0.0272
  Rarefied   0.1171   0.1171   0.1169

Hypothetical abundance data in its original (top) and rarefied (middle) form, with corresponding formal test results for differentiation (bottom).

A mathematical proof of the sub-optimality of the subsampling approach is presented in the supplementary material (Text S1). To help explain why rarefying is statistically inadmissible, especially with regard to variance stabilization, we start with the following minimal example. Suppose we want to compare two different samples, called A and B, comprised of 100 and 1000 DNA reads, respectively. In statistical terms, these library sizes are also equivalent to the number of trials in a sampling experiment. In practice, the library size associated with each biological sample is a random number generated by the technology, often varying from hundreds to millions. For our example, we imagine the simplest possible case, where the samples can only contain two types of microbes, called OTU1 and OTU2.


The results of this hypothetical experiment are represented in the Original Abundance section of Table 1. Formally comparing the two proportions according to a standard test could technically be done using either a χ² test (equivalent to a two-sample proportion test here) or a Fisher exact test. By first rarefying (Table 1, Rarefied Abundance section) so that both samples have the same library size before doing the tests, we are no longer able to differentiate the samples (Table 1, tests). This loss of power is completely attributable to reducing the size of B by a factor of 10, which also increases the confidence intervals corresponding to each proportion such that they are no longer distinguishable from those in A, even though they are distinguishable in the original data.
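The tests in Table 1 can be reproduced with standard library routines. A sketch in Python using scipy follows; the χ² test here uses the Yates continuity correction (scipy's default for 2×2 tables), which appears to match the reported values:

```python
from scipy.stats import chi2_contingency, fisher_exact

original = [[62, 500],   # OTU1 counts in samples A and B
            [38, 500]]   # OTU2 counts in samples A and B
rarefied = [[62, 50],
            [38, 50]]

results = {}
for name, table in [("Original", original), ("Rarefied", rarefied)]:
    _, p_chi2, _, _ = chi2_contingency(table)   # Yates-corrected for 2x2 by default
    _, p_fisher = fisher_exact(table)
    results[name] = (p_chi2, p_fisher)
    print(name, round(p_chi2, 4), round(p_fisher, 4))
```

The p-values for the original table fall below the conventional 0.05 threshold, while the rarefied versions of the same data do not.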

The variance of the proportion's estimate, p, is multiplied by 10 when the total count is divided by 10. In this binomial example the variance of the proportion estimate is $\mathrm{Var}(X/n) = pq/n = (q/n)\,E(X/n)$, a function of the mean. This is a common occurrence, and one that is traditionally dealt with in statistics by applying variance-stabilizing transformations. We show in Text S1 that the relation between the variance and the mean for microbiome count data can be estimated, and the model used to find the optimal variance-stabilizing transformation. As illustrated by this simple example, it is inappropriate to compare the proportions of OTU i, $p_i = K_{ij}/s_j$, without accounting for differences in the denominator value (the library size, $s_j$), because they have unequal variances. This problem of unequal variances is called heteroscedasticity. In other words, the uncertainty associated with each value in the table is fundamentally linked to the total number of observations (or reads), which can vary even more widely than a 10-fold difference. In practice we will be observing hundreds of different OTUs instead of two, often with dependency between the counts. Nevertheless, the difficulty caused by unequal library sizes still pertains.
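A variance-stabilizing transformation removes this mean dependence. As a minimal illustration for the binomial case, the sketch below uses the classic arcsine-square-root transform, an illustrative stand-in and not the transformation derived in Text S1:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000                       # library size (number of binomial trials)
raw_vars, vst_vars = [], []

for p in (0.1, 0.3, 0.5):
    phat = rng.binomial(n, p, size=100_000) / n
    raw_vars.append(phat.var())                       # ~ p(1-p)/n, depends on p
    vst_vars.append(np.arcsin(np.sqrt(phat)).var())   # ~ 1/(4n), roughly constant

print(raw_vars)   # variance of the raw proportion changes with the mean
print(vst_vars)   # variance after the transform is nearly constant
```

The raw proportion's variance changes with p, while the transformed values have approximately constant variance 1/(4n) regardless of the mean, which is what makes values at different means comparable.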

The uncertainty with which each proportion is estimated must be considered when testing for a difference between proportions (one OTU), or sets of proportions (a microbial community). Although rarefying does equalize variances, it does so only by inflating the variances in all samples to the largest (worst) value among them, at the cost of discriminating power (increased uncertainty). Rarefying adds additional uncertainty through the random subsampling step, such that Table 1 shows the best case, approached only with a sufficient number of repeated rarefying trials (see Protocol S2, minimal example). In this sense alone, the random step in rarefying is unnecessary: each count value could instead be transformed to a common scale by rounding $K_{ij} s_{min}/s_j$. Although this common-scale approach is an improvement over the rarefying method defined here, both methods suffer from the same problems related to lost data.

[Figure 1 appears here: log-scale Variance versus Mean panels, with columns for the Common Scale and Rarefied normalizations and rows for several sample types (e.g. 2004 HighFat from Dietary Patterns; freshwater and marine from Global Patterns).]

Figure 1. Overdispersion in Microbiome Data. Common-scale variance versus mean for microbiome data. Each point in each panel represents a different OTU's mean/variance estimate for a biological replicate and study. The data in this figure come from the Global Patterns survey [42] and the Long-Term Dietary Patterns study [43], with results from many more studies included in Protocol S2. (Right) Variance versus mean abundance for rarefied counts. (Left) Common-scale variances and common-scale means, estimated according to Equations 7 and 6 from Anders and Huber [13], implemented in the DESeq package (Text S1). The dashed gray line denotes the σ² = µ case (Poisson; φ = 0). The cyan curve denotes the fitted variance estimate using DESeq [13], with method='pooled', sharingMode='fit-only', fitType='local'.


Materials and Methods

In order to quantify the relative statistical costs of rarefying, and to illustrate the relative benefits of an appropriate mixture model, we created two microbiome simulation workflows based on repeated subsampling from empirical data. These workflows were organized according to Figure 2. Because the correct answer in every simulation is known, we were able to evaluate the resulting power and accuracy of each statistical method, and thus quantify the improvements one method provided over another under a given set of conditions. In both simulation types we varied the library and effect sizes across a range of levels that are relevant for recently published microbiome investigations, and followed with commonly used statistical analyses from the microbiome and/or RNA-Seq literature (Figure 2).

Simulation A

Simulation A is a simple example of a descriptive experiment in which the main goal is to distinguish patterns of relationships between whole microbiome samples through normalization followed by the calculation of sample-wise distances. Many early microbiome investigations are variants of Simulation A, and also used rarefying prior to calculating UniFrac distances [27]. Microbiome studies have often graphically represented the results of their pairwise sample distances using multidimensional scaling [44] (also called Principal Coordinate Analysis, PCoA), which is useful if the desired effects are clearly evident among the first two or three ordination axes. In some cases, formal testing of sample covariates is also done using a permutation MANOVA (e.g. vegan::adonis in R [45]) with the (squared) distances and covariates as response and linear predictors, respectively [46]. However, in this case we are not interested in creating summary graphics or testing the explanatory power of sample covariates; rather, we are interested in precisely evaluating the relative discriminating capability of each combination of normalization method and distance measure. We will use clustering results as a quantitative proxy for the broad spectrum of approaches taken to interpret microbiome sample distances.

Normalizations in Simulation A. For each simulated experiment we used the following normalization methods prior to calculating sample-wise distances.

1. DESeqVS. Variance stabilization implemented in the DESeq package [13].

2. None. Counts not transformed. Differences in total library size could affect the values of some distance metrics.

3. Proportion. Counts are divided by total library size.

4. Rarefy. Rarefying is performed as defined in the introduction, using rarefy_even_depth implemented in the phyloseq package [32], with $N_{L,min}$ set to the 15th percentile of library sizes within each simulated experiment.

5. UQ-logFC. The Upper-Quartile Log-Fold Change normalization implemented in the edgeR package [41], coupled with the top-MSD distance (see below).

Distances in Simulation A. For each of the previous normalizations we calculated sample-wise distance matrices using the following distance metrics, if applicable.

1. Bray-Curtis. The Bray-Curtis distance, first defined in 1957 for forest ecology [47].

2. Euclidean. The Euclidean distance, treating each OTU as a dimension. This has the form $\sqrt{\sum_{i=1}^{n} (K_{i1} - K_{i2})^2}$ for the distance between samples 1 and 2, with K and i as defined in the Introduction and n the number of distinct OTUs.

3. PoissonDist. Our abbreviation of PoissonDistance, a sample-wise distance implemented in the PoiClaClu package [48].

4. top-MSD. The mean squared difference of top OTUs, as implemented in edgeR [41].

5. UniFrac-u. The Unweighted UniFrac distance [23].

6. UniFrac-w. The Weighted UniFrac distance [49].
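Two of the listed distances are simple enough to compute by hand. A sketch in Python follows, with illustrative count vectors that are not data from the simulations:

```python
import numpy as np

def euclidean(k1, k2):
    # sqrt of sum over OTUs of (K_i1 - K_i2)^2, each OTU treated as one dimension
    return np.sqrt(np.sum((k1 - k2) ** 2))

def bray_curtis(k1, k2):
    # One common form: sum_i |K_i1 - K_i2| / sum_i (K_i1 + K_i2)
    return np.sum(np.abs(k1 - k2)) / np.sum(k1 + k2)

a = np.array([62.0, 38.0, 0.0])    # hypothetical OTU counts, sample 1
b = np.array([50.0, 50.0, 0.0])    # hypothetical OTU counts, sample 2
print(euclidean(a, b), bray_curtis(a, b))
```

Note that neither distance is invariant to library size, which is why the choice of normalization applied before the distance calculation matters.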

In order to consistently evaluate performance in this regard, we created a simulation framework in which there are only two templates, and each microbiome sample is drawn from one of these templates by sampling with replacement. The templates originate from the Ocean and Feces samples of the Global Patterns empirical dataset [42]. These two datasets were chosen because they have negligible overlapping OTUs, allowing us to modify the severity of the difference between the templates by randomly mixing a proportion of counts between the Ocean and Feces data prior to generating a set of samples for each simulated experiment. This mixing step allows arbitrary control over the difficulty of the sample classification task, from trivial (no mixing) to impossible (evenly mixed). Unsupervised classification was performed independently for each combination of simulated experiment, normalization method, and distance measure using partitioning around medoids (PAM [50, 51], an alternative to k-means that is considered more robust) with the number of classes fixed at two. The accuracy of the classification results was defined as the fraction of simulated samples correctly classified, with the worst possible accuracy being 50% if all samples are given a classification. Note that the rarefying procedure omits samples, so its accuracy can be below 50% under this definition.

[Figure 2 diagram. Panel A (clustering): sum the Global Patterns counts across samples for the Ocean and Feces environments; simulate mixing by adding total/m counts from Ocean to Feces, and vice versa; simulate microbiome counts by repeated sampling from these multinomials; repeat the mixture and count simulation for each replicate; perform clustering and evaluate accuracy. Panel B (differential abundance): sum counts across samples for each environment; draw 2NS samples from the same multinomial (NS simulated null samples, NS simulated test samples); randomly select 30 OTUs and multiply their counts by the effect size, m, in test samples only; repeat the simulation of samples and random effect for each replicate, and repeat for each environment type in the Global Patterns dataset; perform differential abundance tests and evaluate performance.]

Figure 2. Graphical summary of the two simulation frameworks. Both Simulation A (clustering) and Simulation B (differential abundance) are represented. All simulations begin with real microbiome count data from a survey experiment referred to here as "the Global Patterns dataset" [42]. A rectangle with tick marks and index labels (top or left) represents an abundance count matrix ("OTU table"), while a much thinner rectangle with only OTU tick marks represents a multinomial of OTU counts/proportions. In both simulation designs the variable m is used to refer to the effect size, but its meaning is different in each simulation. Small stars emphasize a multinomial or sample in which a perturbation (our effect) has been applied.
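The template-mixing and sampling scheme of Simulation A can be sketched as follows. This is one plausible reading of the mixing step ("add total/m counts from each template to the other"), using small hypothetical templates rather than the full Global Patterns OTU tables:

```python
import numpy as np

def simulate_experiment(ocean, feces, m, n_samples, library_size, rng):
    """Sketch of Simulation A's template construction: move total/m counts
    between the two templates (effect size m controls mixing severity),
    then draw multinomial samples from each mixed template."""
    mix = ocean.sum() / m                 # number of counts to swap between templates
    ocean_t = ocean + feces / feces.sum() * mix
    feces_t = feces + ocean / ocean.sum() * mix
    samples = []
    for template in (ocean_t, feces_t):
        p = template / template.sum()     # mixed template as a multinomial
        samples.append(rng.multinomial(library_size, p, size=n_samples))
    return np.vstack(samples)             # rows = samples, columns = OTUs

rng = np.random.default_rng(7)
# Hypothetical, non-overlapping 4-OTU templates standing in for the Ocean/Feces sums
ocean = np.array([900.0, 100.0, 0.0, 0.0])
feces = np.array([0.0, 0.0, 700.0, 300.0])
counts = simulate_experiment(ocean, feces, m=10, n_samples=5, library_size=2000, rng=rng)
print(counts.shape)    # five samples per template
```

Large m leaves the templates nearly disjoint (an easy clustering task); m near 1 mixes them heavily (a hard task).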

The number of samples to include for each template in Simulation A was chosen arbitrarily, after some exploration of preliminary simulations. It was apparent that the classification results from Simulation A were most informative when we included enough samples per simulated experiment to achieve reproducible results, but not so many that it was experimentally unrealistic and prohibitively slow to compute. Conversely, the preliminary classification results from Simulation A that included only a few samples per experiment presented a large variance on each performance measure that was difficult to interpret.

Simulation B

Simulation B is a simple example of microbiome experiments in which the goal is to detect microbes that are differentially abundant between two predetermined classes of samples. This experimental design appears in many clinical settings (health/disease, target/control, etc.), and in other settings for which there is sufficient a priori knowledge about the microbiological conditions, and we want to enumerate the OTUs that are different between these microbiomes, along with a measure of confidence that the proportions differ. For this class of analysis, we simulated microbiome samples by sampling with replacement from a single empirical source environment in the Global Patterns dataset. The samples were divided into two equally sized classes, target and control, and a perturbation was applied (multiplication by a defined value) to the count values of a random subset of OTUs in the target samples. Each of the randomly perturbed OTUs is differentially abundant between the classes, and the performance of downstream tests can be evaluated on how well these OTUs are detected without falsely selecting OTUs for which no perturbation occurred (false positives). This approach for generating simulated experiments with a defined effect size (in the form of a multiplicative factor) was repeated for each combination of median library size, number of samples per class, and the nine available source environments in the Global Patterns dataset. Each simulated experiment was subjected to various approaches for normalization/noise-modeling and differential abundance testing. False negatives are perturbed OTUs that went undetected, while false positives are OTUs that were labeled significantly differentially abundant by a test but were actually unperturbed, and therefore had the same expected proportion in both classes.

Normalization/Modeling in Simulation B. For each simulated experiment, we used the following normalization/modeling methods prior to testing for differential abundance.

1. Model/None. A parametric model was applied to the data, or, in the case of the t-test, no normalization was applied (note: the t-test without normalization can only work with a high degree of balance between classes, and is provided here for comparison but is not recommended in general).

2. Rarefied. Rarefying is performed as defined in the introduction, using rarefy_even_depth implemented in the phyloseq package [32], with NL,min set to the 15th percentile of library sizes within each simulated experiment.

3. Proportion. Counts are divided by total library size.
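For concreteness, the rarefying and proportion normalizations can be sketched like this. It is a minimal stdlib-only illustration of the random subsampling step that rarefy_even_depth performs; the function names here are hypothetical, not the phyloseq API.

```python
import random

def rarefy(counts, depth, seed=0):
    """Subsample a count vector without replacement down to a fixed library
    size; samples whose library is below the threshold are discarded."""
    if sum(counts) < depth:
        return None  # discarded sample
    rng = random.Random(seed)
    # Expand counts to a pool of individual reads labeled by OTU index.
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    subsampled = [0] * len(counts)
    for i in rng.sample(pool, depth):
        subsampled[i] += 1
    return subsampled

def proportions(counts):
    """Divide each count by the total library size."""
    total = sum(counts)
    return [c / total for c in counts]
```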

Testing in Simulation B. For each OTU of each simulated experiment we used the following to test for differential abundance.

1. Two-sided Welch t-test. A two-sided t-test with unequal variances, using the mt wrapper in phyloseq [32] of the mt.maxT method in the multtest package [52].

2. edgeR - exactTest. An exact binomial test (see base R's stats::binom.test) generalized for overdispersed counts [11] and implemented in the exactTest method of the edgeR package [41].

3. DESeq - nbinomTest. A Negative Binomial conditioned test similar to the edgeR test above, implemented in the nbinomTest method of the DESeq package [13]. See the subsection Testing for differential expression in Anders and Huber, 2010 [13] for the precise definition.

4. DESeq2 - nbinomWaldTest. A Negative Binomial Wald test using standard maximum likelihood estimates for GLM coefficients, assuming a zero-mean normal prior distribution, implemented in the nbinomWaldTest method of the DESeq2 package.

5. metagenomeSeq - fitZig. An Expectation-Maximization estimate of the posterior probabilities of differential abundance based on a Zero-Inflated Gaussian model, implemented in the fitZig method of the metagenomeSeq package [40].
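Of the tests above, only the Welch t-test is simple enough to sketch directly; a minimal version of the statistic and its Welch-Satterthwaite degrees of freedom follows, for illustration only (the paper uses the mt.maxT permutation implementation in multtest, not this code).

```python
from math import sqrt
from statistics import mean, variance

def welch_t(xs, ys):
    """Two-sample t statistic with unequal variances, plus the
    Welch-Satterthwaite approximation to the degrees of freedom."""
    nx, ny = len(xs), len(ys)
    vx, vy = variance(xs), variance(ys)
    se2 = vx / nx + vy / ny  # squared standard error of the mean difference
    t = (mean(xs) - mean(ys)) / sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df
```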

All tests were corrected for multiple inferences using the Benjamini-Hochberg method to control the False Discovery Rate [53]. It should be noted that the library sizes for both categories of simulation were sampled from the original distribution of library sizes in the Global Patterns dataset, and then scaled according to the prescribed median library size of each simulated experiment.
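The Benjamini-Hochberg adjustment used throughout is equivalent to the following step-up procedure, a sketch of the standard method mirroring what p.adjust(method = "BH") computes in R:

```python
def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values: each raw p-value is scaled by m/rank,
    then a cumulative minimum is taken from the largest p-value downward
    so the adjusted values stay monotone."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

An OTU is then called differentially abundant when its adjusted p-value falls below the chosen FDR threshold (0.05 in the simulations here).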

We have included in Protocol S2 the complete source code for computing the survey, simulations, normalizations, and performance assessments described in this article. Where applicable, this code includes the RNG seed so that the simulations and random resampling procedures can be reproduced exactly. Interested investigators can inspect and modify this code, change the random seed and other parameters, and observe the results (including figures). For ease of inspection, we have authored the source code in R-flavored markdown [54], through which we have generated HTML5 files for each simulation that include our extensive comments interleaved with code, results, and both intermediate and final figures. Our simulation output can be optionally modified and re-executed using the knit2html function in the knitr package. This function will take the location of the simulation source files as input, evaluate its R code in sequence, generate graphics and markdown, and produce the complete HTML5 output file that can be viewed in any modern web browser. These simulations, analyses, and graphics rely upon the cluster [55], foreach [56], ggplot2 [57], metagenomeSeq [40], phyloseq [32], plyr [58], reshape2 [59], and ROCR [60] R packages; in addition to the DESeq(2) [13], edgeR [41], and PoiClaClu [48] R packages for RNA-Seq data, and tools available in the standard R distribution [61]. The Global Patterns [42] dataset included in phyloseq was used as empirical microbiome template data. The code to perform the survey and generate Figure 1 is also included as an R Markdown source file in Protocol S2, and includes the code to acquire the data using the phyloseq interface to the microbio.me/qiime server, a function called microbio_me_qiime.


[Figure 3 shown here: panels of clustering accuracy (vertical axis, 0.6–1.0) versus effect size (horizontal axis, 1.15–3.00); panel columns: Bray-Curtis, Euclidean, PoissonDist, top-MSD, UniFrac-u, and UniFrac-w distances; panel rows: median library sizes NL = 1000, 2000, 10000; point shade/shape: DESeqVS, None, Proportion, Rarefy, and UQ-logFC normalizations.]

Figure 3. Clustering accuracy in simulated two-class mixing. Partitioning around medoids (PAM) [50, 51] clustering accuracy (vertical axis) that results following different normalization and distance methods. Points denote the mean values of replicates, with a vertical bar representing one standard deviation above and below. Normalization method is indicated by both shade and shape, while panel columns and panel rows indicate the distance metric and median library size (NL), respectively. The horizontal axis is the effect size, which in this context is an unmixed factor, the ratio of target to non-target simulated counts between two microbiomes that effectively have no overlapping OTUs (Fecal and Ocean microbiomes in the Global Patterns dataset [42]). Higher values of effect size indicate an easier clustering task. For precise definitions of abbreviations see Simulation A of the Materials and Methods section.


Results and Discussion

We performed a survey of publicly available microbiome count data, to evaluate the variance-mean relationship for OTUs among sets of biological replicates (Figure 1). In every instance the variances were larger than could be expected under a Poisson model (overdispersed, φ > 0), especially at larger values of the common-scale mean. By definition, these OTUs are the most abundant, and receive the greatest interest in many studies. For rarefied counts the absolute scales are decreased and there are many fewer OTUs that pass filtering, but overdispersion is present in both cases and follows a clear sample-wide trend. See the dispersion survey section of Protocol S2 for many more examples of overdispersed microbiome data than the three included in Figure 1. The consistent (though non-linear) relationship between variance and mean indicates that parameters of a NB model, especially φi, can be adequately estimated among biological replicates of microbiome data, despite a previous weak assertion to the contrary [39].
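Under a Poisson model the variance equals the mean, while the Negative Binomial has Var(K) = μ + φμ²; the overdispersion observed in the survey can be quantified per OTU with a simple method-of-moments estimate. The sketch below is illustrative only; edgeR and DESeq instead share strength across OTUs with shrinkage estimators of φ.

```python
from statistics import mean, variance

def moment_dispersion(replicate_counts):
    """Method-of-moments estimate of the NB dispersion phi for one OTU
    across biological replicates, solving Var = mu + phi * mu**2 for phi.
    A value of phi > 0 indicates overdispersion relative to Poisson."""
    mu = mean(replicate_counts)
    if mu == 0:
        return 0.0
    var = variance(replicate_counts)
    return max(0.0, (var - mu) / mu ** 2)
```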

In simulations evaluating clustering accuracy, we found that rarefying undermined the performance of downstream clustering methods. This was the result of omitted read counts, added noise from the random sampling step in rarefying, as well as omitted samples with small library sizes that nevertheless were accurately clustered by alternative procedures on the same simulated data (Figure 3). The extent to which the rarefying procedure performed worse depended on the effect size (ease of clustering), the typical library size of the samples in the simulation, and the choice of threshold for the minimum library size (Figure 4). We also evaluated the performance of alternative clustering methods, k-means and hierarchical clustering, on the same tasks and found similar overall results (Protocol S2).

In additional simulations we investigated the dependency of clustering performance on the choice of minimum library threshold, NL,min. We found that samples were trivial to cluster for the largest library sizes using most distance methods, even with the threshold set to the smallest library in the simulation (no samples discarded, all correctly clustered). However, at more modest library sizes typical of highly-parallel experimental designs the optimum choice of size threshold is less clear. A small threshold implies retaining more samples but with a smaller number of reads (less information) per sample; whereas a larger threshold implies more discarded samples, but with larger libraries for the samples that remain. In our simulations the optimum choice of threshold hovered around the 15th percentile of library sizes for most simulations and normalization/distance procedures (Figure 4), but this value is not generalizable to other data. Regions within Figure 4 in which all distances have converged to the same line (y = 1 − x) are regions for which the minimum library threshold completely controls clustering accuracy (all samples not discarded are accurately clustered). Regions to the left of this convergence indicate a compromise between discarding fewer samples and retaining enough counts per sample for accurate clustering.

In simulations evaluating performance in the detection of differential abundance, we found an improvement in sensitivity and specificity when normalization and subsequent tests are based upon a relevant mixture model (Figure 5). Multiple t-tests with correction for multiple inference did not perform well on this data, whether on rarefied counts or on proportions. A direct comparison of the performance of more sophisticated parametric methods applied to both original and rarefied counts demonstrates the strong potential of these methods and large improvements in sensitivity and specificity if rarefying is not used at all.

In general, the rate of false positives from tests based on proportions or rarefied counts was unacceptably high, and increased with the effect size. This is an undesirable phenomenon in which the increased relative abundance of the true-positive OTUs (the effect) is large enough that the null (unmodified) OTUs appear significantly more abundant in the null samples than in the test samples. This explanation is easily verified by the sign of the test statistics of the false-positive OTU abundances, which was uniformly positive (Protocol S2). Importantly, this side-effect of a strong differential abundance was observed rarely in edgeR performance results under TMM normalization (not shown) but not with RLE normalization (shown), and was similarly absent in DESeq(2) results. The false positive rate for edgeR and DESeq(2) was near zero under most conditions, with no obvious correlation between false positive rate and effect size. In most simulations count proportions outperformed rarefied counts due to better sensitivity, but also suffered from a higher rate of false positives at larger values of effect size (Figure 5, Protocol S2).

The rarefying normalization procedure was associated with performance costs in both sample-clustering and differential abundance statistical evaluations, enumerated in the following.


[Figure 4 shown here: panels of clustering accuracy (vertical axis, 0.6–1.0) versus library size minimum quantile (horizontal axis, 0.0–0.4); panel columns: effect sizes ES = 1.15, 1.25, 1.5; panel rows: median library sizes NL = 1000, 2000, 5000, 10000; line shading: Bray-Curtis, PoissonDist, top-MSD, UniFrac-u, and UniFrac-w distances.]

Figure 4. Normalization by rarefying only, dependency on library size threshold. Unlike the analytical methods represented in Figure 3, here rarefying is the only normalization method used, but at varying values of the minimum library size threshold, shown as library-size quantile (horizontal axis). Panel columns, panel rows, and point/line shading indicate effect size (ES), median library size (NL), and distance method applied after rarefying, respectively. Because discarded samples cannot be accurately clustered, the line y = 1 − x represents the maximum achievable accuracy.

1. Rarefied counts represent only a small fraction of the original data, implying an increase in Type-II error – often referred to as a loss of power or decreased sensitivity (Table 1). In sample-wise comparisons, this lost power is evident through two separate phenomena: (1) samples that cannot be classified because they were discarded, (2) samples that are poorly distinguishable because of the discarded fraction of the original library (Figure 4). Differential abundance analyses that include moderate to rare OTUs are even more sensitive to this loss of power, where rarefied counts perform worse in every analysis method we attempted (Figure 5, Protocol S2).

2. Rarefied counts remain overdispersed relative to a Poisson model, implying an increase in Type-I error (decreased specificity). Overdispersion is theoretically expected for counts of this nature, and we unambiguously detected overdispersion in our survey of publicly available microbiome counts (Figure 1). Estimating overdispersion is also more difficult after rarefying because of the lost information (Figure 5). In our simulations, Type-I error was much worse for rarefied counts than original counts (Figure 5, Protocol S2).

3. Rarefying counts requires an arbitrary selection of a library size minimum threshold that affects downstream inference (Figure 4), but for which an optimal value cannot be known for new empirical data [17].

4. The random aspect of subsampling is unnecessary and adds artificial uncertainty (Protocol S2, minimal example, bottom). A superior transformation (though still inadmissible) is to instead round the expected value of each count at the new smaller library size, that is ‖Kij NL,min/sj‖, avoiding the additional sampling error as well as the need to repeat the random step [24] and publish the random seed/process.
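Point 4 can be made concrete: the deterministic alternative replaces each count Kij with the rounded expected value at the smaller library size, removing the subsampling noise entirely. A minimal sketch (note that Python's round uses banker's rounding at exact halves):

```python
def scale_round(counts, target_depth):
    """Deterministic stand-in for rarefying: round the expected value of
    each count, K * target_depth / s, where s is the original library size.
    Still inadmissible (data are discarded), but there is no random step
    to repeat and no seed to publish."""
    s = sum(counts)
    return [round(k * target_depth / s) for k in counts]
```

Unlike random subsampling, the rounded counts may sum to slightly more or less than the exact target depth.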

Due to these demonstrated limitations and proven sub-optimality, we advocate that rarefying should not be used. In special cases the costs listed above may be acceptable for sample-comparison experiments in which the effect size(s) and the original library sizes are


[Figure 5 shown here: panels of AUC, specificity, and sensitivity (vertical axes) versus effect size (horizontal axis, 5–20); panel columns: DESeq2 - nbinomWaldTest, DESeq - nbinomTest, edgeR - exactTest, metagenomeSeq - fitZig, and two-sided Welch t-test; panel rows: median library sizes NL = 2000 and 50000; line darkness: 3, 5, or 10 samples per class; shade/shape: Model/None, Rarefied, and Proportion normalizations.]

Figure 5. Performance of differential abundance detection with and without rarefying. Performance summarized here by the "Area Under the Curve" (AUC) metric of a Receiver Operator Curve (ROC) [60] (vertical axis). Briefly, the AUC value varies from 0.5 (random) to 1.0 (perfect), and incorporates both sensitivity and specificity. The horizontal axis indicates the effect size, shown as the actual multiplication factor applied to the OTU abundances. Each curve traces the respective normalization method's mean performance of that panel, with a vertical bar indicating a standard deviation in performance across all replicates and microbiome templates. The right-hand side of the panel rows indicates the median library size, NL, while the darkness of line shading indicates the number of samples per simulated experiment. Color shade and shape indicate the normalization method. See Methods section for the definitions of each normalization and testing method. All P-values were adjusted for multiple hypotheses using BH [53], and a detection significance threshold of 0.05.


large enough to withstand the loss of data. Many early descriptive studies fall into this category – for example comparing functionally distinct human body sites or environments [42] – and the ability to accurately distinguish those vastly-different microbiome samples is not in question, even with rarefied counts. However, for new empirical data the effect size(s) are unknown and may be subtle; consequently, rarefying may undermine downstream analyses.

In the case of differential abundance detection, it seems unlikely that the cost of rarefying is ever acceptable. In our simulations, both rarefied counts and sample proportions resulted in an unacceptably high rate of false positive OTUs. As we described theoretically in the introduction, this is explained by differences among biological replicates that manifest as overdispersion, leading to a subsequent underestimate of the true variance if a relevant mixture model is not used. We detected overdispersion among biological replicates in all publicly available microbiome count datasets that we surveyed (Figure 1, Protocol S2). Failure to account for this overdispersion – by using proportions or rarefied counts – results in a systematic bias that increases the Type-I error rate even after correcting for multiple hypotheses (e.g. Benjamini-Hochberg [53]). In other words, if overdispersion has not been addressed, we predict many of the reported differentially abundant OTUs are false positives attributable to an underestimate of uncertainty.

In our simulations this propensity for Type-I error increased with the effect size, e.g. the fold-change in OTU abundance among the true-positive OTUs. For rarefied counts, we also detected a simultaneous increase in Type-II error attributable to the forfeited data. It may be tempting to imagine that the increased variance estimate due to rarefying could be counterbalanced by the variance underestimate that results from omitting a relevant mixture model. However, such a scenario constitutes an unlikely special case, and false positives will not compensate for the false negatives in general. In our simulations both Type-I and Type-II error increased for rarefied counts (Figure 5, Protocol S2).

Fortunately, we have demonstrated that strongly-performing alternative methods for normalization and inference are already available. In particular, an analysis that models counts with the Negative Binomial – as implemented in DESeq2 [13] or in edgeR [41] with RLE normalization – was able to accurately and specifically detect differential abundance over the full range of effect sizes, replicate numbers, and library sizes that we simulated (Figure 5). DESeq-based analyses are routinely applied to more complex tests and experimental designs using the generalized linear model interface in R [62], and so are not limited to a simple two-class design. We also verified an improvement in differential abundance performance over rarefied counts or proportions by using an alternative mixture model based on the zero-inflated Gaussian, as implemented in the metagenomeSeq package [40]. However, we did not find that metagenomeSeq's AUC values were uniformly highest, as Negative Binomial methods had higher AUC values when the number of biological replicates was low. Furthermore, while metagenomeSeq's AUC values were marginally higher than Negative Binomial methods at larger numbers of biological replicates, this was generally accompanied by a much higher rate of false positives (Figure 5, Protocol S2).

Based on our simulation results and the widely enjoyed success for highly similar RNA-Seq data, we recommend using DESeq2 or edgeR to perform analysis of differential abundance in microbiome experiments. It should be noted that we did not comprehensively explore all available RNA-Seq analysis methods, which is an active area of research. Comparisons of many of these methods on empirical [63, 64] and simulated [14, 65, 66] data find consistently effective performance for detection of differential expression. One minor exception is an increased Type-I error for edgeR compared to later methods [63], which was also detected in our results relative to DESeq and DESeq2 when TMM normalization was used (not shown) but not after switching to RLE normalization (Figure 5, Protocol S2). Generally speaking, the reported performance improvements between these methods are incremental relative to the large gains attributable to applying a relevant mixture model of the noise with shared strength across OTUs (features). However, some of these alternatives from the RNA-Seq community may outperform DESeq on microbiome data meeting special conditions, for example a large proportion of true positives and sufficient replicates [67], small sample sizes [14], or extreme values [68].

Although we did not explore the topic in the simulations here described, a procedure for further improving differential expression detection performance, called Independent Filtering [69], also applies to microbial differential abundance. Some heuristics for filtering low-abundance OTUs are already described in the


documentation of various microbiome analysis workflows [29, 30], and in many cases these can be classified as forms of Independent Filtering. More effort is needed to optimize Independent Filtering for differential abundance detection, and to rigorously define the theoretical basis and heuristics applicable to microbiome data. Ideally, a formal application of Independent Filtering of OTUs would replace many of the current ad hoc approaches, which often suffer from poor reproducibility and justification, as well as the opportunity to introduce bias.

Some of the justification for the rarefying procedure has originated from exploratory sample-wise comparisons of microbiomes, for which it was observed that a larger library size also results in additional observations of rare species, leading to a library-size-dependent increase in both alpha-diversity measures and beta-diversity dissimilarities [24, 70], especially UniFrac [71]. It should be emphasized that this represents a failure of the implementation of these methods to properly account for rare species, and not evidence that diversity depends on library size. Rarefying is far from the optimal method for addressing rare species, even when analysis is restricted solely to sample-wise comparisons. As we demonstrate here, it is more data-efficient to model the noise and address extra species using statistical normalization methods based on variance stabilization and robustification/filtering. Though beyond the scope of this work, a Bayesian approach to species abundance estimation would allow the inclusion of pseudo-counts from a Dirichlet prior that should also substantially decrease this sensitivity.

Our results have substantial implications for past and future microbiome analyses, particularly regarding the interpretation of differential abundance. Most microbiome studies utilizing high-throughput DNA sequencing to acquire culture-independent counts of species/OTUs have used either proportions or rarefied counts to address widely varying library sizes. Left alone, both of these approaches suffer from a failure to address overdispersion among biological replicates, with rarefied counts also suffering from a loss of power, and proportions failing to account for heteroscedasticity. Previous reports of differential abundance based on rarefied counts or proportions bear a strong risk of bias toward false positives, and may warrant re-evaluation. Current and future investigations into microbial differential abundance should instead model uncertainty using a hierarchical mixture, such as the Poisson-Gamma or Binomial-Beta models, and normalization should be done using the relevant variance-stabilizing transformations. This can easily be put into practice using powerful implementations in R, like DESeq2 and edgeR, that performed well on our simulated microbiome data. We have provided wrappers for edgeR, DESeq, DESeq2, and metagenomeSeq that are tailored for microbiome count data and can take common microbiome file formats through the relevant interfaces in the phyloseq package [32]. These wrappers are included with the complete code and documentation necessary to exactly reproduce the simulations, analyses, surveys, and examples shown here, including all figures (Supplementary Information File S2). This example of fully reproducible research can and should be applied to future publication of microbiome analyses [72–74].
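The Poisson-Gamma hierarchy is exactly the Negative Binomial: a Gamma-distributed rate per replicate followed by a Poisson draw yields marginal mean μ and variance μ + φμ². A stdlib-only sketch that verifies this overdispersion empirically (Knuth's Poisson sampler is an assumption of convenience here, adequate for modest rates):

```python
import math
import random

def poisson_draw(rng, lam):
    # Knuth's multiplication method; fine for the modest rates used here.
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def poisson_gamma_draw(rng, mu, phi):
    """One Negative Binomial count via the hierarchy:
    rate ~ Gamma(shape=1/phi, scale=mu*phi), then count ~ Poisson(rate).
    Marginally: E[K] = mu, Var[K] = mu + phi * mu**2."""
    rate = rng.gammavariate(1.0 / phi, mu * phi)
    return poisson_draw(rng, rate)

rng = random.Random(42)
draws = [poisson_gamma_draw(rng, mu=10, phi=0.5) for _ in range(5000)]
m = sum(draws) / len(draws)
v = sum((d - m) ** 2 for d in draws) / (len(draws) - 1)
# A Poisson sample would give v close to m; this mixture gives v near
# 10 + 0.5 * 10**2 = 60, i.e. strongly overdispersed.
```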

Acknowledgments

We would like to thank the developers of the open source packages leveraged here for improved insights into microbiome data, in particular Gordon Smyth and his group for edgeR [41], Mihail Pop and his team for metagenomeSeq [40], and Wolfgang Huber and his team for DESeq and DESeq2 [13], whose useful documentation and continued support have been invaluable. The Bioconductor and R teams [61, 75] have provided valuable support for our development and release of code related to microbiome analysis in R. We would also like to thank Rob Knight and his lab for QIIME [29], which has drastically decreased the time required to get from raw phylogenetic sequence data to OTU counts. Hadley Wickham created and continues to support the ggplot2 [57] and reshape [59]/plyr [58] packages, which have proven useful for graphical representation and manipulation of data, respectively. RStudio and GitHub have provided immensely useful and free applications that were used in the respective development and versioning of the source code published with this manuscript.


References[1] Shendure J, Lieberman Aiden E (2012) The ex-

panding scope of DNA sequencing. Naturebiotechnology 30: 1084–1094.

[2] Shendure J, Ji H (2008) Next-generation DNA se-quencing. Nature biotechnology 26: 1135–1145.

[3] Mortazavi A, Williams BA, McCue K, SchaefferL, Wold B (2008) Mapping and quantifying mam-malian transcriptomes by RNA-Seq. Nature meth-ods 5: 621–628.

[4] Pace NR (1997) A molecular view of microbial di-versity and the biosphere. Science 276: 734–740.

[5] Wilson KH, Wilson WJ, Radosevich JL, DeSan-tis TZ, Viswanathan VS, et al. (2002) High-Density Microarray of Small-Subunit RibosomalDNA Probes. Appl Environ Microbiol 68: 2535-2541.

[6] Huse SM, Dethlefsen L, Huber JA, Mark Welch D,Welch DM, et al. (2008) Exploring microbial di-versity and taxonomy using SSU rRNA hypervari-able tag sequencing. PLoS genetics 4: e1000255.

[7] Riesenfeld CS, Schloss PD, Handelsman J (2004)Metagenomics: genomic analysis of microbialcommunities. Annu Rev Genet 38: 525–552.

[8] Allison DB, Cui X, Page GP, Sabripour M (2006)Microarray Data Analysis: from Disarray to Con-solidation and Consensus. Nat Rev Genet 7: 55–65.

[9] Marioni JC, Mason CE, Mane SM, Stephens M,Gilad Y (2008) Rna-seq: an assessment of techni-cal reproducibility and comparison with gene ex-pression arrays. Genome research 18: 1509–1517.

[10] Lu J, Tomfohr JK, Kepler TB (2005) Identifyingdifferential expression in multiple SAGE libraries:an overdispersed log-linear model approach. BMCBioinformatics 6: 165.

[11] Robinson MD, Smyth GK (2007) Small-sampleestimation of negative binomial dispersion, withapplications to SAGE data. Biostatistics (Oxford,England) 9: 321–332.

[12] Cameron AC, Trivedi P (2013) Regression analy-sis of count data .

[13] Anders S, Huber W (2010) Differential expressionanalysis for sequence count data. Genome biology11: R106.

[14] Yu D, Huber W, Vitek O (2013) Shrinkage esti-mation of dispersion in Negative Binomial mod-els for RNA-seq experiments with small samplesize. Bioinformatics (Oxford, England) 29: 1275–1282.

[15] Di Bella JM, Bao Y, Gloor GB, Burton JP, Reid G(2013) High throughput sequencing methods andanalysis for microbiome research. Journal of mi-crobiological methods .

[16] Segata N, Boernigen D, Tickle TL, MorganXC, Garrett WS, et al. (2013) Computationalmeta’omics for microbial community studies.Molecular systems biology 9: 666.

[17] Navas-Molina JA, Peralta-Sanchez JM, GonzalezA, McMurdie PJ, Vazquez-Baeza Y, et al. (2013)Advancing Our Understanding of the Human Mi-crobiome Using QIIME. Methods in enzymology531: 371–444.

[18] Hughes JB, Hellmann JJ (2005) The application ofrarefaction techniques to molecular inventories ofmicrobial diversity. Methods in enzymology 397:292–308.

[19] Koren O, Knights D, Gonzalez A, Waldron L,Segata N, et al. (2013) A guide to enterotypesacross the human body: meta-analysis of mi-crobial community structures in human micro-biome datasets. PLoS computational biology 9:e1002863.

[20] Sanders HL (1968) Marine benthic diversity: Acomparative study. The American Naturalist 102:pp. 243-282.

[21] Gotelli NJ, Colwell RK (2001) Quantifying bio-diversity: procedures and pitfalls in the measure-ment and comparison of species richness. EcologyLetters 4: 379–391.

[22] Mao CX, Colwell RK (2005) Estimation ofSpecies Richness: Mixture Models, the Role ofRare Species, and Inferential Challenges. Ecology86: 1143–1153.

15

Page 16: arXiv:1310.0424v2 [q-bio.QM] 12 Dec 2013 · count proportions that vary more than expected under a Poisson model. We utilize the most popular imple-mentations of this approach currently

Rarefying Microbiome Data Is Inadmissible References

[23] Lozupone C, Knight R (2005) UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology 71: 8228–8235.

[24] Lozupone C, Lladser ME, Knights D, Stombaugh J, Knight R (2010) UniFrac: an effective distance metric for microbial community comparison. The ISME Journal.

[25] Hamady M, Walker JJ, Harris JK, Gold NJ, Knight R (2008) Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nature Methods 5: 235–237.

[26] Liu Z, DeSantis TZ, Andersen GL, Knight R (2008) Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Research 36: e120.

[27] Hamady M, Lozupone C, Knight R (2009) Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data. The ISME Journal.

[28] Yatsunenko T, Rey FE, Manary MJ, Trehan I, Dominguez-Bello MG, et al. (2012) Human gut microbiome viewed across age and geography. Nature 486: 222–227.

[29] Caporaso J, Kuczynski J, Stombaugh J, Bittinger K, Bushman F, et al. (2010) QIIME allows analysis of high-throughput community sequencing data. Nature Methods 7: 335–336.

[30] Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, et al. (2009) Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities. Applied and Environmental Microbiology 75: 7537–7541.

[31] Gilbert JA, Field D, Swift P, Newbold L, Oliver A, et al. (2009) The seasonal structure of microbial communities in the Western English Channel. Environmental Microbiology 11: 3132–3139.

[32] McMurdie PJ, Holmes S (2013) phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE 8: e61217.

[33] Charlson ES, Chen J, Custers-Allen R, Bittinger K, Li H, et al. (2010) Disordered microbial communities in the upper respiratory tract of cigarette smokers. PLoS ONE 5: e15216.

[34] Price LB, Liu CM, Johnson KE, Aziz M, Lau MK, et al. (2010) The effects of circumcision on the penis microbiome. PLoS ONE 5: e8422.

[35] Kembel SW, Jones E, Kline J, Northcutt D, Stenson J, et al. (2012) Architectural design influences the diversity and structure of the built environment microbiome. The ISME Journal 6: 1469–1479.

[36] Flores GE, Bates ST, Caporaso JG, Lauber CL, Leff JW, et al. (2013) Diversity, distribution and sources of bacteria in residential kitchens. Environmental Microbiology 15: 588–596.

[37] Kang DW, Park JG, Ilhan ZE, Wallstrom G, Labaer J, et al. (2013) Reduced incidence of Prevotella and other fermenters in intestinal microflora of autistic children. PLoS ONE 8: e68322.

[38] Segata N, Haake SK, Mannon P, Lemon KP, Waldron L, et al. (2012) Composition of the adult digestive tract bacterial microbiome based on seven mouth surfaces, tonsils, throat and stool samples. Genome Biology 13: R42.

[39] White JR, Nagarajan N, Pop M (2009) Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Computational Biology 5: e1000352.

[40] Paulson JN, Stine OC, Bravo HC, Pop M (2013) Differential abundance analysis for microbial marker-gene surveys. Nature Methods, advance online publication: 1–6.

[41] Robinson MD, McCarthy DJ, Smyth GK (2009) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England) 26: 139–140.

[42] Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, et al. (2011) Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proceedings of the National Academy of Sciences 108: 4516–4522.


[43] Wu GD, Chen J, Hoffmann C, Bittinger K, Chen YY, et al. (2011) Linking long-term dietary patterns with gut microbial enterotypes. Science 334: 105–108.

[44] Gower JC (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53: 325–338.

[45] Oksanen J, Blanchet FG, Kindt R, Legendre P, O'Hara RB, et al. (2011) vegan: Community Ecology Package. R package version 1.17-10.

[46] Anderson M (2001) A new method for non-parametric multivariate analysis of variance. Austral Ecology 26: 32–46.

[47] Bray JR, Curtis JT (1957) An Ordination of the Upland Forest Communities of Southern Wisconsin. Ecological Monographs 27: 325.

[48] Witten DM (2011) Classification and clustering of sequencing data using a Poisson model. The Annals of Applied Statistics 5: 2493–2518.

[49] Lozupone CA, Hamady M, Kelley ST, Knight R (2007) Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities. Applied and Environmental Microbiology 73: 1576–1585.

[50] Kaufman L, Rousseeuw PJ (1990) Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, chapter 2.

[51] Reynolds A, Richards G, Iglesia B, Rayward-Smith V (2006) Clustering rules: A comparison of partitioning and hierarchical clustering algorithms. Journal of Mathematical Modelling and Algorithms 5: 475–504.

[52] Pollard KS, Gilbert HN, Ge Y, Taylor S, Dudoit S (2010) multtest: Resampling-based multiple hypothesis testing. R package version 2.4.0.

[53] Benjamini Y, Hochberg Y (1995) Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B (Methodological) 57: 289–300.

[54] Allaire J, Horner J, Marti V, Porte N. The markdown package: Markdown rendering for R. R package version 0.5.4. URL http://CRAN.R-project.org/package=markdown. Accessed 2013 March 22.

[55] Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2013) cluster: Cluster Analysis Basics and Extensions. R package version 1.14.4.

[56] Analytics R (2011) foreach: Foreach looping construct for R. R package version 1.3.2.

[57] Wickham H (2009) ggplot2: elegant graphics for data analysis. Springer New York.

[58] Wickham H (2011) The split-apply-combine strategy for data analysis. Journal of Statistical Software 40: 1–29.

[59] Wickham H (2007) Reshaping data with the reshape package. Journal of Statistical Software 21: 1–20.

[60] Sing T, Sander O, Beerenwinkel N, Lengauer T (2005) ROCR: visualizing classifier performance in R. Bioinformatics (Oxford, England) 21: 3940–3941.

[61] R Development Core Team (2011) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

[62] Hastie TJ, Pregibon D (1992) Generalized linear models. In: Chambers JM, Hastie TJ, editors, Statistical Models in S, Chapman & Hall/CRC, chapter 6.

[63] Nookaew I, Papini M, Pornputtapong N, Scalcinati G, Fagerberg L, et al. (2012) A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae. Nucleic Acids Research 40: 10084–10097.

[64] Bullard J, Purdom E, Hansen K, Dudoit S (2010) Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11: 94.

[65] Sun J, Nishiyama T, Shimizu K, Kadota K (2013) TCC: an R package for comparing tag count data with robust normalization strategies. BMC Bioinformatics 14: 219.


[66] Soneson C, Delorenzi M (2013) A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 14: 91.

[67] Hardcastle TJ, Kelly KA (2010) baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11: 422.

[68] Ozer HG, Parvin JD, Huang K (2012) DFI: gene feature discovery in RNA-seq experiments from multiple sources. BMC Genomics 13 Suppl 8: S11.

[69] Bourgon R, Gentleman R, Huber W (2010) Independent filtering increases detection power for high-throughput experiments. Proceedings of the National Academy of Sciences of the United States of America 107: 9546–9551.

[70] Chao A, Chazdon RL, Colwell RK, Shen TJ (2005) A new statistical approach for assessing similarity of species composition with incidence and abundance data. Ecology Letters.

[71] Schloss PD (2008) Evaluating different approaches that test whether microbial communities have the same structure. The ISME Journal 2: 265–275.

[72] Gentleman R, Temple Lang D (2004) Statistical analyses and reproducible research. Bioconductor Project Working Papers: 2.

[73] Peng RD (2011) Reproducible research in computational science. Science 334: 1226–1227.

[74] Donoho DL (2010) An invitation to reproducible computational research. Biostatistics (Oxford, England) 11: 385–388.

[75] Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 5: R80.

[76] Lu J, Tomfohr J, Kepler T (2005) Identifying differential expression in multiple SAGE libraries: an overdispersed log-linear model approach. BMC Bioinformatics 6: 165.

[77] Rice JA (2007) Mathematical statistics and data analysis. Cengage Learning.

[78] Anscombe FJ (1948) The transformation of Poisson, binomial and negative-binomial data. Biometrika 35: 246–254.

[79] Berger JO (1985) Statistical decision theory and Bayesian analysis. Springer.

[80] Zhou YH, Xia K, Wright FA (2011) A powerful and flexible approach to the analysis of RNA sequence count data. Bioinformatics (Oxford, England) 27: 2672–2678.


Text S1: Mathematical Supplement 2013 McMurdie and Holmes

Text S1: Mathematical Supplement

A supplemental appendix of statistical mathematics supporting this article.

In this supplementary material we go over some of the statistical details pertaining to the use of hierarchical mixture models such as the Negative Binomial and the Beta-Binomial, which are appropriate for addressing additional sources of variability inherent to microbiome experimental data, while still retaining statistical power. We have concentrated our comparison efforts on the Gamma-Poisson mixture model, as some authors [76] have remarked that this approach seems to be the most statistically robust, in the sense that the presence of outliers and model misspecification does not over-perturb the results. We show how a Negative Binomial distribution can occur in different ways, leading to different parameterizations. We then show that there are transformations we can apply to these random variables such that the transformed data have a variance which is much closer to constant than the original. These variance stabilizing transformations lead to more efficient estimators and give better decision rules than those obtained via the normalization-through-subsampling method known as rarefying.

Two parameterizations of the negative binomial

In classical probability, the negative binomial is often introduced as the distribution of the number of successes in a sequence of Bernoulli trials with probability of success p before r failures occur. Thus with the two parameters r and p, the probability distribution for the negative binomial is given as

\[
X \sim \mathrm{NB}(r, p), \qquad
P(X = k) = \binom{k + r - 1}{k} (1 - p)^r p^k
         = \frac{\Gamma(k + r)}{k!\,\Gamma(r)}\, (1 - p)^r p^k .
\]

The mean of the distribution is m = pr/(1 − p) and the variance is Var(X) = pr/(1 − p)^2. Sometimes the distribution is given a different parameterization, which we use here. This takes as the two parameters the mean m and r = ((1 − p)/p) m; the probability mass function is then rewritten

\[
X \sim \mathrm{NB}(m, r), \qquad
P(X = k) = \binom{k + r - 1}{k} \left(\frac{r}{r + m}\right)^{r} \left(\frac{m}{r + m}\right)^{k}
         = \frac{\Gamma(k + r)}{k!\,\Gamma(r)} \left(\frac{r}{r + m}\right)^{r} \left(\frac{m}{r + m}\right)^{k} .
\]

The variance is Var(X) = m(m + r)/r = m + m^2/r. We will also write φ = 1/r and call this the overdispersion parameter, giving Var(X) = m + φm^2. When φ = 0 the distribution of X will be Poisson(m). This (mean = m, overdispersion = φ) parameterization is the one we use from now on.
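As a quick numerical check of the equivalence of the two parameterizations, the following Python sketch (our own illustration, not part of the article's analysis code; function names are ours) evaluates both probability mass functions and confirms the moment formulas:

```python
import math

def nb_pmf_classical(k, r, p):
    # P(X = k) = Gamma(k+r) / (k! Gamma(r)) * (1-p)^r * p^k
    return math.exp(math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
                    + r * math.log(1 - p) + k * math.log(p))

def nb_pmf_mean(k, m, r):
    # Same distribution in terms of the mean m and exponent r:
    # P(X = k) = Gamma(k+r) / (k! Gamma(r)) * (r/(r+m))^r * (m/(r+m))^k
    return math.exp(math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
                    + r * math.log(r / (r + m)) + k * math.log(m / (r + m)))

r, p = 5.0, 0.3
m = p * r / (1 - p)          # mean of the classical parameterization
phi = 1.0 / r                # overdispersion parameter

# The two forms agree probability by probability once m = pr/(1 - p).
for k in range(50):
    assert abs(nb_pmf_classical(k, r, p) - nb_pmf_mean(k, m, r)) < 1e-12

# The first two moments match m and m + phi * m^2.
mean = sum(k * nb_pmf_mean(k, m, r) for k in range(300))
var = sum(k * k * nb_pmf_mean(k, m, r) for k in range(300)) - mean ** 2
print(mean, var)
```

The example values r = 5, p = 0.3 are arbitrary; any valid pair gives the same exact agreement.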

Negative Binomial as a hierarchical mixture for read counts

In biological contexts such as RNA-seq and microbial count data, the negative binomial distribution arises as a hierarchical mixture of Poisson distributions. This is due to the fact that if we had technical replicates with the


same read counts, we would see Poisson variation with a given mean. However, the variation among biological replicates and library size differences both introduce additional sources of variability.

To address this, we take the means of the Poisson variables to be random variables themselves, having a Gamma distribution with (hyper)parameters shape r and scale p/(1 − p). We first generate a random mean, λ, for the Poisson from the Gamma, and then a random variable, k, from the Poisson(λ). The marginal distribution is:

\[
\begin{aligned}
P(X = k) &= \int_0^\infty \mathrm{Po}_\lambda(k)\,\gamma_{r,\frac{p}{1-p}}(\lambda)\, d\lambda \\
&= \int_0^\infty \frac{\lambda^k}{k!} e^{-\lambda} \times
   \frac{\lambda^{r-1} e^{-\lambda\frac{1-p}{p}}}{\left(\frac{p}{1-p}\right)^{r} \Gamma(r)}\, d\lambda \\
&= \frac{(1-p)^r}{p^r\, k!\, \Gamma(r)} \int_0^\infty \lambda^{r+k-1} e^{-\lambda/p}\, d\lambda \\
&= \frac{(1-p)^r}{p^r\, k!\, \Gamma(r)}\; p^{r+k}\, \Gamma(r+k) \\
&= \frac{\Gamma(r+k)}{k!\, \Gamma(r)}\, p^k (1-p)^r
\end{aligned}
\]
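The hierarchy can also be checked by simulation. The Python sketch below (our own illustration; the article's analyses use R) draws λ from the Gamma, then a count from Poisson(λ), and compares the empirical results against the closed-form negative binomial marginal just derived:

```python
import math
import random

def poisson(lam, rng):
    # Knuth's multiplication method; adequate for the moderate means used here.
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def nb_pmf(k, r, p):
    # Marginal derived above: Gamma(r+k) / (k! Gamma(r)) * p^k * (1-p)^r
    return math.exp(math.lgamma(r + k) - math.lgamma(k + 1) - math.lgamma(r)
                    + k * math.log(p) + r * math.log(1 - p))

rng = random.Random(0)
r, p = 4.0, 0.4
n = 100_000

# Hierarchy: lambda ~ Gamma(shape r, scale p/(1-p)), then K ~ Poisson(lambda).
draws = [poisson(rng.gammavariate(r, p / (1 - p)), rng) for _ in range(n)]

emp_mean = sum(draws) / n          # should approach pr/(1-p)
emp_p0 = draws.count(0) / n        # should approach (1-p)^r = nb_pmf(0, r, p)
print(emp_mean, emp_p0)
```

With these arbitrary example values (r = 4, p = 0.4) the empirical mean is close to pr/(1 − p) ≈ 2.67 and the empirical mass at zero is close to (1 − p)^r ≈ 0.13, matching the Negative Binomial marginal.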

Variance Stabilization

Statisticians usually prefer to deal with errors across samples, or in regression situations, which are independent and identically distributed. In particular, there is a strong preference for homoscedasticity (equal variances) across all the noise levels. This is not the case when we have unequal sample sizes and variations in the accuracy across instruments. A standard way of dealing with heteroscedastic noise is to try to decompose the sources of heterogeneity and apply transformations that make the noise variance almost constant. These are called variance stabilizing transformations.

Take for instance different Poisson variables with means µ_i. Their variances are all different if the µ_i are different. However, if the square root transformation is applied to each of the variables, then the transformed variables will have approximately constant variance¹. More generally, choosing a transformation that makes the variance constant is done by using a Taylor series expansion, called the delta method. We will not give the complete development of variance stabilization in the context of mixtures, but point the interested reader to standard texts in theoretical statistics such as [77] and one of the original articles on variance stabilization [78]. Anscombe showed that there are several transformations that stabilize the variance of the Negative Binomial depending on the values of the parameters m and r, where r = 1/φ, sometimes called the exponent of the Negative Binomial. For large m and constant mφ, the transformation

\[
\sinh^{-1} \sqrt{\frac{\left(\frac{1}{\phi} - \frac{1}{2}\right) x + \frac{3}{8}}{\frac{1}{\phi} - \frac{3}{4}}}
\]

gives a constant variance around 1/4. Whereas for m large and 1/φ not substantially increasing, the following simpler transformation is preferable:

\[
\log(x + 1)
\]

These two transformations are actually used in what is often known as a generalized logarithmic transformation, applied in microarray variance stabilizing transformations and RNA-seq normalization [13].

¹Actually, if we take the transformation x → 2√x we obtain a variance approximately equal to 1.
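The stabilizing effect of the logarithmic transformation can be seen numerically. The following Python sketch is our own illustration (it samples the Negative Binomial as a sum of geometric variables, which is valid when the exponent 1/φ is an integer): the raw variance grows rapidly with the mean, while the variance of log(x + 1) stays near φ.

```python
import math
import random

def nb_draw(m, r, rng):
    # NB with integer exponent r: sum of r geometric counts of successes
    # (success probability p = m/(m+r)) observed before each failure.
    p = m / (m + r)
    return sum(int(math.log(1.0 - rng.random()) / math.log(p)) for _ in range(r))

rng = random.Random(1)
r = 10                       # exponent 1/phi, i.e. phi = 0.1
n = 50_000
results = {}

for m in (50, 500):
    xs = [nb_draw(m, r, rng) for _ in range(n)]
    mu = sum(xs) / n
    raw_var = sum((x - mu) ** 2 for x in xs) / n
    ys = [math.log(x + 1) for x in xs]
    my = sum(ys) / n
    log_var = sum((y - my) ** 2 for y in ys) / n
    results[m] = (raw_var, log_var)
    print(m, round(raw_var), round(log_var, 3))
```

With φ = 0.1, theory gives raw variances m + φm² of about 300 at m = 50 and 25500 at m = 500, a nearly hundredfold increase, while the variance of log(x + 1) remains close to 0.1 in both cases.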


Modeling read counts

If we have technical replicates with the same number of reads s_j, we expect to see Poisson variation with mean µ = s_j u_i for each taxon i, whose incidence proportion we denote by u_i. Thus the number of reads for sample j and taxon i would be

\[
K_{ij} \sim \mathrm{Poisson}(s_j u_i)
\]

We use the notational convention that lower case letters designate fixed or observed values, whereas upper case letters designate random variables.

For biological replicates within the same group (such as treatment or control groups, or the same environments) the proportions u_i will be variable between samples. A flexible model that works well for this variability is the Gamma distribution, as it has two parameters and can be adapted to many distributional shapes. Call the two parameters r_i and p_i/(1 − p_i), so that U_{ij}, the proportion of taxon i in sample j, is distributed according to Gamma(r_i, p_i/(1 − p_i)). Thus we obtain that the read counts K_{ij} have a Poisson-Gamma mixture of different Poisson variables. As shown above, we can use the Negative Binomial with parameters m = u_i s_j and φ_i as a satisfactory model of the variability.

Now we can add to this model the fact that the samples belong to different conditions, such as treatment and control or different environments. This is done by separately estimating the values of the parameters for each of the different biological replicate conditions/classes. We will use the index c for the different conditions; the counts for taxon i and sample j in condition c then have a Negative Binomial distribution with m_c = u_{ic} s_j and φ_{ic}, so that the variance is written

\[
u_{ic} s_j + \phi_{ic} s_j^2 u_{ic}^2 \qquad (2)
\]

We can estimate the parameters u_{ic} and φ_{ic} from the data for each OTU and sample condition. This is usually best accomplished by leveraging information across OTUs, taking advantage of a systematic relationship between the observed variance and mean, to obtain high quality shrunken estimates. The end result provides a variance stabilizing transformation of the data that allows statistically efficient comparisons between conditions. This application of a hierarchical mixture model is very similar to the random effects models used in the context of analysis of variance. A very complete comparison of this particular choice of Gamma-Poisson mixture to the Beta-Binomial and nonparametric approaches can be found in [14].
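For a single OTU, the estimation step above can be illustrated with a toy method-of-moments fit. This is our own Python sketch with made-up parameter values; edgeR, DESeq and metagenomeSeq use more careful shrinkage estimators that pool information across OTUs.

```python
import math
import random

def nb_draw(mean, r, rng):
    # Sum-of-geometrics sampler for NB with integer exponent r = 1/phi.
    p = mean / (mean + r)
    return sum(int(math.log(1.0 - rng.random()) / math.log(p)) for _ in range(r))

rng = random.Random(2)
u_true, phi_true = 0.002, 0.1    # taxon proportion and overdispersion (made up)
r = round(1 / phi_true)
n = 20_000                       # replicates, unrealistically many so that the
                                 # simple moment estimates come out sharp

# Unequal library sizes s_j; counts K_j ~ NB(mean = u * s_j, overdispersion phi).
s = [rng.randrange(20_000, 60_000) for _ in range(n)]
k = [nb_draw(u_true * sj, r, rng) for sj in s]

# Moments of the scaled counts K_j / s_j:
#   E[K_j / s_j] = u,    Var(K_j / s_j) = u / s_j + phi * u^2
scaled = [kj / sj for kj, sj in zip(k, s)]
u_hat = sum(scaled) / n
var_scaled = sum((x - u_hat) ** 2 for x in scaled) / n
mean_inv_s = sum(1 / sj for sj in s) / n
phi_hat = (var_scaled - u_hat * mean_inv_s) / u_hat ** 2
print(round(u_hat, 5), round(phi_hat, 3))
```

The estimator recovers u and φ because the variance of the scaled counts decomposes into a Poisson part, u/s_j, that shrinks with library size, and an overdispersion part, φu², that does not, which is exactly the structure of equation (2).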

By comparison, procedures involving systematic downsampling (rarefying) are inadmissible in the statistical sense, because there is another procedure that dominates them under a mean squared error loss function. With a Bayesian formalism we can show that the hierarchical Bayes model gives a Bayes rule that is admissible [79].

Other mixture models

If instead of modeling the read counts one uses the proportions as the random variables, with differing variances due to different library sizes, the Beta-Binomial model is the standard approach. This has also been used for RNA-seq data [80], and the package metaStats [39] uses this model, although it does not use variance stabilizing transformations of the data.
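The Beta-Binomial analogue of the Gamma-Poisson construction can be sketched in the same way (our own Python illustration, not the metaStats implementation): a Binomial success probability is drawn from a Beta distribution, which inflates the variance relative to a plain Binomial with the same mean proportion.

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binom_pmf(k, n, a, b):
    # Binomial(n, p) marginalized over p ~ Beta(a, b):
    # P(K = k) = C(n, k) * B(k + a, n - k + b) / B(a, b)
    return math.exp(math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + log_beta(k + a, n - k + b) - log_beta(a, b))

n, a, b = 1000, 2.0, 8.0        # library size and Beta hyperparameters (made up)
p = a / (a + b)                 # mean proportion, here 0.2

probs = [beta_binom_pmf(k, n, a, b) for k in range(n + 1)]
mean = sum(k * q for k, q in enumerate(probs))
var = sum(k * k * q for k, q in enumerate(probs)) - mean ** 2

binom_var = n * p * (1 - p)     # variance without the mixture
print(mean, var, binom_var)
```

Here the Beta-Binomial variance exceeds the plain Binomial variance by roughly a factor of (a + b + n)/(a + b + 1), the extra-binomial variation that the mixture is designed to capture.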


Protocol S2: Source Code 2013 McMurdie and Holmes

Protocol S2. A zip file containing all supplementary source files. This includes the Rmd source code, HTML output, and all related documentation and code to completely and exactly recreate every results figure in this article.
