massively parallel sequencing of pooled dna

Upload: sining-chen

Post on 08-Apr-2018

228 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    1/25

    MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    SAMPLES-THE NEXT GENERATION OF MOLECULAR

    MARKERS

    Authors and affiliations

    Andreas Futschik(1,) and Christian Schlotterer(2)

    (1) : Department of Statistics, University of Vienna, Vienna, Austria(2) : Institut fur Populationsgenetik, Veterinarmedizinische Universitat Wien,Wien, Austria.() : corresponding author.

    1

    Genetics: Published Articles Ahead of Print, published on May 10, 2010 as 10.1534/genetics.110.114397

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    2/25

    2 NGS EXPERIMENTS WITH POOLED SAMPLES

    Abstract

    Next generation sequencing (NGS) is about to revolutionize genetic analysis.Currently NGS techniques are mainly used to sequence individual genomes. Dueto the high sequence coverage required, the costs for population scale analysesare still too high to allow an extension to non-model organisms. Here, we showthat NGS of pools of individuals is often more effective in SNP discovery andprovides more accurate allele frequency estimates, even when taking sequencing

    errors into account. We modify the population genetic estimators Tajimas andWattersons to obtain unbiased estimates from NGS pooling data. Given thesame sequencing effort, the resulting estimators often show a better performancethan those obtained from individual sequencing. Although our analysis also showsthat NGS of pools of individuals will not be preferable under all circumstances, itprovides a cost effective approach to estimate allele frequencies on a genome-widescale.

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    3/25

    NGS EXPERIMENTS WITH POOLED SAMPLES 3

    Next generation sequencing (NGS) is about to revolutionize biology. Througha massive parallelization, NGS provides an enormous number of reads, which per-mits sequencing of entire genomes at a fraction of the costs for Sanger sequencing.Hence, for the first time it has become feasible to obtain the complete genomicsequence for a large number of individuals. For several organisms, including hu-man, D. melanogaster and A. thaliana, large re-sequencing projects are well ontheir way. Nevertheless, despite the enormous cost reduction, genome sequencing

    on a population scale is still out of reach for the budget of most laboratories.The extraction of as much statistical information as possible at cost as low aspossible has therefore already attracted considerable interest. See for instanceJiang et al. (2009) for the modeling of sequencing errors and Erlich et al. (2009)for the efficient tagging of sequences.

    Current genome-wide re-sequencing projects collect the sequences individualby individual. In order to obtain full coverage of the entire genome and to havehigh confidence that all heterozygous sites were discovered, it is required thatgenomes are sequenced at a sufficiently high coverage. As many of the readsonly provide redundant information, cost could be reduced by a more effectivesampling strategy.

    In this report, we explore the potential of DNA pooling to provide a morecost-effective approach for SNP discovery and genome wide population genetics.Sequencing a large pool of individuals simultaneously keeps the number of redun-dant DNA reads low, and provides thus an economic alternative to the sequencingof individual genomes. On the other hand, more care has to be taken to establishan appropriate control of sequencing errors. Obviously haplotype information isnot available from pooling experiments, but this will often be outweighed by theincreased accuracy in population genetic inference.

    Focusing on biallelic loci, our analysis shows that with sufficiently large poolsizes, pooling usually outperforms the separate sequencing of individuals, bothfor estimating allele frequencies and inference of population genetic parameters.

    When sequencing errors are not too common, pooling seems also to be a goodchoice for SNP detection experiments. To avoid the additional challenges en-countered with individual sequencing of diploid individuals, we compare poolingwith individual sequencing of haploid individuals. See Lynch (2008, 2009) fora discussion of next generation sequencing of diploid individuals. Our resultsfor the pooling experiments should be also applicable to a diploid setting, as weare just merging pools of size 2 to a larger pool in this case, leading to a poolsize of n = 2nd for nd diploid individuals. In the methods section, we deriveseveral mathematical expressions that permit us to compare pooling with sep-arate sequencing of individuals. These formulas are then applied in the resultssection in order to illustrate the differences in accuracy between the approaches.

    A reader who is only interested in the actual differences under several scenariosmight therefore want to move directly to the results section.

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    4/25

    4 NGS EXPERIMENTS WITH POOLED SAMPLES

    Methods

    Throughout, we will consider an individual sequencing project where k indi-viduals are sequenced each with an expected coverage , by which we mean thatany given locus is sequenced times on average. For a comparable pooling exper-iment that involves the same amount of sequencing effort, the expected coveragewill then be k, i.e. any particular locus will be read k times on average fromthe pool consisting ofn individuals. In practice, one might for instance sequence

    each of the k individuals on a separate Illumina lane with coverage . With thesame sequencing effort, the pool could be sequenced on k lanes simultaneously,leading to a total coverage of k.

    For the convenience of the reader, we summarize our notation in Table 1.

    SNP detection. A SNP is detected at a site if the site is polymorphic, i.e. ifat least two alleles A and a are found in the sequenced sample. We will considerSNP detection both in the context of pooling experiments and for individualsequencing. To assess the performance of these two competing scenarios, we willlook both at the power and at the probability of falsely calling a SNP due tosequencing errors.

    Generally speaking, an experimental design that provides high power whilekeeping the probability of incorrectly detecting a SNP small, will be preferable.When individuals are sequenced separately, the probability of sequencing errorsbeing interpreted as true SNPs can be reduced by a sufficiently high expectedcoverage if the genotype of an individual is inferred by the majority of reads.Note, that in the case of diploid individuals, the distinction between sequencingerrors and true SNPs is significantly more complicated. In pooling experiments, asimple way to control the probability of falsely detecting SNPs both in the haploidand in the diploid case is to require a certain minimum number of reads for theminor allele in order to call a SNP. We extend work by Eberle and Kruglyak(2000) on SNP detection, and derive both the power and error rates for pooling

    experiments and for separate sequencing.Separate Sequencing of Individuals. Let MA, (Ma) denote the number oftimes allele A (a) is sequenced. Given that exactly LA = l out ofk individuals inthe sample have an allele of type A, the probability of detecting polymorphism isequal to the probability of reading at least one of the A and one of the remaininga alleles in the sample. Assuming that for each individual the number of reads ata particular locus is Poisson distributed with parameter , the probability of notcovering the SNP locus for an individual is exp(). This leads to the probability

    qc(l ,k,) := (1 [exp()]l)(1 [exp()]kl)

    for getting at least one A and one a read. Notice that for larger values of

    , this probability is nearly one, except for l = 0 or l = k, where qc(l ,k,) = 0.Suppose now that our population size N is fairly large and that the relativefrequency of allele A is p in the population. Then, by conditioning on the number

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    5/25

    NGS EXPERIMENTS WITH POOLED SAMPLES 5

    l of A alleles in the sample, the probability of detecting a SNP is approximately

    (1) q(p,k,) =k1l=1

    qc(l ,k,)

    k

    l

    pl(1 p)kl.

    For large values of , we obtain that

    (2) q(p,k,) 1 pk (1 p)k.

    We will now derive the probability of wrongly detecting a SNP due to sequencingerrors. A natural way to proceed for individual sequencing is to assume that themost frequently read base for an individual is the true one. The probability thatthis leads to the wrong decision depends on the number of reads available forthe locus under investigation, as well as the probability that a single read fora given base is incorrect and furthermore on whether the errors are independent.Concerning the dependence of the reading errors, we consider two extreme sce-narios. The first, more pessimistic scenario, assumes complete dependence suchthat sequencing errors at a given position always lead to the same incorrect base.The second assumes independent errors such that each sequencing error leads to

    an independently chosen wrong base. In this situation, we assume furthermorethat the three possible wrong bases are chosen with the same probability. Weexpect the actual error probabilities somewhere between these scenarios.

    For the dependent case, we obtain by conditioning on the (Poisson) number ofreads for an individual at a locus

    (3) q(d)e (k,,) = 1

    1

    r1

    i>r/2

    r

    i

    i(1 )(ri)

    r

    r!exp()

    k

    .

    In the independent case, an error is made by choosing one of the three incorrectbases at random, each with probability /3. The probability of falsely detecting

    a SNP is(4)

    q(i)e (k,,) = 1

    1

    r1

    3

    i>r/2

    r

    i

    (/3)i(1 (/3))(ri)

    r

    r!exp()

    k

    .

    The resulting error probabilities can be made very small by ensuring a coverage that is large enough. Obviously a more sophisticated rule will be needed whensequencing diploid individuals.Pooling Experiment. We now assume that a pooled sample of size n is se-quenced with the same expected total number k of reads per locus as for sep-

    arate sequencing. Let F(P)(b, ) =b

    i=0

    i

    i! exp() denote the probability thata Poisson random variable with parameter is at most b. Given a frequencyLA = l of A alleles in the pool, we obtain the probability of reading at least one

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    6/25

    6 NGS EXPERIMENTS WITH POOLED SAMPLES

    A and one a allele as

    (5)

    1 F(P)(0,

    lk

    n)

    1 F(P)(0,

    (n l)k

    n)

    .

    Now this leads to the probability of detecting a SNP

    (6)n1

    l=1

    1 F(P)(0,lk

    n)1 F(P)(0,

    (n l)k

    n)

    n

    lpl(1 p)nl

    which occurs with a proportion p in the population.As sequencing errors are common in NGS, they are easily confounded with low

    frequency alleles. A common strategy to reduce the high probability of sequencingerrors is to consider only SNPs that are detected in at least b reads. Requiring aminimum number b of reads in our context, the probability of detecting a SNPchanges to

    (7)n1l=1

    1 F(P)(b 1,

    lk

    n)

    1 F(P)(b 1,

    (n l)k

    n)

    n

    l

    pl(1 p)nl.

    As with individual sequencing, we again derive the probability of wrongly detect-ing a SNP under two scenarios for the sequencing errors.In the dependent scenario, the probability of wrong SNP detection equals the

    probability

    (8) p(d)e (k,,,b) = (1 F(P)(b 1,k))[1 F(P) (0, k(1 ))]

    of making at least b sequencing errors and getting at least one correct read. Ifthe expected number of reads k is fairly large, the term 1 F(P) (0, k(1 ))is very close to one and can be omitted without changing the results much. Withindependent sequencing errors, an upper bound for the probability of falselydetecting a SNP is given by

    (9) p(i)e (k,,,b) = 3

    1 F(P)(b 1,k/3)

    .

    Allele frequency inference. We consider a locus with expected relative fre-quency p in the population. Suppose first that the individuals are sequencedseparately with an expected coverage of . Then the probability that a specificlocus is read for J = j of the k individuals is

    rj,k :=

    k

    j

    (1 e)je(kj).

    Given that reads are available for J = j out of the k individuals, the relativefrequency of A alleles is Rc := MA/j. The variance of Rc can be obtained as

    Var(Rc) = Var

    MA

    J

    = E

    Var

    MA

    J|J

    + Var

    E

    MA

    J|J

    .

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    7/25

    NGS EXPERIMENTS WITH POOLED SAMPLES 7

    Now given J, MA is binomial B(J,p) distributed and Var

    MAJ |J

    = p(1 p)/J.

    This leads to

    E

    Var

    MA

    J|J

    = E(1/J)p(1 p).

    Furthermore E[ MAJ

    |J] = p and therefore Var

    E[ MAJ

    |J]

    = 0. Together

    Var(Rc) = p(1 p)E(1/J) p(1 p)/k.

    We now turn to the pooling experiment, assuming again a population propor-tion, p, ofA alleles. With LA again denoting the number of A alleles in a pooledsample of size n, we assume MA (Ma) reads of the A (a) allele from this sample.This leads to M = MA + Ma reads for the site under investigation.

    The relative frequency of the A allele estimated from the sample is then givenas Rp = MA/M. According to our model M is Poisson Pois(k), and with U =(M, LA), MA|U is binomial B(M, LA/n). We again decompose the variance into

    Var(Rp) = Var

    MAM

    = E

    Var

    MAM

    |U

    + Var

    E

    MAM

    |U

    .

    Now Var MAM |U = 1M LAn nLAn and E MAM |U = LAn . Together, we obtainVar(Rp) = E

    1

    M

    n 1

    np(1 p) +

    p(1 p)

    n.

    In order to see which experimental setup leads to the smaller variance, weconsider the ratio

    (10)Var(Rp)

    Var(Rc)=

    E

    1M

    n1

    n+ 1

    n

    E(1/J).

    It is convenient that the ratio does not depend on the population proportion p ofAalleles anymore. For a large enough expected coverage we get E(1/J) 1/k andE (1/M) 1/(k). Notice that the variance for the pooling experiment increases

    when individuals contribute unequal amounts of probe material. According toour simulations shown in the results section however, this variance componentcan be kept small by choosing pools of large enough size.

    Allele frequency estimators for pooled samples that also take into accountquality scores of the individual reads have been discussed in Holt et al. (2009).The computation of variances for these estimators would depend on the specificassumptions of a probability model for the quality scores.

    Estimating population genetic parameters. Two widely used summary sta-tistics in population genetics are Tajimas and Wattersons . We investigatethe influence of the two sequencing strategies on the accuracy of these summary

    statistics. According to our simulations, both summary statistics show a sig-nificantly smaller variance for pooled samples. However, in particular for smallpools, the estimators show some bias. The reason for the bias is that multiple

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    8/25

    8 NGS EXPERIMENTS WITH POOLED SAMPLES

    reads of the same sequence are entering the normalizing constant as indepen-dently sampled sequences, if the estimators are computed in a standard way forpooled samples. Sequencing errors also lead to bias, and if a minimum minorallele frequency is required in order to make sequencing errors rare, this needs tobe taken into account. For individual sequencing, the effect of omitting single-tons has been studied by Knudsen and Miyamoto (2009) as well as Achaz (2008).Based on the expected values of Tajimas and Wattersons , we introduce

    modified normalizing constants that make the resulting estimators unbiased un-der neutrality. These bias corrected estimators will then be compared with thoseobtained from individual sequencing. (See the RESULTS section.)

    We first derive a bias correction for Tajimas and start by considering alocus for which M reads are available. We do not consider sequencing errors forthe moment, and focus on the bias that is caused by possibly reading the samesequence more than once. Let ij denote the number of differences between thesequences i and j at this locus that are selected randomly with replacement fromthe pool ofn individuals. Now for this locus

    E = Ei=j

    ij /M

    2 = EIJ

    = E[IJ|I = J]P(I = J) + E[IJ|I = J]P(I = J)

    = 0 + P(I = J)

    = n 1

    n.(11)

    Therefore nn1

    will be unbiased, if we neglect sequencing errors. Since thisbias correction only depends on the size n of the pool and not on the coverageby reads, a bias corrected version of Tajimas for the entire sequence can beobtained by adding up individual values of ,l for all loci and then multiplying

    by nn1 , leading to =

    nn1

    l ,l.

    In order to also correct for sequencing errors, two approaches seem feasible. Ifan unbiased estimate for the sequencing errors is available, such an estimate couldbe used to correct . Analogous to Achaz (2008, equation (1)) for the standard

    experimental setup, 2n

    n1 err will be unbiased, if err is an unbiased estimateof the number of reading errors per sequence. Introducing err will obviously addto the variance of the resulting estimator and the overall performance will dependon the accuracy of err. Another way to take into account sequencing errors isto require a minimum minor allele frequency b for including a segregating sitein the analysis, and to ignore sequencing errors subsequently. The idea is that

    sequencing errors will be rare if b is sufficiently large.Again, we first consider a locus for which the coverage is equal to M. Let (b)

    denote the version of Tajimas where the minor allele frequency is required to

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    9/25

    NGS EXPERIMENTS WITH POOLED SAMPLES 9

    be at least b. Notice that (b) = for b = 1. With Km denoting the number of

    sites where the derived allele A has frequency m, (b) may be written as

    (b) =

    M

    2

    1 Mbm=b

    Kmm(M m)

    for a locus for which M reads are available (see section 1.4 in Durrett (2008)).Let cn = 1/n1i=1 i1, and let furthermore XM denote the number of A allelesamong the reads, and Yn the number of A alleles in the pool. Then

    P(XM = m|Yn = r) =

    M

    m

    (r/n)m(1 r/n)Mm

    and under neutrality P(Yn = r) = r1/cn. With cn being the expected number

    of segregating sites in the pool,

    (12) E((b) ) =

    M

    2

    1cn

    Mbm=b

    n1r=1

    m(M m)P(XM = m|Yn = r)P(Yn = r)

    For b = 1, straightforward calculations reproduce (11), i.e.

    E((1) ) = n 1n .

    For b > 1 the sum does not simplify much, but can be computed and turned intothe bias correction factor

    M

    2

    [Mbm=b

    n1r=1

    m(M m)P(XM = m|Yn = r)r1]1.

    However, an accurate approximation for (12) can be obtained by assuming thatn is large compared to M. In this case

    n1

    r=1

    P(XM = m|Yn = r)P(Yn = r) c1n

    1

    mfor 1 m M 1 and therefore

    (13) E((b) ) M 2b + 1

    M 1.

    For b > 1, the resulting simple bias correction factor M1M2b+1

    turns out to providevery good approximations, even if the pool size n is only moderately larger thanthe number of reads M. Indeed, if singletons are omitted (b = 2), then the relativeerror is only 0.4% when M = 10 and n = 20. For n = 200 and M = 50, the errordrops to 0.02% for b = 2 and 4 105% for b = 3. Summarizing, we propose thefollowing bias corrected version of Tajimas :

    (14) (b) =

    n

    n1 for b = 1,

    M2b+1M1

    (b) for b > 1.

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    10/25

    10 NGS EXPERIMENTS WITH POOLED SAMPLES

    To obtain an overall estimate based on L loci with possibly unequal coverage Ml(1 l L), simply take the sum over the individually bias corrected estimates

    (15) (b) =L

    l=1

    (b),l .

    Dividing (b) by the total length of the considered sequence, an estimator for the

    scaled mutation parameter per base results.We now derive a bias correction for Wattersons estimator, again first focusingon a locus with coverage M. We consider a version of Wattersons estimator thatrequires a minimum minor allele frequency b. For b = 1 we use all segregatingsites, and versions that protect against sequencing errors can be obtained bychoosing b > 1. Let Sb denote the number of segregating sites found in the Msequence reads from the pool for which the minor allele frequency is at least b.Then

    (16) (b)W :=

    Sb

    M1i=1 1/i

    provides protection against sequencing errors. if b is large enough. Analogous to(12), we obtain that conditional on the number of reads M for the locus

    (17) E((b)W |M) =

    cncM

    [Mbm=b

    n1r=1

    P(XM = m|Yn = r)P(Yn = r)].

    Let F(B)(x,M,p) denote the probability that a binomial random variable X sat-isfies P(X x) for M trials with success probability p. In particular for p = r/n,

    F(B)(x,M,r/n) =x

    i=0

    M

    i

    rn

    i 1

    r

    n

    Mi.

    Recall furthermore that cM =

    M1i=1 i

    1. Then a bias corrected version of (b)Wfor b 1 is given as

    (18) (b)W =

    (b)W cMn1

    r=1 [F(B)(M b,M,r/n) F(B)(b 1,M,r/n)]1r

    .

    As with Tajimas , (b)W can be easily adapted to work with longer sequences.

    For this purpose, partition the sequence into L loci such that for each locus aconstant number of reads Ml is available, and obtain the bias corrected Watterson

    estimate (b)W,l separately for each locus l. Then

    (19) (b)W =

    Ll=1

    (b)W,l ,

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    11/25

    NGS EXPERIMENTS WITH POOLED SAMPLES 11

    provides an estimate of the overall scaled mutation parameter. Dividing (b)W by

    the total length of the considered sequence, an estimator for the scaled mutationparameter per base results.

    Results

    SNP Detection.

    For many biological applications SNP genotyping provides a cost effective ap-proach, and SNP discovery is the first step required. We compared the efficiencyof SNP discovery using an approach in which each individual is sequenced sepa-rately with a pooling approach. The panels of Figure 1 show that the comparativeefficiency of pooling depends both on the expected coverage and on the minimumnumber of reads for allele calling used for error protection. While pooling experi-ments provide a higher probability of SNP detection in most cases, it is expectedto be less efficient, if both the coverage is small and a a high minimum numberof reads is required. This is not entirely unexpected, since an increased numberof reads required for the inference of the minor allele reduces the probability ofdetecting SNPs in a pooling experiment. The higher the expected coverage, the

    more inefficient individual sequencing becomes. As long as not chosen too small,the size of the pool seems to play a less important role. Figure 2 addresses theproblem of wrongly identifying a sequencing error as a SNP. Irrespective of theassumed model of sequencing errors (see methods section for further details), ahigh probability of sequencing errors makes SNP calling from pools highly unre-liable. On the other hand, if sequencing error rates are reduced (e.g. by qualityfiltering), a suitable lower bound on the minimum allele frequency for detectinga SNP makes pooling very reliable for the identification of SNPs. Interestingly,in some cases, we found pooling to result in fewer erroneous SNP calls thanindividual sequencing.Allele frequency inference.

    In population genetics, the allele frequency spectrum is of central interest. Es-timating the allele frequency spectrum of a population is subject to samplingvariation. In an individual based sequencing strategy, most of the sampling vari-ation comes from the selection of individuals used for DNA sequencing. Theadvantage of the pooling approach is that this sampling error can be dramati-cally reduced by including a large number of individuals in the pool. On theother hand, a second level of sampling error arises in the pooling approach fromthe fact that not all chromosomes in the pool are sequenced and some chromo-somes may be sequenced more than once. We start by discussing the situationwhere individuals contribute equal amounts of probe material and refer to thelast paragraph of the section for the case when this assumption is violated.

    In the methods section, we obtained expression (10) for the ratio of the vari-ances of the estimated relative allele frequency both for a pooling experiment Rp(pool size n), and a classical experiment with individual sequencing Rc. For a

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    12/25

    12 NGS EXPERIMENTS WITH POOLED SAMPLES

    large enough expected coverage and with k individuals sequenced, this equa-tion can be approximated by the following quick rule of thumb: Pooling will leadto a smaller variance for those experimental setups that satisfy 1/ + kn < 1 orequivalently n/(n k) < . Thus a case where pooling provides a better estimateof the allele frequency is when the pool contains more than twice the numberof separately sequenced individuals and the coverage per separately sequencedindividual is at least two. For larger pools smaller values of will be sufficient.

    So far we compared the individual based and pooling strategy only for the samenumber of sequenced reads. Alternatively, the superiority of the pooling approachcould be expressed by the reduction of sequencing costs. Figure 3 compares thepooling approach to sequencing of individuals when both methods provide thesame accuracy for allele frequency estimates. Suppose that k individuals aresequenced separately, each at an expected coverage . Then k indicates thecost in single genome sequencing equivalents that results in the same accuracyas sequencing k genomes individually. If, for instance, k = 20 and k = 10, thenpooling would give the same accuracy with half the sequencing effort, correspond-ing to an individual sequencing project with 10 instead of 20 individuals. Figure 3clearly indicates that larger pool sizes increase the advantage of sequencing pools.

    A higher sequence coverage () for sequencing of individuals further improves thecost effectiveness of pooling.In genome-wide association studies (GWAs), the association between allele

    frequencies and traits (diseases) is investigated. A possible approach is to testwhether alleles have different frequencies in two pools that differ with respect tothe trait of interest, (see Sham et al. (2002)). Since the ratio of variances (10)does not depend on the allele frequencies in the sub-populations, the standarddeviation entering the test statistic will differ by the square root of (10) betweena pooling and a classical experimental setup. If the square root of (10) is 1/2(say), the shift of the expected value of the test statistic under the alternativewill be twice as large in a pooling experiment: Overall pooling will be the more

    powerful approach, whenever the variance ratio is smaller than one (see (10)). Itshould be noted however that the variance of the pooling experiment will becomelarger if individuals contribute unequal amounts of probe material. This issuewill be addressed in the last paragraph of this section.Estimating population genetic parameters.

    We now compare the estimation of the scaled mutation parameter using Watter-sons and Tajimas under our two experimental setups. For this purpose, wesimulated 100 samples under neutrality with mutation parameter = 10, usingthe ms software (Hudson, 2002).

    (./ms 500 100 -t 10 > ms.out)

    where 500 is the number of sequences generated. For separate sequencing, we tookrandom sub-samples of size k = 10 from each sample, thus simulating separatesequencing of 10 individuals each with an expected number of reads. With

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    13/25

    NGS EXPERIMENTS WITH POOLED SAMPLES 13

    pooling, we took samples of size n out of the 500 simulated sequences. Fromthis pool, reads were taken independently for each locus l by making a randomnumber of draws Ml with replacement. The quantities Ml were chosen accordingto a Poisson distribution with expected value k.

    For Tajimas , we used the bias correction (14) for individual loci and addedthe estimates across loci using (15)). For Wattersons , the bias has been cor-rected using formula (18) for each locus.

    Neglecting sequencing errors for the moment, it turns out that the poolingapproach with bias correction leads to more accurate estimates of and , pro-vided that the size of the pool is large enough. For small pools, multiple reads ofthe same chromosome become more common, which affects the accuracy of theestimates negatively. (Figure 5.)

    We now investigate the pooling approach when including a protection againstsequencing errors by removing all segregating sites where the minor allele hasfrequency x satisfies x = 1 or alternatively x 2. Again, the normalizingconstants have been adapted in order to avoid bias. Let b denote the minimumrequired minor allele frequency.

    Figure 5 shows the relative advantage of pooling conditional on different min-

    imum minor allele frequencies. Pooling still leads to a decreased variance underneutrality as long as the pool size is large enough. Not unexpectedly, the reduc-

    tion in variance is now somewhat smaller for Wattersons (b)W . The increase in

    the variance of Tajimas (b) is much smaller, since frequency one minor alleles

    receive a low weight in the calculation of .Unequal amounts of probe material. One obvious source of error in thepooling approach is the heterogeneity in DNA amounts due to measurementerrors. In experiments that rely on PCR amplification, the heterogeneity canbe expected to be particularly strong.

    Individuals for which a larger DNA amount has been included in the DNApool, will be over-represented, which potentially causes a change in allele fre-

    quency estimates. This affects the bias and the variance also for our consideredpopulation genetic summary statistics.

    To investigate the sensitivity of population genetic estimates based on pool-ing experiments, we simulated a scenario involving unequal amounts of probematerial. We set the expected amount of probe material to one, and allowed forlog-normally distributed multiplicative deviations from this expected value. Morespecifically, the deviation factors have been chosen independently for each indi-vidual contributing to the pool according to exp(Xi), where Xi (1 i n) arenormal N(0, log(scale)) random variables. Thus the median amount of probe ma-terial is always equal to one. If the deviation factor has a value of exp(Xi) = 1.5,this means that the respective individual will have a 50% higher chance of being

    sequenced, than another with a factor of exp(Xi) = 1. Similarly, a value of 0.8means a 20% decreased chance of being read.

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    14/25

    14 NGS EXPERIMENTS WITH POOLED SAMPLES

    As our first scenario (scale = 2), slightly more than 30% of all individualsdiffered at least twofold from the median. In other words, for a pool of sizen = 100, the most abundant individual contributed about sixteen times the probematerial of the least abundant individual. We also simulated a more extremescenario (scale = 8) where about 30% of the individuals differed at least eightfoldfrom the median. As further parameters we chose = 30, k = 10, n [5, 200].

    As the amount of heterogeneity in the sample will usually be unknown, we

    applied the same bias correction as for equal amounts of probe material. Wemeasured the deviation from the true by the mean squared error, as this ac-counts for bias and variance.

    Figure 6 displays the effect of heterogeneity in probe material on the bias andthe variance of Tajimas and Wattersons . Although both bias and variancechange noticably for higher levels of heterogeneity, these effects cancel out toa large extent. Thus the overall performance measured in terms of the meansquared error

    (20) MSE = Bias2 + Var

    changes only marginally even for a large level of heterogeneity (scale = 8), see

    Figure 7. This effect can be explained by shrinkage that leads to improved es-timates of the mutation parameter by permitting for some bias (Futschik andGach, 2008). Interestingly, even for a high level of heterogeneity in probe mate-rial (scale =8), the performance (measured in terms of the MSE) changes onlymarginally.

    Heterogeneity in probe material also affects the accuracy of the estimated allelefrequencies, as the variance of the estimator based on a pooled sample becomeslarger. However, this effect can be kept small, by choosing a pool of a largeenough size. This is illustrated in Figure 8, where it can be seen that poolingleads for large enough pool sizes eventually to smaller variances even for a highlevel of heterogeneity in probe material (scale =8).

    Discussion

    Over the past decades we have been witnessing a continuous turnover of molec-ular markers used in genetical research. To a large extent this turnover has beendriven by the advances in molecular biology and technology. With the arrival ofthe second generation sequencing technologies this race is about to come to anend - rather than relying on a more or less representative fraction of the genome,it has come into reach to have full genomic sequences available for multiple indi-viduals.

    With further technological advances, it is anticipated that it will become pos-

    sible to sequence individual genomes at a cost that allows even small laboratoriesto perform population analyses on a genome scale. Currently, this is not pos-sible as the costs are still too high. In this study, we showed that sequencing

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    15/25

    NGS EXPERIMENTS WITH POOLED SAMPLES 15

    pools of individuals provides an excellent alternative that permits genome widepolymorphism surveys at very moderate costs.

    This is the first report systematically exploring the parameter range for whichDNA pooling provides an advantage compared to individual genome sequencing.

    Our result that NGS of DNA pools often provides a reliable and cost effectivemean for genome-wide allele frequency estimates, is supported by some recentstudies using NGS to analyze DNA pools of selected genomic regions. (Van Tas-

    sell et al., 2008) sequenced a complexity reduced DNA pool using the IlluminaGenome Analyzer. For a subset of the identified SNPs, they compared the allelefrequency estimates from the Illumina sequencing to those obtained by genotypingthe same individuals. Despite that SNP frequency estimates were undoubtedlyaffected by a substantial assignment error (Palmieri and Schlotterer, 2009) dueto the short reads and the complexity reducing procedure, (Van Tassell et al.,2008) observed a correlation of 0.67 between the two methods. Hence, there isvery little doubt that NGS is an effective tool to provide accurate genome-wideallele frequency estimates from DNA pools.

    We anticipate that the analysis of DNA pools provides a wide range of ap-plications. In population genetics, it will be possible to compare patterns ofdifferentiation on a genomic scale. Thus, patterns of local adaptation and het-erogeneity in gene flow among different genomic regions can be identified. Alsofor association mapping DNA pools are very powerful (Sham et al., 2002). In con-trast to SNP arrays, however, re-sequencing of DNA pools will always include thecausative SNP and thus provide a higher statistical power. Our study providesthe basis for an adequate experimental design of future pooling experiments.

    ACKNOWLEDGEMENTS

    This work has been supported by a WWTF grant to AF and CS as well as FWF

    Grants (P 19832-B11, L403-B11) awarded to CS. Special thanks to C. Kosiol, N.de Maio, and R. Kofler for helpful comments on earlier versions of the manuscript.We are grateful to A. Vasemagi and J. Wolf on general discussions about poolingfor NGS, and also thank the reviewers for helpful comments.

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    16/25

    16 NGS EXPERIMENTS WITH POOLED SAMPLES

    References

    Achaz, G., 2008. Testing for neutrality in samples with sequencing errors. Genet-ics 179, 14091424.

    Durrett, R., 2008. Probabiliy models for DNA Sequence Evolution. Springer, NewYork.

    Eberle, M., Kruglyak, L., 2000. An analysis of strategies for discovery of single-nucleotide polymorphisms. Genet. Epidemiol. 19, S29S35.

    Erlich, Y., Chang, K., Gordon, A., Ronen, R., Navon, O., Rooks, M., Han-non, G. J., 2009. Dna sudoku-harnessing high-throughput sequencing for mul-tiplexed specimen analysis. Genome Research 19, 12431253.

    Futschik, A., Gach, F., 2008. On the inadmissibility of wattersons estimate.Theoretical Population Biology 73, 212221.

    Holt, K., Teo, Y., Li, H., Nair, S., Dougan, G., Wain, J., Parkhill, J., 2009.Detecting snps and estimating allele frequencies in clonal bacterial populationsby sequencing pooled dna. Bioinformatics 25, 20742075.

    Hudson, R. R., 2002. Generating samples under a wright-fisher neutral model ofgenetic variation. Bioinformatics 18, 337338.

    Jiang, R., Tavare, S., Marjoram, P., 2009. Population genetic inference from

    resequencing data. Genetics 181, 187197.Knudsen, B., Miyamoto, M. M., 2009. Accurate and fast methods to estimate

    the population mutation rate from error prone sequences. BMC Bioinformatics10:247, doi:10.1186/1471210510247.

    Lynch, M., 2008. Estimation of nucleotide diversity, disequilibrium coefficients,and mutation rates from high-coverage genome-sequencing projects. Mol. Biol.Evol. 25, 24212431.

    Lynch, M., 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182, 295301.

    Palmieri, N., Schlotterer, C., 2009. Mapping accuracy of short reads from mas-sively parallel sequencing and the implications for quantitative expression pro-

    filing. PloS one 4, e6323+.Sham, P., Bader, J. S., Craig, I., M., O., M., O., 2002. Dna pooling: A tool for

    large-scale association studies. Nature Rev. Genet. 3, 862871.Van Tassell, C. P. P., Smith, T. P. L. P., Matukumalli, L. K. K., Taylor, J.

    F. F., Schnabel, R. D. D., Lawley, C. T. T., Haudenschild, C. D. D., Moore, S.S. S., Warren, W. C. C., Sonstegard, T. S. S., 2008. Snp discovery and allelefrequency estimation by deep sequencing of reduced representation libraries.Nat. Methods.

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    17/25

    NGS EXPERIMENTS WITH POOLED SAMPLES 17

    Table 1. Description of our Notation

    Symbol or Notation Descriptionk number of haploid individuals used for separate sequencing expected number of times a locus is read for an individual using

    separate sequencingn size of the pool in a pooling experimentJ random number of individuals for which reads are actually available

    at a particular locus with individual sequencing (J k)M random number of reads for a particular locus in a pooling experi-

    ment (E(M) = k)p relative frequency of the allele of interest in the populationF(P)(b, ) Poisson cumulative distribution function (F(P)(b, ) =b

    i=0i

    i!exp())

    F(B)(x,M,p) binomial cumulative distribution function (F(B)(x,M,p) =xi=0

    Mi

    pi(1 p)Mi.

    (b) bias corrected version of Tajimas for a pooling experiment when

    the minor allele frequency is required to be at least b. For b = 1,(b) = .

    (b)W bias corrected version of Wattersons for a pooling experiment

    when the minor allele frequency is required to be at least b 1.

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    18/25

    18 NGS EXPERIMENTS WITH POOLED SAMPLES

    0.0 0.1 0.2 0.3 0.4 0.5

    0.0

    0.2

    0.4

    0.6

    0.8

    1.0

    A: =5, n=50

    freq in pop.

    probofSNPd

    etection

    0.0 0.1 0.2 0.3 0.4 0.5

    0.0

    0.2

    0.4

    0.6

    0.8

    1.0

    A: =5, n=200

    freq in pop.

    probofSNPd

    etection

    0.0 0.1 0.2 0.3 0.4 0.5

    0.0

    0.2

    0.4

    0.6

    0.8

    1.0

    B: =10, n=50

    freq in pop.

    probofSNPd

    etection

    0.0 0.1 0.2 0.3 0.4 0.5

    0.0

    0.2

    0.4

    0.6

    0.8

    1.0

    B: =10, n=200

    freq in pop.

    probofSNPd

    etection

    0.0 0.1 0.2 0.3 0.4 0.5

    0.0

    0.2

    0.4

    0.6

    0.8

    1.0

    C: =20, n=50

    freq in pop.

    probofSNPd

    etection

    0.0 0.1 0.2 0.3 0.4 0.5

    0.0

    0.2

    0.4

    0.6

    0.8

    1.0

    C: =20, n=200

    freq in pop.

    probofSNPd

    etection

    Figure 1. Probability of detecting a SNP with relative minor al-lele frequency p in the population when a certain minimum numberof reads is required as a detection threshold. The colored lines in-dicate the probabilities for sequencing experiments using a pooledsample (purple-dashed: no error correction, red-dotted: minor al-lele frequency (m.a.f.) at least 2, blue-dash-dot: m.a.f. 4, greenlong dashed: : m.a.f. 6). Solid black line: Experiment where

    k = 10 haploid individuals are sequenced separately. Expectedcoverage = 5(A), 10(B), 20(C) per individual. For pooling ex-periments, the expected total coverage is k. Pool sizes are either50 (left column) or 200 (right column).

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    19/25

    NGS EXPERIMENTS WITH POOLED SAMPLES 19

    4.0 3.5 3.0 2.5 2.0 1.5 1.0

    8

    6

    4

    2

    0

    A: =5, errors dependent

    log_10 sequencing error prob.

    l

    og_

    10

    prob.

    offalseSNPc

    alling

    4.0 3.5 3.0 2.5 2.0 1.5 1.0

    8

    6

    4

    2

    0

    A: =5, errors i.i.d.

    log_10 sequencing error prob.

    l

    og_

    10

    prob.

    offalseSNPc

    alling

    4.0 3.5 3.0 2.5 2.0 1.5 1.0

    8

    6

    4

    2

    0

    B: =10, errors dependent

    log_10 sequencing error prob.

    log_

    10

    prob.

    offalseSNPc

    alling

    4.0 3.5 3.0 2.5 2.0 1.5 1.0

    8

    6

    4

    2

    0

    B: =10, errors i.i.d.

    log_10 sequencing error prob.

    log_

    10

    prob.

    offalseSNPc

    alling

    4.0 3.5 3.0 2.5 2.0 1.5 1.0

    8

    6

    4

    2

    0

    C: =20, errors dependent

    log_10 sequencing error prob.

    log_

    10

    prob.

    offalseSNPc

    alling

    4.0 3.5 3.0 2.5 2.0 1.5 1.0

    8

    6

    4

    2

    0

    C: =20, errors i.i.d.

    log_10 sequencing error prob.

    log_

    10

    prob.

    offalseSNPc

    alling

    Figure 2. Log-probability of falsely detecting a SNP at a non-segregating site, in dependance on the logarithm of the sequencingerror probability. The colored lines indicate the probabilities forsequencing experiments using a pooled sample (purple-dashed: noerror correction, red-dotted: minor allele frequency (m.a.f.) at least2, blue-dash-dot: m.a.f. 4, green long dashed: : m.a.f. 6).Solid black line: Experiment where k = 10 haploid individuals are

    sequenced separately and the most frequently read base at a po-sition is chosen for the sequenced individual. Expected coverage = 5(A), 10(B), 20(C) per individual. For pooling experiments,the expected total coverage is k. Since the pool size is not rele-vant in this context, we plot results for completely dependent (leftcolumn) and independent sequencing errors (right) instead. Seethe methods section for a more detailed description of these sce-narios.

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    20/25

    20 NGS EXPERIMENTS WITH POOLED SAMPLES

    ooooo

    oooo

    oooo

    ooo

    oo

    oo

    oo

    oo

    oo

    o

    o

    o

    o

    o

    o

    o

    o

    o

    o

    5 10 15 20 25 30 35 40

    0

    10

    20

    30

    40

    (a) lambda=5

    k

    correspondingk

    withpooling

    ++++++

    +++++

    ++++

    ++++

    ++++

    ++++

    ++++

    +++

    ++

    xxxxxx

    xxxxxx

    xxxxxx

    xxxxxx

    xxxxxx

    xxxxxx

    ooooooooo

    oooooo

    oooo

    oooo

    oo

    oo

    oo

    oo

    o

    o

    o

    o

    o

    5 10 15 20 25 30 35 40

    0

    10

    20

    30

    40

    (b) lambda=10

    k

    correspondingk

    withpooling

    ++++++++++

    ++++++++

    +++++++

    ++++++

    +++++

    xxxxxxxxxxxx

    xxxxxxxxxxx

    xxxxxxxxxx

    xxx

    Figure 3. Sequencing effort k of a pooling experiment in or-

    der to get allele frequency estimates with the same accuracy as ina standard experimental setup where k individuals are sequencedseparately. (o: pool size n = 50, +: n = 100, x: n = 500.)

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    21/25

    NGS EXPERIMENTS WITH POOLED SAMPLES 21

    0 50 100 200

    0

    2

    4

    6

    8

    10

    Wattersons theta

    n

    meanestimate

    0 50 100 200

    0

    2

    4

    6

    8

    10

    Tajimas pi

    n

    meanestimate

    Figure 4. Expected value of the estimates obtained from pooled

    samples depending on the pool size n: Wattersons and Tajimas. True value = 10 (green line). There is a considerable bias,if n is small compared to k, illustrating the need to use a biascorrection with the estimates. Solid black line: = 30; red dashedline: = 5. (For Tajimas , the bias does not depend on .)

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    22/25

    22 NGS EXPERIMENTS WITH POOLED SAMPLES

    0 50 100 150 200

    0.4

    0.8

    1.2

    1.6

    Wattersons theta, m.a.f>=1

    n

    varianceratio

    0 50 100 150 200

    0.4

    0.8

    1.2

    1.6

    Tajimas pi, m.a.f.>=1

    n

    varianceratio

    0 50 100 150 200

    0.4

    0.8

    1.2

    1.6

    Wattersons theta, m.a.f.>=2

    n

    varianceratio

    0 50 100 150 200

    0.4

    0.8

    1.2

    1.6

    Tajimas pi, m.a.f.>=2

    n

    varianceratio

    0 50 100 150 200

    0.4

    0.8

    1.2

    1.6

    Wattersons theta, m.a.f.>=3

    n

    varianceratio

    0 50 100 150 200

    0.4

    0.8

    1.2

    1.6

    Tajimas pi, m.a.f.>=3

    n

    varianceratio

    Figure 5. Variance ratio (V arpooled/V arstandard) of the bias cor-rected version of Wattersons and Tajimas depending on thepool size n. We consider pooling both without (minor allele fre-quency (m.a.f.) 1), and with a protection (m.a.f. 2, m.a.f. 3)against sequencing errors. (Only segregating sites with minor allelefrequency m.a.f. above the stated threshold are included.) Thehorizontal green line denotes the break even ratio of one, whereboth the pooled and the classical experiment leads to estimateswith equal variances. Pooling always performs better, as soon asthe size of the pool exceeds the number of separately sequencedindividuals. Solid black line: = 30; red dashed line: = 5.

    Standard setup with k = 10 individuals sequenced separately.

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    23/25

    NGS EXPERIMENTS WITH POOLED SAMPLES 23

    0 50 100 150 200

    0

    2

    4

    6

    8

    10

    Wattersons theta

    n

    bias

    (solid),sd

    (dashed)

    0 50 100 150 200

    0

    2

    4

    6

    8

    10

    Tajimas pi

    n

    bias

    (solid),sd

    (dashed)

    0 50 100 150 200

    0

    10

    30

    50

    Wattersons theta

    n

    squared

    bias(solid),var(dashed)

    0 50 100 150 200

    0

    10

    30

    50

    Tajimas pi

    n

    squared

    bias(solid),var(dashed)

    Figure 6. Bias (solid lines) and variance (dashed lines) of Watter-sons and Tajimas depending on the extent of heterogeneity inprobe material. Black lines: moderate heterogeneity (scale =2); redlines: high heterogeneity (scale =8). In the first row bias and stan-

    dard deviations are plotted for the population genetic estimates.The second row contains the squared bias and the variance, thatadd up to the mean squared error. (Further parameters: = 30,k = 10; log-normal parameters: = 0, = log(scale).)

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    24/25

    24 NGS EXPERIMENTS WITH POOLED SAMPLES

    0 50 100 150 200

    0.0

    0.5

    1.0

    1.5

    Wattersons theta, scale = 2

    n

    MSEratio

    0 50 100 150 200

    0.0

    0.5

    1.0

    1.5

    Tajimas pi, scale = 2

    n

    MSEratio

    0 50 100 150 200

    0.0

    0.5

    1.0

    1.5

    Wattersons theta, scale = 8

    n

    MSEratio

    0 50 100 150 200

    0.0

    0.5

    1.0

    1.5

    Tajimas pi, scale = 8

    n

    MSEratio

    Figure 7. Mean squared error ratio (MSEpooled/MSEstandard) ofWattersons and Tajimas depending on the pool size n andfor = 30. Solid black line: The same amount of probe materialis available for all individuals. Red dashed line: the amount ofprobe material differs from individual to individual according tolog-normal factors. For the top two panels, both curves are nearlyidentical. The median factor is always one, and with a scale of two,about 32% of all probes deviate by a factor of more than the valuegiven by scale. For a scale value of two (for instance), 16% of probesinvolve more than double the median probe amount, and another16% contain less than one half the median amount. (Log-normal

    parameters: = 0, = log(scale), scale {2, 8}.)

  • 8/7/2019 MASSIVELY PARALLEL SEQUENCING OF POOLED DNA

    25/25

    NGS EXPERIMENTS WITH POOLED SAMPLES 25

    100 200 300 400 500

    0.0

    0.5

    1.

    0

    1.

    5

    2.

    0

    2.5

    3.

    0

    n

    variances

    ratios

    Figure 8. Variance ratios (V arpooled/V arstandard) when estimat-ing allele frequencies in the case where the amount of probe ma-terial differs from individual to individual according to log-normalfactors. The median factor is always one, and with a scale of s,about 32% of all probes deviate by a factor of more than s. Forthe scale value s = 2 (for instance) 16% of probes involve morethan double the median probe amount, and another 16% containless than one half the median amount. Ratios smaller than oneindicate that pooling leads to estimates with a smaller variance.Individual sequencing is carried out for ten individuals with an ex-pected coverage of = 10. Scales: s = 2 (red dashed line), s = 4

    (green dotted line), s = 8 (blue dash-dotted line) (Log-normal pa-rameters: = 0, = log(scale), scale {2, 4, 8}.)