the 3rd generation of sequencing...



Page 1: The 3rd Generation of Sequencing Technologies'arep.med.harvard.edu/twister/NRG_010704_1516_JAS.doc · Web viewPlease pause to explain what these biomedical and bioagricultural forces

George: I have used the ‘Track changes’ function to mark my edits and suggestions. In case you are unfamiliar with this function, it can be found under Tools in the Standard toolbar. My queries are in bold type. Terms that could do with a glossary definition are in SMALL CAPS.

(we can place the URL of your homepage in the online version of the article, where it can be linked to your affiliation via your autobiographical sketch (I’ll ask you to provide this later))

Personal genomes and the next generation of DNA sequencing technology.

Jay Shendure^, Rob Mitra*, George M. Church^+

^ Harvard Medical School, 77 Ave Louis Pasteur, Boston, MA 02115, USA.
* Dept. of Genetics, Washington University School of Medicine, 4566 Scott Avenue, St. Louis, MO 63110, USA.
+ Corresponding author, email: [email protected]

Preface

Nearly three decades have passed since the concurrent invention of the Maxam-Gilbert and Sanger methods for DNA sequencing. Automation and miniaturization of Sanger sequencing led to the development of high-throughput sequencing machines, the workhorses that delivered the human genome under budget and ahead of schedule. A new generation of technologies aspires to push DNA sequencing to a new level, such that genomes of individual humans can be sequenced at a cost compatible with routine health care. Here we review technologies under development and discuss the potential impact of the “Personal Genome Project” on both the research community and society.

Introduction

In the Human Genome Project (HGP), early investments in the development of cost-effective sequencing methods undoubtedly contributed to its resounding success. Over the course of a decade, through the refinement, parallelization, and automation of established sequencing methods, the HGP motivated a 100-fold reduction of sequencing costs, from 10 dollars per finished base to 10 finished bases per dollar [Col03]. The relevance and utility of high-throughput sequencing and sequencing centers in the wake of the HGP have been a subject of recent debate. The number of species for which canonical genomes are sequenced and assembled is rapidly growing, but the list of species that are both interesting and relevant is ultimately finite. Nevertheless, a number of academic and commercial efforts are pushing for new ultra-low-cost sequencing (ULCS) technologies that aim to reduce the cost of DNA sequencing by several orders of magnitude. Why? First, whereas many biomedical and bioagricultural goals are not practical with current cost structures, they are quite justifiable at a cost such as 1 million bases per dollar. Second, sequencing at the low costs achieved by the HGP requires the involvement of a large sequencing center, and these are few in number and under heavy demand. Many of the new technologies aim to place high-throughput sequencing within reach of individual labs or core facilities. Finally, such technologies have the potential to bring genome sequencing out of the laboratory and into the clinic, as they approach costs at which the complete sequencing of “personal genomes” might be affordable.

Here we review novel sequencing technologies that are under development, and discuss their relative feasibility and progress to date. The technologies can be generally classified into one of the following: (a) micro-electrophoretic methods, (b) sequencing-by-hybridization, (c) cyclic-array sequencing, and (d) single-molecule sequencing. The field is still relatively small; probably fewer than 10 academic labs and fewer than 10 companies are focused on this area. All technologies are at an early stage of development, such that it is difficult to gauge the time-frame before any given method will truly be practical and live up to expectations. Nevertheless, significant progress has been made with relatively limited resources, such that substantial further investment is likely justified.

Boosting the capacity of research-oriented DNA sequencing was the original motivation for pursuing new technologies. Since 2001, the primary justification for these efforts has gradually shifted to the notion that the technology could become so affordable that sequencing the full genomes of individual patients would be justified from a health-care perspective [Jon01][Pra02][Pen02][Sal03]. Here we use the phrase “Personal Genome Project” (PGP) to describe this goal. What are the potential health-care benefits? At what cost threshold does the PGP become justified? With respect to issues such as consent, confidentiality, discrimination, and patient psychology, what are the risks?

The introduction could be expanded a bit. See comments within.

The success of the Human Genome Project (HGP) illustrates how even modest investments in developing cost-effective sequencing methods can have payoffs for the whole biomedical community. Over the course of a decade, through the refinement, parallelization, and automation of established sequencing technologies, the HGP motivated a 100-fold reduction of sequencing costs, from 10 dollars per base to 10 finished bases per dollar [Col03]. However, for a wide range of biomedical and bioagricultural goals, a strong need is evolving to accelerate the pace of these advances. Please pause to explain what these biomedical and bioagricultural forces are. This has been accompanied by calls to reduce the cost of sequencing a human genome to less than $5000, a price threshold below which individual patients could conceivably afford "personal sequences" [Jon01][Pra02] [Pen02][Sal03]. Please explain succinctly why this goal might be desirable – ie do we need it or is there simply a demand for it? This will impact health care, both directly by providing diagnostic and prognostic markers for the clinical setting, and indirectly by accelerating the pace of basic and clinical biomedical research. To meet this challenge for cheaper and more accurate high-throughput sequencing, scientists and engineers will have to further reduce DNA sequencing costs by a factor of 1,000. The PGP has to be introduced in more detail here – when was it first proposed, who are its proponents, what is the rationale behind the technological advances, how feasible is it and what is the timescale. Several technologies are being developed that could make the Personal Genome Project (PGP) a reality. Here we review potential PGP sequencing technologies and discuss the impact they could have on the biomedical community – could you expand this sentence a little – eg to say that the impact is not just technical or biomedical but also ethical.

We are increasingly seeing the potential fruits of genomics in the clinical setting, but clearly there is a long way to go. Sequencing large numbers of human genomes could itself be a means of expediting this process.

The idea that the cost of sequencing a full human genome (effectively, the sum of all possible genetic tests) could approach the current costs-to-patients of many single genetic tests is certainly appealing (i.e. more for your money). For example, what are the potential benefits of sequencing the genome of a perfectly healthy baby? From multiple perspectives (anonymity, psychological, insurance, etc.) what are the risks?


(This title is a bit grand! How about: ‘Traditional sequencing’ - ?)

In 1977 two research groups, one led by F. Sanger and the other by W. Gilbert, that were familiar with peptide and RNA sequencing methods made a technical leap forward for sequencing in general and DNA sequencing in particular by harnessing the amazing separation power of gel electrophoresis [Gil81,San88]. Although DIDEOXY and MAXAM–GILBERT SEQUENCING were costly relative to today (can you give numbers, for comparison?), many scientists adopted and improved these techniques. By 1985, enough progress (qualify this progress please) had been made that a small group of scientists set themselves the audacious goal of sequencing the entire human genome between 1990 and 2005 [Coo89][Col03]. This declaration met with considerable resistance from the wider community. At the time, many felt the cost of DNA sequencing was far too high and the sequencing community too fragmented to complete such a vast undertaking. However, the critics did not account for the rapid pace of technical and organizational innovation – what, specifically, was achieved? Ie what allowed the draft human sequence to be produced so quickly and cheaply? In 2000, five years ahead of schedule, a useful draft sequence of the human genome was published (ref) at a cost of $300 million, which was significantly under budget. Possibly even more significant was the appearance of a culture of technology and "open" data and software [Col03]. This sentence is rather terse but makes an important point (and that is picked up on later in the piece) and so should be expanded.

Why continue sequencing?

I have suggested a way of linking this section to the next – by all means change the link, provided we have one of some sort.

In addition to Homo sapiens, the genomes of over 160 organisms have been fully sequenced, as well as parts of the genomes of over 100,000 taxonomic species. There are currently 20 billion bases in international databases [Gen03], reflecting the complexity of life on earth. But is sequencing absolutely essential to understand life? Or are there other means by which we can learn about the biosphere and our own species? We argue that although sequencing the biosphere is unnecessary and impractical, whole-genome sequencing — when the technology is made faster and cheaper — will improve existing biological and biomedical investigations and help to develop new genomic and technological studies (see also Box 1).

Impact of sequencing on existing and new studies. By comparing the genomic sequences of different organisms, we are learning a great deal about our own molecular program, as well as that of the other organisms in the biosphere [Ure03,Car03]. This approach will become more powerful as we sequence more genomes [Tho03]. An ultra-low-cost sequencing (ULCS) technology would greatly impact the way biologists work. Mutagenic or population genetic screens in model and non-model organisms would be more powerful if one could sequence genomes for responsible mutations in crosses. In addition to making well-established techniques more powerful, improved sequencing power could open up new research fields. For example, one could directly query the antibody diversity that is generated in response to disease. ULCS would benefit SYNTHETIC BIOLOGY and genome engineering, ranging from selecting new enzymes to building??? new chromosomes, both of which are powerful tools for perturbing or designing complex biological systems. Whether the goal is the accurate synthesis of DNA/RNA?? or combinatorial scrambling (please paraphrase), these large syntheses will have many unknowns (meaning?), and ULCS would help determine or ensure engineered DNA design specifications. Because engineered sequences are subject to the forces of mutation and selection, it will be important to monitor base changes as they occur (meaning?). Even further beyond normal genomes than the above synthetics lie DNA computing [Bra02] and DNA as ultracompact memory (nm^3/bit rather than conventional 1e11 nm^3/bit storage media like DVDs) (ref?). Please say a bit more about what DNA computing is and the use of DNA in ultracompact memory - readers might not have heard of these applications.


The introduction mentions bioagricultural goals – could these be covered here?

The personal genome project and human health. Perhaps the most compelling reason to pursue the goal of achieving low cost, high-throughput sequencing technology is the impact it could have on human health. Inexpensive (defined as ?) sequencing would give us access not only to our inherited genetic make-up but to many aspects of our changing internal environment including how our body responds to pathogens and allergens and the genetic changes associated with cancer. It is occasionally claimed that all that we can afford (and hence all that we need) is information on "common" single nucleotide polymorphisms, SNPs, or the arrangements of these (haplotypes) [Gib03] in order to understand so-called "multifactorial" or "complex" diseases [Hol00]. In a non-trivial sense all diseases are "complex". As we get better at genotyping and phenotyping we simply get better at finding the factors contributing to ever lower penetrance and expressivity. A focus on "common" alleles will probably be successful for alleles maintained in human populations by heterozygote advantage (such as the textbook relationship between sickle cell anaemia and malaria) but would miss most of the genetic diseases documented so far [Vit03]. In any case, even for diseases that are amenable to the HAPLOTYPE MAPPING approach, ultra high throughput sequencing would allow geneticists to move more quickly from a haplotype that is linked to the disease to the causative SNP. Additionally, many candidate loci could be investigated by sequencing them across large populations. For diseases caused by multiple but rare mutations, it is absolutely critical to directly sequence the causative SNP, and a whole genome sequencing technology would provide geneticists with the ability to do this. What advances in terms of speed, cost & accuracy would make it possible to achieve this goal?

Another medically important area in which ULCS could make a large impact is cancer biology. The ability to sequence and compare complete genomes from many normal, neoplastic The comprehensive detection of all somatic mutations (base changes, genomic rearrangements, and epigenetic variation) in the genome should capture most, if not all, of the factors involved in cancer initiation and progression, and allow us to exhaustively catalogue the molecular pathways and checkpoints that are inactivated as a tumor develops.

Is the PGP feasible?

One reason for the overwhelming success of sequencing is that the number of nucleotides that can be sequenced at a given price has increased exponentially for the past 30 years (Figure 1). This exponential trend is by no means guaranteed, and realizing a PGP in the next five years [who has set this timescale?] probably requires a higher commitment to technology than was available in the pragmatic and production-oriented HGP (Figure 1). How might this be achieved? Obviously we cannot review technologies that are secret, but a number of truly innovative approaches have now been made public, marking this as an important time to compare and to conceptually integrate these innovative strategies ?????. We review four major approaches below (see Fig 2 – this figure should contain simple diagrams that illustrate the four techniques).

Sequencing technologies

This section should be fleshed out a little.

Micro-electrophoretic sequencing. Nearly all publicly available DNA sequences have been obtained using technology that is based on the dideoxy sequencing method developed by Fred Sanger’s group? 26 years ago [ref]. Typically 99.99% accuracy can be achieved with as few as three RAW READS. The Mathies and Matsudaira groups are currently investigating just how inexpensive Sanger sequencing can be made. By using microfabrication techniques developed in the computer industry, they are building miniature DNA sequencers that can rapidly separate the dideoxynucleotide ladders. They are also working to create microfabricated devices that can perform PCR, allowing several sequencing steps to be integrated. Read lengths of up to 900 bases on a 384-lane microfabricated sequencing device have been demonstrated [Emr02], and work is being undertaken to create a single device that will perform DNA amplification, purification and sequencing in an integrated fashion.

Hybridization sequencing. Other groups are exploiting the tendency of single-stranded DNA to hybridize preferentially to a complementary sequence. This property can be used in several ways to sequence DNA. One approach, demonstrated by Hyseq, is to immobilize the DNA to be sequenced on a membrane or glass chip and then perform serial hybridizations with 6-mer oligonucleotides. By monitoring which 6-mers bind, the unknown sequence can be decoded. This strategy can be used for both resequencing and de novo sequencing [Drm01]. A more widespread strategy for this ‘hybridization resequencing’ method is to attach the oligonucleotides, instead of the sample DNA, to the glass surface, an approach first commercialized in the Affymetrix HIV chip in 1995 [Lip95]. Perlegen greatly extended this approach (how?) to allow the resequencing of human genomes [Pat01]. The current maximum density of the chip is about one oligonucleotide "feature" per 5 microns (each feature contains approximately 100,000 copies of a 25 base pair oligonucleotide). For each base pair to be resequenced in the reference sequence there are four features on the chip. The middle base of these four features is either an “A”, “C”, “G”, or “T”. The sequence that flanks the variable middle base is identical for all four features and matches the reference sequence. By hybridizing labeled sample DNA to the chip and determining which of the four features shows the strongest signal for each base pair in the reference sequence, a DNA sample can be rapidly resequenced.
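The four-feature readout described above can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the intensity values, the function names `call_bases` and `differences`, and the `min_ratio` threshold are hypothetical and are not taken from Affymetrix or Perlegen software.

```python
BASES = "ACGT"

def call_bases(reference, intensities, min_ratio=1.5):
    """Call one base per reference position from four probe intensities.

    intensities[i] holds the hybridization signals of the four features
    whose middle base is A, C, G and T (in that order) at position i.
    A position is reported as 'N' when the strongest signal does not exceed
    the runner-up by min_ratio (an assumed, arbitrary threshold).
    """
    if len(reference) != len(intensities):
        raise ValueError("need one set of four intensities per reference base")
    calls = []
    for signals in intensities:
        ranked = sorted(zip(signals, BASES), reverse=True)
        (best, base), (second, _) = ranked[0], ranked[1]
        calls.append(base if best >= min_ratio * second else "N")
    return "".join(calls)

def differences(reference, called):
    """List (position, reference base, called base) for confident mismatches."""
    return [(i, r, c) for i, (r, c) in enumerate(zip(reference, called))
            if c != "N" and c != r]

# Made-up intensities for a 4-base reference; position 2 carries a G->A change.
reference = "ACGT"
intensities = [
    (950, 40, 35, 30),   # A strongest
    (50, 900, 45, 60),   # C strongest
    (870, 30, 90, 40),   # A strongest although the reference base is G
    (300, 20, 25, 310),  # A and T too close to separate (e.g. a heterozygote)
]
called = call_bases(reference, intensities)
print(called)
print(differences(reference, called))
```

In this toy input the third position is called as a substitution and the fourth is left ambiguous, which is the kind of mixed signal a heterozygous site would be expected to produce.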

Cyclic array sequencing. This class includes methods previously classified under the name “sequencing-by-synthesis”. However, since nearly all of the methods reviewed here have critical "synthesis" steps, we choose to emphasize the cyclic nature of this class. Please add a sentence here to explain what the ‘cyclic’ that defines this class refers to. In 1984, three key components (what do you mean by ‘components’ – were these 3 different technologies or three features of one technology?) of this class emerged: multiplexing in space and time, avoiding bacterial clones, and cycling [Chu84]. This led to the first commercially sold genome sequence (which one? ref?); however, its electrophoretic component held it back (in what way? Too slow? Too expensive?). In 1996, Ronaghi and Nyren introduced Pyrosequencing, in which a sequencing primer is extended by adding a single type of nucleotide triphosphate. Extension is detected by monitoring pyrophosphate release. Repeated cycles of nucleotide addition are performed, and the sequence is determined by recording which bases are added. This approach to DNA sequencing is attractive because it is non-electrophoretic and miniaturizable from the original 9 mm spacing??. Mostafa Ronaghi and Ron Davis at the Stanford Genome Center, and 454, a company located in New Haven (Connecticut, USA), are developing methods to perform single molecule PCR in 50 micron wells and then sequence these amplification products [Lea03]. This has been used recently to sequence an Adenoviral genome [Sar03] (please give an idea of cost/speed). By integrating the cloning, amplification and sequencing steps into a miniaturized format, they expect a significant improvement in the cost (numbers?). One way to achieve this integration has been to use polymerase colonies, or polonies [Mit99] (see Figure 2c). In this technology, millions of individual DNA molecules are amplified by the polymerase chain reaction (PCR) in an acrylamide gel attached to the surface of a glass microscope slide. Because the acrylamide restricts the diffusion of the DNA, each single molecule included in the reaction produces a colony of DNA (a polony). This can also be accomplished on beads in oil/water emulsions [Dre03]. With sizes as small as 1 micron, a conventional slide can carry up to 2 billion beads. To sequence these polonies we perform sequential single base extensions with reversible dye-labeled deoxynucleotides [Mit03B], with READ-LENGTHS of at least 28 cycles.
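The pyrosequencing cycle described above, in which one nucleotide type is added per step and the signal grows with the number of bases incorporated, can be illustrated with a short Python sketch. The flow order, the noise values and the function names are assumptions chosen for illustration, and rounding each signal to the nearest integer is a simplification rather than the base-calling method of any group cited here.

```python
def simulate_flowgram(synthesized, flow_order="TACG", cycles=8):
    """Ideal, noise-free signals for a strand being synthesized.

    `synthesized` is written as the sequence of the newly made strand
    (a simplification: the real template is its complement).
    Each flow adds one nucleotide type; the signal equals the length of
    the homopolymer run that the flow extends (0 if nothing is added).
    """
    signals, pos = [], 0
    for flow in range(cycles * len(flow_order)):
        base = flow_order[flow % len(flow_order)]
        run = 0
        while pos < len(synthesized) and synthesized[pos] == base:
            run += 1
            pos += 1
        signals.append(float(run))
        if pos == len(synthesized):
            break
    return signals

def decode_flowgram(signals, flow_order="TACG"):
    """Turn per-flow signals back into sequence by rounding each signal."""
    seq = []
    for flow, signal in enumerate(signals):
        base = flow_order[flow % len(flow_order)]
        seq.append(base * int(round(signal)))
    return "".join(seq)

synthesized = "TAGGCT"
flows = simulate_flowgram(synthesized)
print(flows)
# Add mild, made-up noise: the GG run must still be read as intensity 2.
noisy = [s + 0.2 if s > 0 else s + 0.1 for s in flows]
print(decode_flowgram(noisy))
```

The GG run appears as a single flow of double intensity, which is why homopolymer stretches depend on careful quantitation of the signal rather than on simple presence or absence.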


Please link this para to the previous one.

In Massively Parallel Signature Sequencing (MPSS), developed by Lynx, each single molecule of DNA in a library is labeled with a unique oligonucleotide tag. Next, the library is amplified and hybridized to oligonucleotides combinatorially synthesized on glass beads. After the hybridization reaction, every bead contains approximately 100,000 DNA molecules, each amplified from the same single library molecule. These are sequenced by cycles of fluorescent linker ligation and removal of 4 base pairs with a restriction enzyme. The accuracy is quite high and the 20 base pair read-lengths are adequate for many purposes [Bre00].

Single molecule sequencing. Several groups are attempting to sequence single molecules of DNA. This approach could lead to cost reductions by eliminating the cloning and amplification steps. In addition, only small reaction volumes (how small?) are required. One such single molecule technology is nanopore sequencing, which is being developed by Agilent and by the Branton and Deamer groups. In this technology, as DNA passes through a 1.5 nm pore, different base pairs obstruct the pore to varying degrees, resulting in changes in the electrical conductance of the pore. The accuracies of base calling range from 60% for single events to 99.9% for 15 events [Win03]. However, the method is limited so far to the terminal bp of a specific type of hairpin.

The cyclic fluorescent base extension used on microscopic polonies described above can be scaled down further to single molecules. Solexa is developing a technology in which millions of single DNA molecules can be sequenced in parallel using a sequencing-by-synthesis protocol. To do so, the sequencing primer is annealed to DNA templates and single base extensions are performed using nucleotides with reversible terminators and dye-labels. This approach allows single molecules to be detected with an impressive signal to noise ratio [ref?]. The Quake group has demonstrated that sequence information can be obtained from single DNA molecules using serial single base extensions and the clever use of energy transfer -?? to improve their signal to noise ratio [Bra03]. Watt Webb’s group at Cornell has recently proposed a strategy in which DNA sequence is acquired by detecting, in real time, the incorporation of dye-labeled nucleotide triphosphates by DNA polymerase. They achieve this by extending a sequencing primer with fluorescent nucleotides in a nanofabricated structure, which they have termed a ‘zero mode waveguide’. By performing the reaction in a zero-mode waveguide, only a zeptoliter volume of the reaction is excited by the laser, so that in principle, one is only detecting fluorescent triphosphates that reside in the DNA polymerase’s active site [Lev03]. The Genovoxx team has shown the possibility of using standard optics and has given details on one class of reversible terminators [Fag04].

It is important to note that many of the advantages of "single molecules" can be achieved after a bit of nucleic acid amplification. How are these two sentences linked? The first mentions the advantages of amplifying the single mols but then the following one develops the use of single mols. In particular, single molecules allow determination of combinations of structures which are hard to disentangle in pools of molecules. For example, alternative RNA splicing contributes extensively to protein diversity and regulation but is poorly assayed by pooled RNAs on microarrays, whereas amplified single molecules allow accurate measurement of over 1000 alternative spliceforms in RNAs like CD44 [Zhu03]. Similarly, haplotype (or diploid genotype) combinations of SNPs can be determined accurately from single cells or DNA molecules [Mit03A]. The ability to amplify large DNAs by MULTIPLE DISPLACEMENT AMPLIFICATION (MDA) or WHOLE GENOME AMPLIFICATION (WGA) is improving rapidly [Dea02][Nel03]. How is the improvement being achieved? This will enhance our ability to get complete sequence from single cells even when they are dead or impossible to grow in culture [Sor04][Roo04].

Technologies compared.


This section should be expanded and made much clearer by listing the specific goals that are desirable (in terms of speed, accuracy and cost) and assessing their feasibility by referring to the previously listed technologies. See also accompanying email.

A key consideration is cost per base. Accuracy goals will depend on the application, ranging from 21 base RNA tags [Sah02] to nearly error-free genomes (error rate <1e-10). If the individual read errors are truly random with probability p, then the overall error rate for r reads is roughly (1-p)p^(r-1). So if one needs an error rate below 1e-5 and the raw error rate is 10%, then one needs six independent reads. Assuming PGP technology contenders will all have highly automated miniaturized processes, labor and materials costs will be minor relative to the equipment amortization costs. Integrated genomics devices typically cost $50K to $500K but could approach the cost of a $2K computer. The productivity of such a device will depend on the efficiency of data use, and is typically lower than the maximum set by data transfer rates (about 100 Mbits per second). For example, low initial accuracy could cost a factor of 10 to 100; avoiding spatial or tag overlaps, as in MPSS and 454, can mean another 10 to 100. Real-time monitoring of asynchronous processes can cost another similar factor. Chemical kinetics (e.g. electrophoresis and oligonucleotide synthesis) are typically much slower than computers, so it is crucial that these not be costly to parallelize.
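The read-depth arithmetic above can be checked with a few lines of Python. The sketch takes the approximation (1-p)p^(r-1) from the text at face value; the function names are illustrative, and because real read errors are rarely fully independent, the result is best treated as a lower bound on the number of reads needed.

```python
def consensus_error(p, r):
    """Approximate overall error after r independent reads, each wrong with
    probability p, using the (1-p) * p**(r-1) estimate quoted in the text.
    The estimate is only informative for r >= 2."""
    if not 0 < p < 1 or r < 1:
        raise ValueError("need 0 < p < 1 and r >= 1")
    return (1 - p) * p ** (r - 1)

def reads_needed(p, target, max_reads=1000):
    """Smallest r whose approximate overall error falls below target."""
    for r in range(1, max_reads + 1):
        if consensus_error(p, r) < target:
            return r
    raise ValueError("target not reached within max_reads")

# Raw error rate of 10% and a target error rate below 1e-5, as in the text.
for r in range(2, 7):
    print(r, consensus_error(0.10, r))
print(reads_needed(0.10, 1e-5))
```

With p = 0.10 and a target of 1e-5 the search returns six reads, since five reads give roughly 9e-5 and six give roughly 9e-6.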

Implications of sequencing human genomes

Please structure this section under separate subheadings to make it more organized - eg: Biomedical pros and cons; Ethical and social issues (access; public versus private; anonymity; linking genomes to phenomes). Can you make recommendations for how we should operate in addition to raising the main concerns?

As sequencing costs drop we will need champions of privacy and public access. It is increasingly evident that diverse data must be integrated in order to understand biological processes and to apply these to clinical diagnosis, therapeutics and forensics. As with other recording technologies (e.g. video), the rights of the public need to be balanced with those of the individual. How may the gathering of increasing amounts of genetic information be made compatible with ethical and legal requirements for privacy? Anything approaching a comprehensive genotype or phenotype (including molecular phenotypes) can reveal subjects’ identity as surely as conventional identifiers like name or video data. Just as individuals are discouraged from masking their identity to video surveillance, there may be cases where "masking" of genomics information should not be permitted. Even if laws continue to be enacted to reduce "genetic" discrimination for insurance coverage and in the workplace [ORNL03], as long as job applicants and citizens seeking insurance expect favorable treatment based on assessment of their physical state, "functional genomics" may play a larger role in improving accuracy than inherited sequences alone. At the other end of the spectrum from protecting privacy, there will be individuals willing to make their genome and phenome publicly available. While anonymous data has served the HGP and other studies well, the approach has limitations. How can comprehensive identifying genetic information be gathered and made available to the research community? A few examples of non-anonymous public data sets exist. Craig Venter has published his own genome [Ven03]. A comprehensive identifying set of CT, MRI (spell out and define in glossary) and cryosection images (at 0.33 to 1 mm resolution) was made from Joseph Jernigan shortly after his execution [NLM03]. Other examples include EEG and neuroanatomy of Albert Einstein [Wit99], and many non-anonymous historical DNA analyses. A variety of motivations ranging from altruism to "early adopter" technophilia may arise to encourage individuals to make their comprehensive identifying data public. What subset of increasingly standardized [ref] electronic medical records could such individuals make public? Could these eventually be used to augment expensive epidemiological studies [Epi]? Currently we have few or no examples of a publicly available human genome plus phenome [Fre03]. A framework survey and forum for potential volunteers to discuss risks and benefits might be a crucial reality check at this point [PGP]. Will the response be tiny or will it be as resounding as the Public Library of Science, open source, and Free Software Foundation [FSF]?

Please add a Conclusions or Future directions section to round off the article – see accompanying letter.


FIGURES

[Figure 1 plot. Vertical axis (log scale): bp/$, from 0.001 to 1,000,000. Horizontal axis: year, from 1970 to 2010.]

Figure 1. Exponential growth scenarios in the cost-effectiveness of DNA sequencing. The magenta line indicates a continuation of the long-term exponential growth in the cost-effectiveness of DNA sequencing, measured in base pairs per dollar. The blue line indicates the aggressive technology development curve that would be required to achieve the PGP by 2010.
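The exponential trend in Figure 1 can be made concrete with a small calculation. The sketch below is illustrative only: it takes the HGP figure quoted in the Introduction (a 100-fold improvement over roughly a decade, from 0.1 to 10 finished bases per dollar) and derives the implied doubling time, then extrapolates at that same rate. The function names are hypothetical, and the extrapolation assumes the rate stays constant, which the figure itself treats as only one of two scenarios.

```python
import math

def doubling_time(years, fold_improvement):
    """Years needed for bp/$ to double, given a fold improvement over a span of years."""
    return years * math.log(2) / math.log(fold_improvement)

def project(bp_per_dollar, years_ahead, doubling_years):
    """Extrapolate bp/$ forward, assuming a constant doubling time."""
    return bp_per_dollar * 2 ** (years_ahead / doubling_years)

# HGP-era trend quoted in the Introduction: 100-fold over about 10 years.
hgp_doubling = doubling_time(10, 100)

if __name__ == "__main__":
    print(f"Implied doubling time: {hgp_doubling:.2f} years")
    # Continuing the same trend for another decade from 10 bp/$ (an assumption).
    print(f"Projected bp/$ after 10 more years: {project(10, 10, hgp_doubling):.0f}")
```

A steeper curve, such as the blue line in Figure 1, corresponds to a shorter doubling time passed to `project`.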


Figure 2. Examples of raw data from each of the 4 major classes of potential PGP technologies. Top: Electrophoretic [Ron99], including a secondary structure artifact at the third base. Second: Hybridization sequencing [Wan98], showing the mixed sequence of a heterozygote. Third: Cyclic sequencing of the same sequence as in the top panel, showing no structure artifact, but the need for careful quantitation at homopolymer runs (the GG segment is double intensity). Bottom: Single molecule cyclic fluorescent base extensions, showing positive base calls at high Cy5 to Cy3 (red to green) ratios.


TABLE 1. COST & THROUGHPUT COMPARISON OF SEQUENCING TECHNOLOGIES

Note: Many values here are our speculative guesses at where the technologies can reasonably go in 3 years and merely serve to encourage ongoing comparisons and cost modeling.

Method                          Liters per bp   Read length   Error rate   Test set (bp)   Cost per device ($)   Raw bp/$   bp/hr
Capillary fluidics              1E-06           600           <0.1%        1E11            350K                  100        80K
  (ABI, Amersham, GenoMEMS, Caliper, RTS)
Sequencing by hybridization     1E-12           1             <5%          1E9             200K                  1K         1M
  (Affymetrix, Perlegen, Xeotron)
Mass spectrometry               1E-9            1             <5%          1E9             2M                    3          10K
  (Sequenom, Bruker)
Unamplified single molecule     1E-23           >>40          <10%         >80             30K - 1000K           100        180K
  (Pore: Agilent/Harvard; Fluor: US Genomics, Solexa; FRET: VisiGen, Mobious, CalTech)
In vitro amplification with multiplex sequencing:
  Lynx                          1E-15           20            <3%          1E7             300K                  5K         1M
  Pyrosequencing                1E-6            >50           <1%          1E6             100K                  5K         100K
  FISSEQ                        1E-13           >30?          <1%          40              <90K                  >1M        1M?
  (Other examples: Parallele, 454, Illumina)
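To put the raw bp/$ column of Table 1 in perspective, the values can be converted into an approximate raw cost per genome. The following is a minimal sketch, not part of the authors' cost model: it assumes a haploid human genome of roughly 3E9 bp and a caller-chosen raw coverage, and it takes the lower bound for the FISSEQ entry listed as ">1M". The bp/$ figures are the speculative values from the table, so the outputs inherit that uncertainty.

```python
# Assumption (not from the table): a haploid human genome of roughly 3e9 bp.
GENOME_BP = 3e9

# Raw bp per dollar as listed in Table 1. The FISSEQ entry is ">1M";
# its lower bound is used here.
RAW_BP_PER_DOLLAR = {
    "Capillary fluidics": 100,
    "Sequencing by hybridization": 1_000,
    "Mass spectrometry": 3,
    "Unamplified single molecule": 100,
    "Lynx": 5_000,
    "Pyrosequencing": 5_000,
    "FISSEQ": 1_000_000,
}

def raw_cost_dollars(bp_per_dollar, genome_bp=GENOME_BP, coverage=1.0):
    """Dollars needed to read genome_bp * coverage raw bases at the given bp/$."""
    return genome_bp * coverage / bp_per_dollar

if __name__ == "__main__":
    for method, rate in RAW_BP_PER_DOLLAR.items():
        print(f"{method:30s} ${raw_cost_dollars(rate):>14,.0f} per 1x genome")
```

Raising `coverage` scales every figure linearly. The calculation ignores read length and error rate, both of which affect how much raw sequence a finished genome actually requires.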

Box 1. Partial List of Applications of Nucleic Acid Technology

- Sequencing of individual human genomes as a component of preventative medicine [Pra02][Pen02].
- In vitro and in situ gene expression profiling across the full range of spatiotemporal variables in the development of a multicellular organism [Rey02].
- Cancer: determination of comprehensive mutation sets for individual clones [Hah02]; loss of heterozygosity analysis [Pau99]; profiling sub-types for diagnosis and prognosis [Gol99][Ram03B].
- Temporal profiling of B- & T-cell receptor diversity, both clinically and in lab antibody selection.
- Identification of known and novel pathogens [Web02]; biowarfare sensors [Ste02B].
- Detailed annotation of the human genome via PHYLOGENETIC SHADOWING [Bof03].
- Quantitation of alternative splice variants in transcriptomes of higher eukaryotes [Rob02].
- Definition of epigenetic structures (e.g. chromatin modifications and methylation patterns) [Rob02b].
- Rapid hypothesis testing for genotype-phenotype associations.
- In situ or ex vivo discovery of patterns of cell lineage [Yat01][Dym02].
- Characterization of microbial strains subjected to extensive “directed evolution” [Len03][Coo03].
- Exploration of microbial diversity towards agricultural, environmental, and therapeutic goals [Gil02].
- Annotation of microbial genomes via selectional analysis of tagged insertional mutants [Bad01].
- Aptamer technology for diagnostics and therapeutics [Cer02].
- DNA Computing [Bra02].

References

The references should be cited as superscript numbers in the text, in the order in which the references are cited. The style that should be used in the reference list is: 1. Author, H. F., Author, J. D. & Author, D. P. Title. Journal abbreviation 234, 52-60 (2002). 2. Author, K. & Author, N. R. Title. Journal abbreviation 4, 453-468 (2002). 3. Author, M. W. Title. Journal abbreviation 25, 1-25 (2002). 4. Author, O. B. & Author, E. A. Title. Journal abbreviation 237, 896-898 (1987). i.e. journal abbreviation in italics, volume number in bold. If there are six or more authors for a reference, only the first author should be listed, followed by 'et al.'. Journal name abbreviations are followed by a full stop. Please include full page ranges.

[Bad01] 1. Badarinarayana, V., Estep, P. W., Shendure, J., Edwards, J., Tavazoie, S., Lam, F. & Church, G.M. Selection analyses of insertional mutants using subgenic-resolution arrays. Nature Biotechnology 1, 1060-5 (2001).

[Bra02] 2. Braich, R. S., Chelyapov, N., Johnson, C., Rothemund, P. W. & Adleman, L. Solution of a 20-variable 3-SAT problem on a DNA computer. Science 296, 499-50 (2002).

[Bra03] 3. Braslavsky, I., Hebert, B., Kartalov, E. & Quake, S. R. Sequence information can be obtained from single DNA molecules. Proc Natl Acad Sci U S A. 100, 3960-4 (2003).

[Bre00] 4. Brenner S., et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat Biotechnol. 18, 630-4 (2000).

[Bof03] 5. Boffelli, D., McAuliffe, J., Ovcharenko, D., Lewis, K. D., Ovcharenko, I., Pachter, L. & Rubin, E. M. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science. 299, 1391-4 (2003).

[Car03] 6. Carroll, S. B. Genetics and the making of Homo sapiens. Nature 422, 840-57 (2003).

[Cer02] 7. Cerchia, L., Hamm, J., Libri, D., Tavitian, B., & de Franciscis, V. Nucleic acid aptamers in cancer medicine. FEBS Lett 528, 12-6 (2002).

[Chu84] 8. Church, G. M. & Gilbert, W. Genomic sequencing. Proc Natl Acad Sci U S A 81, 1991-5 (1984).

[Col03] 9. Collins, F. S., Morgan, M. & Patrinos, A. The human genome project: lessons from large-scale biology. Science 300, 286-90 (2003).

[Coo03] 10. Cooper, T. F., Rozen, D. E. & Lenski, R. E. Parallel changes in gene expression after 20,000 generations of evolution in Escherichia coli. Proc Natl Acad Sci U S A 100, 1072-7 (2003).

[Coo89] 11. Cook-Deegan, R. M. The Alta summit, December 1984. Genomics 5, 661-3 (1989).


[Dea02] 12. Dean, F. B., Hosono S., Fang L., Wu X., Faruqi A. F., Bray-Ward P., Sun Z., Zong Q., Du Y., Du J., Driscoll M., Song W., Kingsmore S.F., Egholm M. & Lasken R.S. Comprehensive human genome amplification using multiple displacement amplification. Proc Natl Acad Sci U S A. 99, 5261-6 (2002).

[Dre03] 13. Dressman, D., Yan, H., Traverso, G., Kinzler, K. W. & Vogelstein, B. Transforming single DNA molecules into fluorescent magnetic particles for detection and enumeration of genetic variations. Proc Natl Acad Sci U S A. ??, 8817-22 (2003).

[Drm01] 14. Drmanac R., Drmanac S., Baier J., Chui G., Coleman D., Diaz R., Gietzen D., Hou A., Jin H., Ukrainczyk T. & Xu C. DNA sequencing by hybridization with arrays of samples or probes. Methods Mol Biol. 170, 173-9 (2001).

[Dym02] 15. Dymecki, S. M., Rodriguez, C. I. & Awatramani, R. B. Switching on lineage tracers using site-specific recombination. Methods Mol Biol 18, 309-34 (2002).

[Emr02] 16. Emrich, C. A., Tian, H., Medintz, I. L. & Mathies, R. A. Microfabricated 384-lane capillary array electrophoresis bioanalyzer for ultrahigh-throughput genetic analysis. Anal Chem 74, 5076-83 (2002).

[Epi] 17. http://www.nurseshealthstudy.org/

[Fag04] 18. Fagin, U., Hennig, C. & Cherkasov, D. Massive parallel single molecule sequencing with reversibly terminating nucleotides (2004).

[Fra02] 19. Frank, T. S., et al. Clinical characteristics of individuals with germline mutations in BRCA1 and BRCA2: analysis of 10,000 individuals. J Clin Oncol 20, 1480-90.

[Fre03] 20. Freimer, N. & Sabatti, C. The Human Phenome Project. Nat Genet 3, 15-21 (2003).

[FSF] 21. http://www.plos.org/cgi-bin/plosSigned.pl http://sourceforge.net/ http://www.gnu.org/

[Gen03] 22. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?holding=hulib&db=Nucleotide http://wit.integratedgenomics.com/GOLD/ http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi?uncultured=hide&unspecified=hide&m=0

[Gib03] 23. Gibbs, R. A., et al. The International HapMap Project. Nature 426, 789-96 (2003).

[Gil81] 24. Gilbert, W. DNA sequencing and gene structure. Science 1981, 1305-12 (1981).

[Gil02] 25. Gillespie, D. E., Brady, S. F., Bettermann, A. D., Cianciotto, N. P., Liles, M. R., Rondon, M. R., Clardy, J., Goodman, R. M. & Handelsman, J. Isolation of antibiotics turbomycin A and B from a metagenomic library of soil microbial DNA. Appl Environ Microbiol 68, 4301-6 (2002).

[Gol99] 26. Golub, T. R., et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7 (1999).

[Hah02] 27. Hahn, W. C. & Weinberg, R. A. Mechanisms of Disease: Rules for Making Human Tumor Cells. N Engl J Med 34, 1593-1603 (2002).

[Han00] 28. Hanahan, D. & Weinberg, R. A. The Hallmarks of Cancer. Cell 10, 51-70 (2000).

[Hol00] 29. Holtzman, N. A. & Marteau, T. M. Will genetics revolutionize medicine? N Engl J Med 343, 141-4 (2000).

[Jon01] 30. Joneitz, E. Personal Genomes. Technology Review (2001).


[Kur03] 31. Kurzweil, R. http://www.kurzweilai.net/meme/frame.html?main=/articles/art0184.html

[Lea03] 32. Leamon, J. H., Lee, W. L., Tartaro, K. R., Lanza, J. R., Sarkis, G. J., deWinter, A. D., Berka, J. & Lohman, K. L. A massively parallel PicoTiterPlate based platform for discrete picoliter-scale polymerase chain reactions. Electrophoresis 24, 3769-77 (2003).

[Len03] 33. Lenski, R. E., Winkworth, C. L. & Riley, M. A. Rates of DNA Sequence Evolution in Experimental Populations of Escherichia coli During 20,000 Generations. J Mol Evol 56, 498-508 (2003).

[Lev03] 34. Levene, M. J., Korlach, J., Turner, S. W., Foquet, M., Craighead, H. G. & Webb, W. W. Zero-mode waveguides for single-molecule analysis at high concentrations. Science 299, 682-6 (2003).

[Lip95] 35. Lipshutz, R. J., Morris, D., Chee, M., Hubbell, E., Kozal, M. J., Shah, N., Shen, N., Yang, R. & Fodor, S. P. Using oligonucleotide probe arrays to access genetic diversity. Biotechniques 19, 442-7 (1995).

[Mil03] 36. Mikkilineni, V., Mitra, R. D., Merritt, J., DiTonno, J. R., Lemieux, B., Church, G. M., Ogunnaike, B. & Edwards, J. S. Digital Analysis of Gene Expression for quantitative monitoring of gene expression (in prep) (2003).

[Mit99] 37. Mitra, R. D. & Church, G. M. In situ localized amplification and contact replication of many individual DNA molecules. Nucleic Acids Res 27, e34 (1999).

[Mit03A] 38. Mitra, R. D., Butty, V., Shendure, J., Williams, B. R., Housman, D. E. & Church, G. M. Digital Genotyping and Haplotyping with Polymerase Colonies. Proc Natl Acad Sci U S A 100, 5926-5931 (2003).

[Mit03B] 39. Mitra, R. D., Shendure, J., Olejnik, J., Olejnik, E. K. & Church, G. M. Fluorescent in situ Sequencing on Polymerase Colonies. Anal Biochem 320, 55-65 (2003).

[Nel03] 40. Nelson, J. R., Cai, Y. C., Giesler, T. L., Farchaus, J. W., Sundaram, S. T., Ortiz-Rivera, M., Hosta, L. P., Hewitt, P. L., Mamone, J. A., Palaniappan, C. & Fuller, C. W. TempliPhi, phi29 DNA polymerase based rolling circle amplification of templates for DNA sequencing. Biotechniques Suppl, 44-7 (2002).

[NLM03] 41. National Library of Medicine. The Visible Human Project. http://www.nlm.nih.gov/research/visible/visible_human.html

[ORNL03] 42. Oak Ridge National Laboratory. Human Genome Project Information: Genetics Privacy and Legislation. http://www.ornl.gov/TechResources/Human_Genome/elsi/legislat.html

[Pat01] 43. Patil, N., et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294, 1719-23 (2001).

[Pau99] 44. Paulson, T. G., Galipeau, P. C. & Reid, B. J. Loss of heterozygosity analysis using whole genome amplification, cell sorting, and fluorescence-based PCR. Genome Res 9, 482-91 (1999).

[Pen02] 45. Pennisi, E. Gene Researchers Hunt Bargains, Fixer-Uppers. Science 29, 735-6 (2002).

[PGP] 46. http://arep.med.harvard.edu/PGP/pub.html

[Pra02] 47. Pray, L. A cheap personal genome? The Scientist (Oct 2002).

[Raj03] 48. Rajagopalan, H., Nowak, M. A., Vogelstein, B. & Lengauer, C. The significance of unstable chromosomes in colorectal cancer. Nat Rev Cancer 3, 695-701 (2003).

[Ram03B] 49. Ramaswamy, S., et al. A molecular signature of metastasis in primary solid tumors. Nat Genet 33, 49-54 (2003).


[Rey02] 50. Reymond, A., Marigo, V., Yaylaoglu, M. B., Leoni, A., Ucla, C., Scamuffa, N., Caccioppoli, C., Dermitzakis, E. T., Lyle, R., Banfi, S., Eichele, G., Antonarakis, S. E. & Ballabio, A. Human chromosome 21 gene expression atlas in the mouse. Nature 420, 582-6 (2002).

[Rob02] 51. Roberts, G. C. & Smith, C. W. Alternative splicing: combinatorial output from the genome. Curr Opin Chem Biol 6, 375-83 (2002).

[Rob02b] 52. Robyr, D., Suka, Y., Xenarios, I., Kurdistani, S. K., Wang, A., Suka, N. & Grunstein, M. Microarray deacetylation maps determine genome-wide functions for yeast histone deacetylases. Cell 109, 437-46 (2002).

[Ron98] 53. Ronaghi, M., Uhlen, M. & Nyren, P. A sequencing method based on real-time pyrophosphate. Science 281, 363-365 (1998).

[Roo04] 54. Rook, M. S., Delach, S. M., Deyneko, G., Worlock, A. & Wolfe, J. L. Whole Genome Amplification of DNA from Laser Capture-Microdissected Tissue for High-Throughput Single Nucleotide Polymorphism and Short Tandem Repeat Genotyping. Am J Pathol 164, 23-33 (2004).

[Sah02] 55. Saha, S., Sparks, A. B., Rago, C., Akmaev, V., Wang, C. J., Vogelstein, B., Kinzler, K. W. & Velculescu, V. E. Using the transcriptome to annotate the genome. Nat Biotechnol 20, 508-12 (2002).

[Sal03] 56. Salisbury, M. W. 14 Sequencing Innovations that could change the way you work. Genome Technology (2003).

[San88] 57. Sanger, F. Sequences, sequences, and sequences. Annu Rev Biochem 5, 1-28 (1988).

[Sar03] 58. Sarkis, G., et al. Sequence Analysis of the pAdEasy-1 Recombinant Adenoviral Construct Using the 454 Life Sciences Sequencing-by-Synthesis Method. NCBI AY370911 gi:34014919 (2003).

[Ste01] 60. Stephens, M., Smith, N. J. & Donnelly, P. A new statistical method for haplotype reconstruction from population data. Am J Hum Genet 6, 978-989 (2001).

[Ste02] 61. Steinmetz, L. M., et al. Dissecting the architecture of a quantitative trait locus in yeast. Nature 416, 326-30 (2002).

[Ste02B] 62. Stenger, D. A., Andreadis, J. D., Vora, G. J. & Pancrazio, J. J. Potential applications of DNA microarrays in biodefense-related diagnostics. Curr Opin Biotechnol 13, 208-12 (2002).

[Sor04] 63. Sorensen, K. J., Turteltaub, K., Vrankovich, G., Williams, J. & Christian, A. T. Whole-genome amplification of DNA from residual cells left by incidental contact. Anal Biochem 324, 312-4 (2004).

[Tho03] 64. Thomas, J. W., et al. Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424, 788-93 (2003).

[Ure03] 65. Ureta-Vidal, A., Ettwiller, L. & Birney, E. Comparative genomics: genome-wide analysis in metazoan eukaryotes. Nat Rev Genet 4, 251-62 (2003).

[Ven03] 66. Venter, J. C. A part of the human genome sequence. Science 2003, 1183-4 (2003).

[Vit03] 67. Vitkup, D., Sander, C. & Church, G. M. The spectrum of deleterious human mutations.

[Wan98] 68. Wang et al. Science 280, 1077 (1998).

[Web02] 69. Weber, G., Shendure, J., Tanenbaum, D. M., Church, G. M. & Meyerson, M. Microbial sequence identification by computational subtraction of the human transcriptome. Nat Genet 30, 141-2 (2002).

[Win03] 70. Winters-Hilt et al. Accurate classification of basepairs on termini of single DNA molecules. Biophys J 84, 967-76 (2003).


[Wit99] 71. Witelson, S. F., Kigar, D. L. & Harvey, T. Lancet 19, 2149-53 (1999).

[Yat01] 72. Yatabe, Y., Tavare, S. & Shibata, D. Investigating stem cells in human colon by using methylation patterns. Proc Natl Acad Sci U S A 9, 10839-44 (2001).

[Zhu03] 73. Zhu, J., Shendure, J., Mitra, R. D. & Church, G. M. Single Molecule Profiling of Alternative Pre-mRNA Splicing. Science 301, 836-8 (2003).

Acknowledgements. The authors thank members of the polony community and Christian Hennig for sharing unpublished work.

Competing interests statement. The authors declare that they have potential competing financial interests based on advisory roles and/or patent royalties from Affymetrix, Agencourt, Agilent, Caliper, Genovoxx, Helicos, Lynx and Pyrosequencing.