new generation sequencing technologies: an overview
Adapted version of my technical journal club presentation on the new sequencing technologies.
Sequencing technologies – the next generation
Paolo Dametto
30.08.2011
1953: Discovery of the structure of the DNA double helix
Nobel prize in Physiology or Medicine 1962
History of DNA sequencing
1953 Discovery of the structure of the DNA double helix
1972 Development of recombinant DNA technology, which permits isolation of defined fragments of DNA; prior to this, the only accessible samples for sequencing were from bacteriophage or virus DNA.
1977 The first complete DNA genome to be sequenced is that of bacteriophage φX174
1977 Frederick Sanger publishes "DNA sequencing with chain-terminating inhibitors"
1984 Medical Research Council scientists decipher the complete DNA sequence of the Epstein-Barr virus, 170 kb.
1987 Applied Biosystems markets first automated sequencing machine, the model ABI 370.
1990 The U.S. National Institutes of Health (NIH) begins large-scale sequencing trials on Mycoplasma capricolum, Escherichia coli, Caenorhabditis elegans, and Saccharomyces cerevisiae
1995 Craig Venter, Hamilton Smith, and colleagues at The Institute for Genomic Research (TIGR) publish the first complete genome of a free-living organism, the bacterium Haemophilus influenzae. The circular chromosome contains 1,830,137 bases and its publication in the journal Science marks the first use of whole-genome shotgun sequencing, eliminating the need for initial mapping efforts.
1996 Pål Nyrén and his student Mostafa Ronaghi at the Royal Institute of Technology in Stockholm publish their method of pyrosequencing
1998 Phil Green and Brent Ewing of the University of Washington publish "phred" for sequencer data analysis.
2001 A draft sequence of the human genome is published
2004 454 Life Sciences markets a parallelized version of pyrosequencing. The first version of their machine reduced sequencing costs 6-fold compared to automated Sanger sequencing, and was the second of a new generation of sequencing technologies, after MPSS.
Sanger sequencing: chain-terminating inhibitors
A breakthrough: fluorescent chain-terminating inhibitors
Automated DNA sequencer
• Capillary electrophoresis
• Costs reduced by 90%
• Human operation: 15 min/day/machine
• 1 million bp/day
3730xl DNA Analyzer
First generation DNA sequencer
• Manual preparation of acrylamide gels
• Manual loading of samples
• Contigs of 500-600 bp
• 2.4 million bp/year (1000 years needed to sequence the human genome)
ABI PRISM 377
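The "1000 years" estimate follows from a simple division of genome size by throughput. A minimal sketch, assuming a haploid human genome of roughly 3 billion bp (a figure that is not on the slide) and a single pass over the genome:

```python
# Assumption (not on the slide): haploid human genome of ~3 billion bp.
HUMAN_GENOME_BP = 3_000_000_000


def years_to_sequence(genome_bp: int, bp_per_year: float) -> float:
    """Years needed for one pass over the genome at a fixed throughput."""
    return genome_bp / bp_per_year


# Throughputs quoted on the slides, for one machine each.
first_generation = years_to_sequence(HUMAN_GENOME_BP, 2.4e6)    # 2.4 million bp/year
capillary = years_to_sequence(HUMAN_GENOME_BP, 1e6 * 365)       # 1 million bp/day

print(f"first-generation sequencer: {first_generation:.0f} years")
print(f"capillary sequencer:        {capillary:.1f} years")
```

With these inputs the first-generation machine needs on the order of a thousand years, while the capillary instrument needs less than a decade per pass.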
Next-generation sequencing (NGS): newer methods for DNA sequencing
The potential of NGS technologies is akin to the early days of PCR, with one’s imagination being the primary limitation of its use (Metzker ML, 2010, Nature Reviews Genetics)
NGS platforms produce an enormous volume of data cheaply, which expands the realm of experimentation beyond just determining the order of bases:
• gene-expression studies (RNA-seq)
• identification of rare transcripts without prior knowledge of a particular gene
• identification of alternative splicing
• large-scale comparative and evolutionary studies
• re-sequencing of human genomes to enhance our understanding of how genetic differences affect health and disease
The variety of NGS features makes it likely that multiple platforms will coexist in the marketplace, with some having clear advantages over others for particular applications
NGS platforms differ in template preparation, sequencing and imaging, and data analysis
Commercially available technologies:
• Roche/454
• Illumina/Solexa
• Helicos BioSciences
• Life/APG – SOLiD system
• Pacific Biosciences
• Ion Torrent technology
Experimental:
• Nanopore sequencing
NGS technologies overview
Roche/454 - Pyrosequencing
1. Emulsion-based sample preparation (emPCR)
Several thousand copies of the same template sequence on each bead
On average 1.6 million wells
2. Pyrosequencing: non-electrophoretic, bioluminescence method that measures the release of inorganic pyrophosphate by proportionally converting it into visible light using a series of enzymatic reactions
(DNA)n + dNTP → (DNA)n+1 + PPi (reaction catalysed by DNA polymerase)
Nucleotide incorporation generates light, seen as a peak in the Pyrogram trace
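How a Pyrogram trace turns into sequence can be sketched in a few lines. This is a toy model, not the 454 base-calling software: nucleotides are flowed one at a time in a fixed order, and the light signal of each flow is taken to be proportional to the number of bases incorporated. The flow order TACG and the function names are assumptions made for the example.

```python
from itertools import cycle

FLOW_ORDER = "TACG"  # assumed cyclic flow order, for illustration only


def simulate_flowgram(new_strand: str, flow_order: str = FLOW_ORDER) -> list:
    """Return (nucleotide, signal) for each flow.

    `new_strand` is the sequence of the strand being synthesized, so each flow
    incorporates as many bases as match the flowed nucleotide in a row.
    """
    unknown = set(new_strand) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    flows = []
    pos = 0
    for nucleotide in cycle(flow_order):
        if pos >= len(new_strand):
            break
        incorporated = 0
        while pos < len(new_strand) and new_strand[pos] == nucleotide:
            incorporated += 1
            pos += 1
        # Ideal signal: light output proportional to the number of PPi released.
        flows.append((nucleotide, incorporated))
    return flows


def call_bases(flows) -> str:
    """Round each signal to the nearest integer homopolymer length."""
    return "".join(nucleotide * round(signal) for nucleotide, signal in flows)


flows = simulate_flowgram("TTAGGGC")
print(flows)
print(call_bases(flows))
```

Because the call depends on rounding a signal to an integer, a noisy reading such as 6.4 for a run of seven identical bases would be called as six, which is the kind of homopolymer error this chemistry is prone to.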
Video http://www.youtube.com/watch?v=kYAGFrbGl6E
3. Imaging
Sequencing and de novo assembly of the Mycoplasma genitalium genome:
• 25 million bases in one four-hour run
• 96% coverage at 99.96% accuracy
• 100-fold increase in throughput over current Sanger sequencing
Most errors result from a broadening of the signal distribution, particularly for long homopolymers (seven or more bases), leading to ambiguous base calls
Future directions:
• increase in throughput by miniaturization of the fibre-optic reactors
• improvements to reduce cross-talk between adjacent wells
Over 1300 publications...
Applications:
• Whole genome sequencing
• Targeted resequencing
• Sequencing-based transcriptome analysis
• Metagenomics
Illumina/Solexa
1. Solid-phase amplification can produce 100-200 million spatially separated clusters, providing free ends to which a universal sequencing primer can be hybridized to initiate the NGS reaction
Sequencing by cyclic reversible termination (CRT): CRT uses reversible terminators in a cyclic method that comprises nucleotide incorporation, fluorescence imaging and cleavage
1. A DNA polymerase, bound to the primed template, incorporates just one fluorescently modified nucleotide
2. Unincorporated nucleotides are washed away and four-color images are acquired by total internal reflection fluorescence (TIRF) using two lasers
3. A cleavage step (TCEP, a reducing agent) removes the fluorescent dye and the terminating group, restoring the 3’-OH group
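The three-step cycle can be mimicked with a toy model. This is only a sketch: the function name `crt_read` and the convention that the template is written in the order the polymerase reads it are assumptions made for the example, and the chemistry is reduced to comments.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def crt_read(template: str, cycles: int) -> str:
    """Toy model of cyclic reversible termination on a single template."""
    read = []
    for position in range(min(cycles, len(template))):
        # 1. Incorporation: the terminating group lets the polymerase add
        #    exactly one nucleotide, complementary to the template base.
        incorporated = COMPLEMENT[template[position]]
        # 2. Imaging: the dye colour identifies the incorporated nucleotide.
        read.append(incorporated)
        # 3. Cleavage: dye and terminating group are removed, so the next
        #    cycle can extend the strand by one more base.
    return "".join(read)


print(crt_read("AAAC", 3))
```

In this model the read length equals the number of cycles, and a homopolymer is read one base per cycle rather than as a single, larger signal.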
3. Imaging
Paired reads are very powerful in all areas of the analysis because they provide very accurate read alignment and thus improve the accuracy and coverage of consensus sequence and SNP calling
Video http://www.youtube.com/watch?v=77r5p8IBwJk
Applications:
• DNA sequencing
• Gene regulation analysis
• Sequencing-based transcriptome analysis
• SNP and SV discovery
• Cytogenetic analysis
• ChIP-sequencing
• Small RNA discovery analysis
A whole human genome sequence was determined in 8 weeks to an average depth of ~40X, discovering ~4 million SNPs and ~400,000 SVs, many of them previously unknown (with an error rate of <1% for both over-calls and under-calls)
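Average depth is the total number of sequenced bases divided by the genome size. The sketch below turns that relation around to estimate how many reads a given depth requires. The ~3 Gbp genome size and the 35-bp read length are assumptions for illustration and are not taken from the slide.

```python
import math


def mean_depth(n_reads: int, read_length_bp: int, genome_bp: int) -> float:
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_bp


def reads_needed(depth: float, read_length_bp: int, genome_bp: int) -> int:
    """Smallest number of reads that reaches the requested average depth."""
    return math.ceil(depth * genome_bp / read_length_bp)


GENOME_BP = 3_000_000_000   # assumed human genome size
READ_LENGTH_BP = 35         # assumed short-read length

n = reads_needed(40, READ_LENGTH_BP, GENOME_BP)
print(f"{n:,} reads of {READ_LENGTH_BP} bp for ~40X")
print(f"check: {mean_depth(n, READ_LENGTH_BP, GENOME_BP):.2f}X")
```

The calculation ignores unmappable and duplicate reads, so a real run needs more raw reads than this lower bound.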
Whole human genome sequencing could become a clinical tool in the near future: it can help unravel the complexities of human variation in cancer and other diseases, paving the way for the use of personal genome sequences in medicine and healthcare
1861 publications...
Helicos BioSciences
The use of PCR is problematic for two reasons:
1. PCR introduces an uncontrolled bias in template representation because its efficiency varies as a function of template properties
2. PCR introduces errors (generating false-positive SNPs)
Single-molecule sequencing has been developed to circumvent these problems
1. Template preparation: one-pass sequencing
The library preparation process is simple and fast and does not require the use of PCR. It results in single-stranded poly(dA)-tailed templates
Poly(dT) oligonucleotides are covalently anchored to a glass cover slip at random positions, and they are used to capture the template strands and as primers for sequencing
Each cycle consists of:
1. adding the polymerase and one of the labeled nucleotides
2. rinsing, imaging of multiple positions
3. cleavage of the dye labels
224 cycles were performed to sequence the genome of the M13 virus to an average depth of >150X with 100% coverage
2. Sequencing
3. Imaging
The system showed higher error rates compared to the previous platforms, mostly due to multiple incorporations in the presence of homopolymers
The two-pass sequencing improved the overall quality
Template preparation: two-pass sequencing
ChIP-seq: Goren, A et al. (2010). Chromatin profiling by directly sequencing small quantities of immunoprecipitated DNA. Nat Methods 7, 47-49.
Methyl-seq: Pastor, WA et al. (2011). Genome-wide mapping of 5-hydroxymethylcytosine in embryonic stem cells. Nature 473, 394-397.
Direct RNA sequencing: Ozsolak, F et al. (2010). Comprehensive polyadenylation site maps in yeast and human reveal pervasive alternative polyadenylation. Cell 143, 1018-1029.
cDNA-based DGE, RNA-Seq and small RNA sequencing: Ting, DT et al. (2011). Aberrant overexpression of satellite repeats in pancreatic and other epithelial cancers. Science 331, 593-596.
Lipson, D et al. (2009). Quantification of the yeast transcriptome by single-molecule sequencing. Nat Biotechnol 27, 652-658.
Video http://www.youtube.com/watch?v=TboL7wODBj4
Life/APG – SOLiD platform
Sequencing by ligation (SBL) uses another cyclic method that differs from CRT in its use of DNA ligase and two-base-encoded probes
Life/APG has commercialized their SBL platform called support oligonucleotide ligation detection (SOLiD)
Two-base-encoded probes: an oligonucleotide sequence in which two interrogation bases are associated with a particular dye (e.g. AA, CC, GG and TT are encoded with a blue dye). There are 16 possible combinations, and each dye is associated with 4 of them
"1,2-probes" indicates that the first and second nucleotides are the interrogation bases. The remaining bases consist of either degenerate or universal bases
A phosphorothiolate linkage is present between the fifth and sixth nucleotides of the probe sequence, which is then cleaved with silver ions.
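Two-base encoding can be illustrated with a short sketch. Only the blue group (AA, CC, GG, TT) comes from the slide; the assignment of the remaining twelve dinucleotides to three further colours follows the commonly described SOLiD scheme and should be read as an assumption here, as should the function names. Colours are written as the numbers 0 to 3, with 0 standing for blue.

```python
from collections import defaultdict
from itertools import product

BASES = "ACGT"


def color_of(first: str, second: str) -> int:
    """Colour of a dinucleotide: XOR of the base indices (0 = blue)."""
    return BASES.index(first) ^ BASES.index(second)


def encode(sequence: str) -> list:
    """Translate a base sequence into colour space (one colour per adjacent pair)."""
    return [color_of(a, b) for a, b in zip(sequence, sequence[1:])]


def decode(first_base: str, colors: list) -> str:
    """Recover the bases from a known first base and the colour calls."""
    bases = [first_base]
    for color in colors:
        bases.append(BASES[BASES.index(bases[-1]) ^ color])
    return "".join(bases)


# Group the 16 dinucleotides by colour: each colour covers exactly 4 of them.
groups = defaultdict(list)
for a, b in product(BASES, repeat=2):
    groups[color_of(a, b)].append(a + b)
for color in sorted(groups):
    print(color, groups[color])

colors = encode("ATGGC")
print(colors, decode("A", colors))
```

Decoding needs one known base, typically the last base of the adapter, because a colour on its own is compatible with four different dinucleotides.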
SOLiD sequencing chemistry
1. Emulsion-based sample preparation (emPCR)
2. Chemical crosslinking to an amino-coated glass surface
3. SBL protocol
Upon the annealing of a universal primer, a library of 1,2-probes is added. Ligation of complementary probes follows.
Four-color imaging
The ligated 1,2-probes are chemically cleaved with silver ions to generate a 5’-PO4 group
The SOLiD cycle is repeated 9 times
The extended primer is then stripped and four more ligation rounds are performed, each with ten ligation cycles
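Which template positions each ligation round reads can be worked out with a small sketch. It uses the numbers given above (a probe step of five bases, ten cycles per round, five rounds in total) and one assumption that is not stated on the slide: the primer of each new round is shifted by one base towards the adapter. Position 1 is taken to be the first template base after the adapter.

```python
from collections import Counter

PROBE_STEP = 5          # cleavage between the 5th and 6th probe nucleotide
CYCLES_PER_ROUND = 10   # first ligation plus nine repeats
ROUNDS = 5              # initial round plus four more after primer stripping


def interrogated_pairs(round_index: int) -> list:
    """Position pairs read by the 1,2-probes in one round (round_index from 0).

    Assumption: each new round starts one base closer to the adapter.
    """
    first = 1 - round_index
    return [
        (first + PROBE_STEP * cycle, first + 1 + PROBE_STEP * cycle)
        for cycle in range(CYCLES_PER_ROUND)
    ]


def interrogation_counts() -> Counter:
    """How many times each position is read across all rounds."""
    counts = Counter()
    for round_index in range(ROUNDS):
        for a, b in interrogated_pairs(round_index):
            counts[a] += 1
            counts[b] += 1
    return counts


counts = interrogation_counts()
for round_index in range(ROUNDS):
    print(f"round {round_index + 1}: {interrogated_pairs(round_index)[:3]} ...")
read_twice = sorted(p for p, n in counts.items() if p >= 1 and n == 2)
print(f"positions read twice: {read_twice[0]}..{read_twice[-1]}")
```

Under these assumptions each template position from 1 to 46 is interrogated in two independent ligation reactions.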
Video http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related
ChIP-seq: Chromatin immunoprecipitation sequencing (ChIP-Seq) on the SOLiD™ System. Publication: Nature Methods (2009)
Chromosome length influences replication-induced topological stress. Publication: Nature (2011)
Methyl-seq: Increased methylation variation in epigenetic domains across cancer types. Publication: Nature Genetics (2011)
Metagenomics: The carnivorous bladderwort (Utricularia, Lentibulariaceae): a system inflates. Publication: Journal of Experimental Botany (2010)
cDNA-based DGE, RNA-Seq and small RNA sequencing: Evolution of yeast noncoding RNAs reveals an alternative mechanism for widespread intron loss. Publication: Science (2010)
Pacific Biosciences
All the aforementioned methods use enzymatic activities and various termination approaches, leading to short sequence reads (max. 350 bp)
Real-time DNA sequencing aims to exploit the high catalytic rate and high processivity of DNA polymerase, using the enzyme as a real-time sequencing engine in order to obtain longer reads. To fully harness the intrinsic speed, fidelity and processivity of the DNA polymerase, several technical challenges must be met simultaneously:
The speed at which each polymerase synthesizes DNA exhibits stochastic fluctuation, so polymerases must be observed individually
A high nucleotide concentration is required, so the observation volume must be reduced to allow single-molecule detection
The DNA polymerase has to work with 100% fluorescently labeled dNTPs
A surface chemistry is required that retains the activity of the DNA polymerase and inhibits nonspecific adsorption of labeled dNTPs
Single Molecule Real Time (SMRT) DNA sequencing
The zero-mode waveguide (ZMW) design reduces the observation volume down to the zeptolitre range (10^-21 l), reducing the number of stray fluorescently labeled molecules that enter the detection layer for a given period
The residence time of phospholinked nucleotides in the active site is usually on the millisecond scale, which corresponds to a recorded fluorescence pulse
Video http://www.youtube.com/watch?v=_B_cUZ8hSYU
The initial read accuracy was estimated at 83% at 1X coverage. Common errors were insertions, deletions and mismatches. At 15X coverage, the authors demonstrated an accuracy of >99%
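Why coverage rescues a modest single-read accuracy can be sketched with a simple majority-vote model. This is a deliberate simplification and not the authors' consensus method: it assumes that each read calls a given base correctly with probability 0.83, that errors are independent between reads, and that the consensus is right only when a strict majority of reads is right.

```python
from math import comb


def majority_vote_accuracy(per_read_accuracy: float, depth: int) -> float:
    """Probability that more than half of `depth` independent reads are correct."""
    p = per_read_accuracy
    smallest_majority = depth // 2 + 1
    return sum(
        comb(depth, k) * p**k * (1 - p) ** (depth - k)
        for k in range(smallest_majority, depth + 1)
    )


for depth in (1, 5, 15):
    print(f"{depth:>2}X: {majority_vote_accuracy(0.83, depth):.4f}")
```

Even under this pessimistic rule, in which wrong reads are treated as if they all agreed with each other, the modelled consensus accuracy at 15X exceeds 99%.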
In 2009, Pacific Biosciences reported improvements to their platform. E. coli was sequenced at 38X, covering 99.3% of the genome, with an accuracy of >99.999% and an average read length of 964 bp
Comparison of next-generation sequencing platforms
NGS technologies and personal genomes
Human genome studies aim to catalogue SNPs and SVs and their association with phenotypic differences, with the eventual goal of personalized genomics for medical purposes (pharmacogenomics)
Somatic mutations associated with acute myeloid leukemia have been identified using Illumina/Solexa (Ley T.J. et al. 2008 Nature)
Elucidation of both allelic variants in a family with a recessive form of Charcot-Marie-Tooth disease using the SOLiD platform (Lupski J.R. et al. in press N.Engl.J.Med.)
The Cancer Genome Atlas aims at discovering SNPs and SVs associated with major cancers (The Cancer Genome Atlas Research Network, 2011 Nature)
Beijing Genomics Institute (BGI) is working on the "1000 Plant & Animal Reference Genomes Project", which aims to generate reference genomes for 1,000 economically and scientifically important plant and animal species. They use the Illumina/Solexa and SOLiD platforms
Sequencing services and the $1,000 genome
Illumina announced a personal genome sequencing service that provides 30-fold base coverage for the price of $48,000.
Complete Genomics offers a similar service with 40-fold coverage priced at $5,000. Its business model relies on a high customer volume. It uses a newly optimized SBL protocol based on combinatorial probe anchor ligation (cPAL). Reagents: $4,400
The greatest challenge for current technology developers consists in closing the gap between $10,000 and $1,000 for a single genome. The timetable for the $1,000 draft genome is difficult to predict
Nanopore sequencing?
Nanopore sequencing
The system uses the Staphylococcus aureus toxin α-hemolysin, a robust heptameric protein which normally forms holes in membranes.
DNA and RNA can be electrophoretically driven through a nanopore of suitable diameter (Kasianowicz J.J. et al 1996 PNAS)
Nanopore sequencing – how does it work?
When a small voltage (~100 mV) is imposed across a nanopore in a membrane separating two chambers containing aqueous electrolytes, the ionic current through the pore can be measured
Molecules going through the nanopore cause disruption in the ionic current, and by measuring the disruption molecules can be identified.
Lipid bilayer with high electrical resistance
Ionic current
Hemolysin
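The idea of detecting molecules through disruptions of the ionic current can be sketched as a simple threshold search. Everything numeric here is hypothetical: the current trace, the open-pore level of 100 (arbitrary units) and the 70% threshold are made up for illustration and do not come from any instrument.

```python
def find_blockades(trace, open_pore_current, threshold=0.7):
    """Return (start, end) index pairs, end exclusive, where the current
    drops below `threshold` times the open-pore current."""
    events = []
    start = None
    for index, current in enumerate(trace):
        blocked = current < threshold * open_pore_current
        if blocked and start is None:
            start = index                   # a blockade begins
        elif not blocked and start is not None:
            events.append((start, index))   # the blockade ends
            start = None
    if start is not None:                   # trace ends during a blockade
        events.append((start, len(trace)))
    return events


# Hypothetical trace: two dips below the open-pore level of 100.
trace = [100, 101, 99, 40, 42, 41, 100, 98, 55, 54, 100]
for start, end in find_blockades(trace, open_pore_current=100):
    level = sum(trace[start:end]) / (end - start)
    print(f"blockade at samples {start}-{end - 1}, mean current {level:.1f}")
```

In a real measurement the depth and duration of each blockade, rather than a single threshold, would be used to tell molecules apart.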
Nanopore – exonuclease sequencing
Exonuclease
Aminocyclodextrin adaptor
DNA to be sequenced
The DNA polymer passes through the nanopore itself
The nanopore is engineered to allow single-base resolution within the strand
A DNA polymerase, coupled with an α-hemolysin, synthesizes a new strand of DNA using as a template the polymer coming out of the pore
Video nanopore: http://www.youtube.com/watch?v=_rRrOT9gfpo&feature=related
Nanopore – strand sequencing
DNA Polymerase
Nanopore sequencing
Advantages:
• minimal sample preparation
• no requirement for polymerase or ligase
• potential for very long read lengths (> 10,000 – 50,000 nt)
• it might well achieve the $1,000 per mammalian genome goal
• the instrument is inexpensive
Challenges:
• to slow down DNA translocation from microseconds per base to milliseconds
• to reduce stochastic motion of the DNA molecule in transit, in order to increase the signal-to-noise ratio
• a stable support for the hemolysin heptamer
Ion Torrent technology
http://lifetech-it.hosted.jivesoftware.com/videos/1016