Post on 23-Sep-2020
Practical Course in Genome Bioinformatics
Day 1 - Friday 20th January 2017
Course Introduction
• Practical course in genome bioinformatics
• 5 credits
• alan.j.medlar@helsinki.fi
• Presents a genome project of a real biological organism with an emphasis on the practical aspects of the project
• Grading is based on 7 work reports returned after each of the course days
• http://ekhidna.biocenter.helsinki.fi/downloads/teaching/spring2017/
What is involved in a Genome Project?
• After experimental design, library construction and other wet lab preparations, from a bioinformatics perspective, a genome project involves:
• Sequencing and sequencing platforms
• De novo Assembly
• RNA-sequencing and Mapping
• Ab initio Gene Prediction
• Protein annotation
• Submission and publication of genome in database
• Further downstream analysis
Pre-history of DNA sequencing
• 1866: Haeckel suggests that the nucleus is somehow responsible for the transmission of heritable traits
• 1869: DNA, then called "nuclein", first isolated by Friedrich Miescher (largely ignored)
• 1870: Flemming identifies chromosomes (coins the terms "chromatin" and "mitosis")
• 1880-90s: Boveri suggests that chromosomes contain the genetic material and that different chromosomes carry different heritable traits
• Interim: Scientific consensus is that proteins contain genetic, heritable information
• 1944: Work by Avery, MacLeod, and McCarty demonstrates DNA to be fundamental to heredity
• 1953: Watson and Crick discover the structure of DNA using Franklin and Wilkins's X-ray crystallography research
• 1955: Protein sequence for Insulin determined (Sanger)
Pre-history of DNA sequencing
• Early 1970s: Enterobacteria phage λ was known to possess 5′ overhanging ‘cohesive’ ends; DNA polymerase was used to fill in these ends with radioactive nucleotides
• Later generalised, but limited to short sequences and very tedious (2D fractionation, analytical chemistry)
• Mid 1970s: Single separation by polynucleotide length using electrophoresis through polyacrylamide gels (Sanger and Coulson 1975, Maxam and Gilbert 1977)
• Sanger and colleagues sequenced the first DNA genome, bacteriophage φX174 (5,386 bp) (Sanger et al. 1977) - complex, but widely adopted
DNA Sequencing
• We refer to DNA sequencing methods as belonging to three generations
• First generation 1977: chain-termination ("Sanger") sequencing (monoclonal, later PCR)
• Second generation 2005: massively parallel sequencing (high-throughput, next-generation)
• Third generation 2010/11: single molecule sequencing
Sanger and Gilbert: Nobel Prize in Chemistry, 1980
Sanger Sequencing
• Read lengths ~900 bp
• Very high quality (used to verify NGS results)
Sanger Sequencing: Improvements
• Improvements to first generation sequencing enabled the process to be automated, for example:
• Phospho- or tritium-radiolabelling replaced with fluorometric-based detection (only 1 lane required vs. 4)
• Capillary-based electrophoresis for improved detection
• Development of PCR (Mullis)
• Automation enabled commercial development and made shotgun sequencing practical: overlapping fragments are sequenced and assembled into longer contiguous sequences (contigs)
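The idea of assembling overlapping fragments into contigs can be sketched with a toy greedy merger. This is only an illustration under strong assumptions (error-free reads, all from the same strand, no repeats); the function names and the `min_len` threshold are hypothetical, and real assemblers use far more robust methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest suffix-prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Three toy reads, each overlapping the next by four bases.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))  # ['ATGGCGTACCTGA']
```

With reads that share no overlap of at least `min_len` bases, the function returns them unmerged as separate contigs.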
Massively Parallel Sequencing
• "Sequencing-by-synthesis" (Sanger is also SBS, both require DNA polymerase to produce the observable output)
• Amplification of DNA by bridge PCR
• Detection with CCD camera
• Massive number of reads per run (MiSeq 20M, NextSeq 1G ...)
Sequencing-by-synthesis
• Prepare sample
• Randomly fragment genomic DNA
• Ligate adapters to both ends of the fragments
Adapted from Illumina Sequencing Technology documentation
Sequencing-by-synthesis
• Attach DNA to surface
• Bind single-stranded fragments randomly to the inside surface of the flow cell channels
Sequencing-by-synthesis
• Bridge amplification
• Add unlabeled nucleotides and enzyme to initiate solid-phase bridge amplification
Sequencing-by-synthesis
• Fragments become double stranded
• The enzyme incorporates nucleotides to build double-stranded bridges on the solid-phase substrate
Sequencing-by-synthesis
• Denature double-stranded molecules
• Denaturation leaves single-stranded templates anchored to the substrate
Sequencing-by-synthesis
• Complete amplification
• Several million dense clusters of double-stranded DNA are generated in each channel of the flow cell
Sequencing-by-synthesis
• Determine base
• Sequencing cycle begins by adding four labeled reversible terminators, primers, and DNA polymerase
Sequencing-by-synthesis
• Image base
• After laser excitation, the emitted fluorescence from each cluster is captured and the base is identified
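The imaging step can be pictured as follows: in every cycle each cluster emits fluorescence in four channels, and the base is taken from the brightest one. The sketch below is a simplification, not Illumina's actual base caller; the intensity values are made up, and effects such as channel cross-talk and phasing are ignored.

```python
BASES = "ACGT"  # assumed channel order: A, C, G, T


def call_bases(intensities: list[tuple[float, float, float, float]]) -> str:
    """Call one base per cycle by picking the brightest of the four channels."""
    read = []
    for cycle in intensities:
        brightest = max(range(4), key=lambda ch: cycle[ch])
        read.append(BASES[brightest])
    return "".join(read)


# Hypothetical intensities for a single cluster over four sequencing cycles.
cluster = [
    (0.91, 0.03, 0.04, 0.02),
    (0.05, 0.02, 0.88, 0.05),
    (0.02, 0.04, 0.03, 0.90),
    (0.10, 0.75, 0.08, 0.07),
]
print(call_bases(cluster))  # AGTC
```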
Single Molecule Sequencing
• Far longer read lengths (up to several hundred Kb!)
• PacBio: eavesdrops on a single DNA polymerase molecule contained in a zero-mode waveguide (ZMW)
• Oxford Nanopore: measures the current through a nanopore to determine the basepair or k-mer currently in the pore
Commercial Sequencing Platforms
• As of 2017, there are several options: Illumina, PacBio, IonTorrent, Oxford Nanopore, Roche 454 (obsolete, but still around)
• Important metrics from bioinformatics perspective:
• Average read length (basepairs)
• Total sequence output (basepairs per run)
• Error profile (average accuracy and platform-specific biases)
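As a rough illustration, the three metrics above can be computed from a FASTQ file. The sketch assumes a simple four-lines-per-record FASTQ with Phred+33 quality encoding, where a quality score Q corresponds to an error probability of 10^(-Q/10). The toy data, the function names and the `genome_size` value are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (sequence, quality string) pairs from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield seq, qual


def summarise(handle, genome_size: int) -> dict:
    """Read count, total bases, mean read length, mean error rate and coverage."""
    n_reads = 0
    total_bases = 0
    expected_errors = 0.0
    for seq, qual in read_fastq(handle):
        n_reads += 1
        total_bases += len(seq)
        # Phred+33: Q = ord(char) - 33, error probability = 10^(-Q/10)
        expected_errors += sum(10 ** (-(ord(c) - 33) / 10) for c in qual)
    if total_bases == 0:
        raise ValueError("no sequence data found")
    return {
        "reads": n_reads,
        "total_bases": total_bases,
        "mean_read_length": total_bases / n_reads,
        "mean_error_rate": expected_errors / total_bases,
        "coverage": total_bases / genome_size,
    }


# Toy data: one 8 bp read at Q40 ('I') and one 4 bp read at Q20 ('5').
toy_fastq = "@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGT\n+\n5555\n"
stats = summarise(io.StringIO(toy_fastq), genome_size=6)
print(stats)
```

Coverage here is simply total sequenced bases divided by genome size, which is why total output per run matters when planning a project.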
NGS Platforms: Illumina
• Read lengths (100 - 300 bp) and total sequence per run dependent on platform
• Error rate: <1%
• Errors tend to be substitutions, biased towards 3' end
Illumina MiSeq Error Profile
Schirmer et al. Nucl. Acids Res. 2015
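Because quality tends to drop towards the 3' end, reads are often trimmed there before further analysis. The following is a deliberately naive sketch, assuming Phred+33 qualities and a hypothetical threshold of Q20; established trimming tools use more careful criteria such as sliding windows.

```python
def trim_3prime(seq: str, qual: str, min_q: int = 20) -> tuple[str, str]:
    """Drop trailing bases whose Phred+33 quality is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - 33 < min_q:
        end -= 1
    return seq[:end], qual[:end]


# The last two bases have qualities Q10 ('+') and Q2 ('#') and are removed.
print(trim_3prime("ACGTACGT", "IIIII5+#"))  # ('ACGTAC', 'IIIII5')
```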
NGS Platforms: Pacific Biosciences
• Variable length long reads
• Error rate: 11-15% (CLR) or <1% (CCS)
• Errors stochastically distributed
PacBio RS II read length distribution using P6-C4 chemistry
Rhoads and Au, Genomics, Proteomics & Bioinformatics, 2015
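The gap between the CLR and CCS error rates comes from reading the same molecule several times and taking a consensus. A minimal sketch of that idea follows, assuming the passes are already aligned to equal length and that errors are independent substitutions; in reality PacBio errors are dominated by indels, so this is an idealisation. The function names are hypothetical.

```python
from collections import Counter
from math import comb


def majority_consensus(passes: list[str]) -> str:
    """Column-wise majority vote over equal-length, pre-aligned passes."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("passes must be non-empty and aligned to the same length")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*passes))


def p_majority_wrong(p: float, n: int) -> float:
    """Probability that more than half of n independent passes are wrong at one position."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))


# Five passes over the same toy molecule, three of them with one error each.
passes = ["ACGTAC", "ACCTAC", "ACGTAA", "TCGTAC", "ACGTAC"]
print(majority_consensus(passes))  # ACGTAC

# Per-position bound on consensus error for a 13% single-pass error rate.
for n in (1, 3, 5, 9):
    print(n, p_majority_wrong(0.13, n))
```

Under these assumptions the bound falls below 1% by nine passes, which gives an intuition for why the consensus is so much more accurate than any single pass.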
NGS Platforms: Others
• IonTorrent: ~400 bp single-end reads, 80 M reads, 98% accuracy, homopolymer errors
• Roche 454: 700 bp single-end reads, 1 M reads, 99% accuracy, homopolymer errors
• SOLiD: 50+(35/50) bp paired-end reads, ~1.4 M reads, 99.9% accuracy, palindrome errors, AT bias
• Oxford Nanopore: up to 200 Kb reads, ~1.5 Gb sequence, 88% accuracy, indel errors
Gigabases per run vs. Read length
Genomic Data Visualisation
• Why is visualisation important?
• Provides an overview, makes it easier to spot errors
• Communicates work to collaborators
• Publication figures
• Crucial at all steps of a project!
DNAplotter - genome visualisation
DNAPlotter: circular and linear interactive genome visualization. Carver et al. Bioinformatics, 2008
Mauve - multiple genome alignment
Mauve: Multiple Alignment of Conserved Genomic Sequence With Rearrangements. Darling et al. Genome Research, 2004
UCSC Genome Browser
The human genome browser at UCSC. Kent et al. Genome Research, 2002
Computer Exercises
• Today:
• Visualising a bacterial genome with DNAplotter
• Aligning multiple E. coli genomes with Mauve
• Using UCSC Genome Browser utilities
• Exercises: http://ekhidna.biocenter.helsinki.fi/downloads/teaching/spring2017/Exercises_day1.pdf