Practical Course in Genome Bioinformatics

Day 1 - Friday 20th January 2017

Course Introduction

• Practical course in genome bioinformatics

• 5 credits

• alan.j.medlar@helsinki.fi

• Presents a genome project of a real biological organism with an emphasis on the practical aspects of the project

• Grading is based on 7 work reports returned after each of the course days

• http://ekhidna.biocenter.helsinki.fi/downloads/teaching/spring2017/

What is involved in a Genome Project?

• After experimental design, library construction and other wet lab preparations, from a bioinformatics perspective, a genome project involves:

• Sequencing and sequencing platforms

• De novo Assembly

• RNA-sequencing and Mapping

• Ab initio Gene Prediction

• Protein annotation

• Submission and publication of genome in database

• Further downstream analysis

Pre-history of DNA sequencing

• 1866: Haeckel suggests that the nucleus is somehow responsible for the transmission of heritable traits

• 1869: DNA, then called "nuclein", first isolated by Friedrich Miescher (largely ignored)

• 1870: Flemming identifies chromosomes (coins the terms "chromatin" and "mitosis")

• 1880-90s: Boveri suggests that chromosomes contain genetic material and that different chromosomes carry different heritable traits

• Interim: Scientific consensus is that proteins contain genetic, heritable information

• 1944: Work by Avery, MacLeod, and McCarty demonstrates DNA to be fundamental to heredity

• 1953: Watson and Crick discover the structure of DNA using Franklin's and Wilkins's X-ray crystallography research

• 1955: Protein sequence for Insulin determined (Sanger)

Pre-history of DNA sequencing

• Early 1970s: The 5′ overhanging ‘cohesive’ ends of Enterobacteria phage λ are sequenced by using DNA polymerase to fill in the ends with radioactive nucleotides

• Later generalised, but limited to short sequences, very tedious (2D fractionation, analytical chemistry)

• Mid 1970s: Single separation by polynucleotide length using electrophoresis through polyacrylamide gels (Coulson and Sanger 1975, Maxam and Gilbert 1977)

• Sanger and colleagues sequenced the first DNA genome, bacteriophage φX174 (5,386 bp) (Sanger et al. 1977) - complex, but widely adopted

DNA Sequencing

• We refer to DNA sequencing methods as belonging to three generations

• First generation 1977: chain-termination ("Sanger") sequencing (monoclonal, later PCR)

• Second generation 2005: parallel sequencing (high-throughput, next-generation)

• Third generation 2010/11: single molecule sequencing

Sanger and Gilbert Nobel prize in Chemistry 1980

Sanger Sequencing

• Read lengths ~900 bp

• Very high quality (used to verify NGS results)

Sanger Sequencing: Improvements

• Improvements to first generation sequencing enabled the process to be automated, for example:

• Phospho- or tritium-radiolabelling replaced with fluorometric based detection (only 1 lane required vs. 4)

• Capillary-based electrophoresis for improved detection

• Development of PCR (Mullis)

• Automation enabled commercial development and made shotgun sequencing practical: overlapping fragments are sequenced and then assembled into longer contiguous sequences (contigs)
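The idea behind shotgun assembly can be illustrated with a toy example. The sketch below is not the algorithm used by any real assembler; it is a minimal greedy merge that assumes error-free fragments, all from the same strand, with unambiguous overlaps. The function names and the `min_len` parameter are hypothetical, chosen only for this illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the longest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping fragments (made-up sequence) collapse into one contig.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Real genomes contain repeats and reads contain errors, so production assemblers use more robust graph-based approaches; the sketch only shows why overlapping fragments are enough to reconstruct a longer sequence.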

DNA Sequencing

• We refer to genome sequencing methods as belonging to three generations

• Second generation 2005: massively parallel sequencing (high-throughput, next-generation)

Massively Parallel Sequencing

• "Sequencing-by-synthesis" (Sanger is also SBS; both require DNA polymerase to produce the observable output)

• Amplification of DNA by bridge PCR

• Detection with CCD camera

• Massive number of reads per run (MiSeq 20M, NextSeq 1G ...)

Sequencing-by-synthesis

• Prepare sample

• Randomly fragment genomic DNA

• Ligate adapters to both ends of the fragments

Adapted from Illumina Sequencing Technology documentation

• Attach DNA to surface

• Bind single-stranded fragments randomly to the inside surface of the flow cell channels

• Bridge amplification

• Add unlabeled nucleotides and enzyme to initiate solid-phase bridge amplification

• Fragments become double stranded

• The enzyme incorporates nucleotides to build double-stranded bridges on the solid-phase substrate

• Denature double-stranded molecules

• Denaturation leaves single-stranded templates anchored to the substrate

• Complete amplification

• Several million dense clusters of double-stranded DNA are generated in each channel of the flow cell

• Determine base

• Sequencing cycle begins by adding four labeled reversible terminators, primers, and DNA polymerase

• Image base

• After laser excitation, the emitted fluorescence from each cluster is captured and the base is identified
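The imaging step can be sketched as a simple per-cycle decision: for each cluster, the base whose fluorescence channel is brightest is called. The example below is a deliberately simplified illustration. The channel order (A, C, G, T), the intensity values, and the function name `call_bases` are assumptions made for this sketch; real base callers additionally correct for effects such as channel cross-talk and loss of synchrony within a cluster.

```python
BASES = "ACGT"  # assumed channel order for this illustration


def call_bases(intensities):
    """Call one base per sequencing cycle for a single cluster.

    intensities: one 4-tuple of channel intensities per cycle,
    ordered as in BASES. The brightest channel determines the base.
    """
    read = []
    for cycle in intensities:
        strongest = max(range(len(BASES)), key=lambda ch: cycle[ch])
        read.append(BASES[strongest])
    return "".join(read)


# Made-up intensities for three cycles of one cluster.
cluster = [
    (0.90, 0.05, 0.03, 0.02),  # brightest channel: A
    (0.10, 0.10, 0.70, 0.10),  # brightest channel: G
    (0.00, 0.10, 0.10, 0.80),  # brightest channel: T
]
print(call_bases(cluster))
```

When two channels have similar intensities the call becomes uncertain, which is one way to think about why some base calls are less reliable than others.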

DNA Sequencing

• We refer to genome sequencing methods as belonging to three generations

• Third generation 2010/11: single molecule sequencing

Single Molecule Sequencing

• Far longer read lengths (up to several hundred kb!)

• PacBio: eavesdrops on a single DNA polymerase molecule confined in a zero-mode waveguide (ZMW)

• Oxford Nanopore: measures the current through a nanopore to determine the base or k-mer currently in the pore

Commercial Sequencing Platforms

• As of 2017, there are several options: Illumina, PacBio, IonTorrent, Oxford Nanopore, Roche 454 (obsolete, but still around)

• Important metrics from bioinformatics perspective:

• Average read length (basepairs)

• Total sequence output (basepairs per run)

• Error profile (average accuracy and platform-specific biases)
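These metrics are easy to compute once reads are available as strings. The sketch below is a minimal illustration, not a replacement for a proper quality-control tool. The function names are hypothetical, and the error estimate assumes that the true sequence of a read is known and that only substitutions occurred (no insertions or deletions).

```python
from statistics import mean


def read_metrics(reads):
    """Summarise a collection of reads: count, mean length, total bases."""
    lengths = [len(r) for r in reads]
    return {
        "n_reads": len(reads),
        "mean_length": mean(lengths),
        "total_bases": sum(lengths),
    }


def substitution_rate(read, truth):
    """Fraction of positions where the read differs from its true sequence.

    Assumes read and truth are already aligned and of equal length.
    """
    if len(read) != len(truth):
        raise ValueError("read and truth must have the same length")
    return sum(a != b for a, b in zip(read, truth)) / len(truth)


reads = ["ACGT", "ACGTAC", "AC"]  # made-up reads
print(read_metrics(reads))
print(substitution_rate("ACGA", "ACGT"))
```

Average accuracy alone hides platform-specific biases, so in practice the error rate is usually also examined per position in the read and per error type.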

NGS Platforms: Illumina

• Read lengths (100 - 300 bp) and total sequence per run dependent on platform

• Error rate: <1%

• Errors tend to be substitutions, biased towards 3' end

Illumina MiSeq Error Profile

Schirmer et al. Nucl. Acids Res. 2015

NGS Platforms: Pacific Biosciences

• Variable length long reads

• Error rate: 11-15% (CLR) or <1% (CCS)

• Errors stochastically distributed
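Because the errors are randomly distributed rather than systematic, reading the same molecule several times and taking a per-position majority vote removes most of them, which is the intuition behind the gap between the CLR and CCS error rates. The simulation below is a simplified sketch under stated assumptions: it models substitutions only (so no alignment is needed), uses an error rate of 13% and 9 passes as illustrative values, and all function names are hypothetical.

```python
import random
from collections import Counter


def noisy_copy(seq, error_rate, rng):
    """Copy seq, substituting each base independently with probability error_rate."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def consensus(passes):
    """Majority vote at each position across equal-length passes."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*passes))


def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)  # fixed seed so the run is repeatable
truth = "".join(rng.choice("ACGT") for _ in range(500))
passes = [noisy_copy(truth, 0.13, rng) for _ in range(9)]
print("errors in a single pass:", mismatches(passes[0], truth))
print("errors in the consensus:", mismatches(consensus(passes), truth))
```

The same trick would not help against systematic errors, because every pass would then tend to make the same mistake at the same position.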

PacBio RS II read length distribution using P6-C4 chemistry

Rhoads and Au, Genomics, Proteomics & Bioinformatics, 2015

NGS Platforms: Others

• IonTorrent: ~400 bp single-end reads, 80 M reads, 98% accuracy, homopolymer errors

• Roche 454: 700 bp single-end reads, 1 M reads, 99% accuracy, homopolymer errors

• SOLiD: 50+(35/50) bp paired-end reads, ~1.4 M reads, 99.9% accuracy, palindrome errors, AT bias

• Oxford Nanopore: up to 200 Kb reads, ~1.5 Gb sequence, 88% accuracy, indel errors

Gigabases per run vs. Read length

Genomic Data Visualisation

• Why is visualisation important?

• Provides an overview, makes it easier to spot errors

• Communicates work to collaborators

• Publication figures

• Crucial at all steps of a project!

DNAplotter - genome visualisation

DNAPlotter: circular and linear interactive genome visualization. Carver et al. Bioinformatics, 2008

Mauve - multiple genome alignment

Mauve: Multiple Alignment of Conserved Genomic Sequence With Rearrangements. Darling et al. Genome Research, 2004

UCSC Genome Browser

The human genome browser at UCSC. Kent et al. Genome Research, 2002

Computer Exercises

• Today:

• Visualising a bacterial genome with DNAplotter

• Aligning multiple E. coli genomes with Mauve

• Using UCSC Genome Browser utilities

• Exercises: http://ekhidna.biocenter.helsinki.fi/downloads/teaching/spring2017/Exercises_day1.pdf
