mgm workshop assembly tutorial alicia clum doe joint genome institute, walnut creek, ca

MGM Workshop

Assembly Tutorial

Alicia ClumDOE Joint Genome Institute,

Walnut Creek, CA

May 14, 2012

1. Vocabulary introduction

2. Introduction to short-read genome sequencing and assembly

3. Practical experience of short read genome assembly

4. Improving genome assembly using 3rd generation sequencing

ContentsContents

Vocabulary

• Fragment library: a short insert (270bp) library with overlapping ends. Aka std library

• Long insert library: A 4-8kb library where only 100 bp on each end are sequenced. Aka CLIP, mate pair library

• Contig: A contiguous sequence of DNA • Scaffold: One or more contigs linked

together by unknown sequence• Captured gap: A gap within a scaffold. The

order and orientation of the contigs spanning the gap is known

A B C D E

• Short read sequencing and assembly basics• Short read assembly - De Bruijn graph example• Short read assembly – Scaffolding

ContentsContents

150bp450bp

$15 $0.11,000

20,000

Traditional genome sequencing technology

Short-read

$$! - We have to figure out how to sequence microbial genomes using only illumina data

Why sequence genomes using Why sequence genomes using short reads?short reads?

Short read genome sequencingShort read genome sequencing

How do we assemble this data back into a genome?

GenomicDNA

270 bpfragments

Random fragmentation

4-8 kbfragments

Paired-end long insert

reads(10’s millions)

Paired-end short insert

reads(10’s millions)

molecular biology

Sequencing(Illumina)

Assembly outlineAssembly outline

Assembly algorithms

e.g. Allpaths, Velvet,

Meraculous

Contigs

Scaffolds

Assembly outlineAssembly outline

Contigs

Scaffolds

‘De Bruijn’ assembly

ContentsContents

De Bruijn exampleDe Bruijn example

“It was the best of times, it was the worst of times, it was the

age of wisdom, it was the age of foolishness, it was the

epoch of belief, it was the epoch of incredulity,.... “

Dickens, Charles. A Tale of Two Cities. 1859. London: Chapman Hall

Example courtesy of J. Leipzig 2010

itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness…

Generate random ‘reads’ How do we assemble?

fincreduli geoffoolis Itwasthebe Itwasthebe geofwisdom itwastheep epochofinc timesitwas stheepocho nessitwast wastheageo theepochof

stheepocho hofincredu estoftimes eoffoolish lishnessit hofbeliefi pochofincr itwasthewo twastheage toftimesit domitwasth ochofbelie

eepochofbe eepochofbe astheworst chofincred theageofwi iefitwasth ssitwasthe astheepoch efitwasthe wisdomitwa ageoffooli twasthewor

ochofbelie sdomitwast sitwasthea eepochofbe ffoolishne eofwisdomi hebestofti stheageoff twastheepo eworstofti stoftimesi theepochof

esitwasthe heepochofi theepochof sdomitwast astheworst rstoftimes worstoftim stheepocho geoffoolis ffoolishne timesitwas lishnessit

stheageoff eworstofti orstoftime fwisdomitw wastheageo heageofwis incredulit ishnessitw twastheepo wasthewors astheepoch heworstoft

ofbeliefit wastheageo heepochofi pochofincr heageofwis stheageofw fincreduli astheageof wisdomitwa wastheageo astheepoch olishnessi

astheepoch itwastheep twastheage wisdomitwa fbeliefitw bestoftime epochofbel theepochof sthebestof lishnessit hofbeliefi Itwasthebe

ishnessitw sitwasthew ageofwisdo twastheage esitwasthe twastheage shnessitwa fincreduli fbeliefitw theepochof mesitwasth domitwasth

ochofbelie heageofwis oftimesitw stheepocho bestoftime twastheage foolishnes ftimesitwa thebestoft itwastheag theepochof itwasthewo

ofbeliefit bestoftime mitwasthea imesitwast timesitwas orstoftime estoftimes twasthebes stoftimesi sdomitwast wisdomitwa theworstof

astheworst sitwasthew theageoffo eepochofbe

…etc. to 10’s of millions of reads

De Bruijn solution: Represent the data as a graph (scales with genome size)

Traditional all-vs-all assemblers fail due to immense computational resources (scales with number of reads2)A million (106 ) reads requires a trillion (1012) pairwise alignments

De Bruijn exampleDe Bruijn exampleStep 1: Convert reads into “Kmers”

Reads: theageofwi

sthebestof

astheageof

worstoftim

imesitwast

…..etc for all reads in the dataset

Kmers :(k=3)

Kmer: a substring of defined length

Step 2: Build a De-Bruijn graph from the kmers

age geo eof ofw fwihea eagthesth the

heb ebe bes est sto tof

ast sththe hea eag age geo eof

wor ors rststo tof

oft fti tim

ime mes

esisititwtwa

…..etc for all ‘kmers’ in the dataset

Step 3: Simplify the graph as much as possible:

A De Bruijn Graph

“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of

foolishness, it was the epoch of belief, it was the epoch of incredulity,.... “

De Bruijn assemblies ‘broken’ by repeats longer than kmer

No single solution!

Drawback of De Bruijn approach

Break graph to produce final assembly

Step 4: Dump graph into consensus (fasta)

Kmer size is an important parameter in Kmer size is an important parameter in De Bruijn assemblyDe Bruijn assembly

The final assembly (k=3)

wor times itwasthe foolishness

incredulity age epoch be

st wisdom

of belief

A better assembly (k=20)

itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis…

Repeat with a longer “kmer” length

Why not always use longest ‘k’ possible?

Sequencing errors:

sthebentof

sth theheb

ebeben

entnto

sthebentof

k=10100% wrong kmer

Mostly unaffected

ContentsContents

ScaffoldingScaffolding

Contigs

Scaffolds

(An assembly)

Reads‘De Bruijn’ assembly

“Captured” gaps caused by repeats. Represented by “NNN” in

assembly

Join contigs using evidence from paired end data

Align reads to DeBruijn contigs

ContentsContents

Biased coverage (->gaps)

Assembly in reality

Real life assembly is messy!Real life assembly is messy!

Assembly in theory

Uniform coverage, no errors, no contamination

Contaminant reads(-> incorrect +

inflated assembly)

Chimeric reads (->mis-joins)Sequencing errors

(-> fragmented assembly)

** ***

*Worse than predicted assemblies!

Real life assembly is messy!Real life assembly is messy!

Theoretical

GC% of 100 base windowsF

Reference position (bp)

Genome properties can also Genome properties can also make assembly difficultmake assembly difficult

Biased sequence composition

RESULT: incomplete / fragmented assembly

ACTGTCTAGTCAGCGCGCGCGCGCGCGCCCGCGCGCGCGGGCGGCGGCGCGGGCGGGCGCATGTAGTGAT

High repeat content

RESULT: misassemblies /

collapsed assemblies

Polyploidy

RESULT: fragmented assembly

a a’

Biased sequence abundance

RESULT: Incomplete / fragmented assembly

How do we get a good assembly?

Key features of the JGI microbe Key features of the JGI microbe sequencing and assembly pipelinesequencing and assembly pipeline

GenomicDNA

270bp 4-8 kb

Longer insertsImproved scaffolding

2 x 150bp

Extensive data QC:-Remove artifacts-Remove contaminants-Reorder libraries

AllPaths LG assemblerMost complete + most accurate assembler in

our handsInternal error

correction

QCAssembler

2 x 100 bp

Best Assemblies

Fragment

SequenceOverlapping paired-end reads for short

insert libraryAllows for long ‘kmer’

Illumina V3 sequencing chemistry

Reduced GC bias

Benefits of ALLPATHS

• Internal error correction of all data types• Simplifies the graph by removing kmers at low coverage

caused by errors

• Given the correct input data, a good assembly can be produced with default options

• Overlaps the fragment data to use a longer kmer size• Produces accurate and highly contiguous assemblies

velvet assembly results for error corrected standard data

allpaths-lg control-no error correction msr-ca

error correction method

4084251

4092796

Simulated De Bruijn assembly for six ‘known’microbial genomes

Kmer =30, most of the genome CAN be assembled

Kmer = 150A small fraction of the genome remains ‘unassemblable’

What fraction of a genome should we be able to assemble, that is, can be represented in unique kmers?

Matt Blow

0 50 100 150

B. Murdochii

C. Flavigena

C. Woesei

H. Turkmenica

S. Smaragdinae

Number of scaffolds

Short insert library only

Assemblies from simulated data

How fragmented are short read assemblies?How fragmented are short read assemblies?

Assemblies using only short insert sequencing libraries are acceptable (<100 scaffolds)

Goal: genome in 1 fragment per replicon

(7 replicons)

Assemblies from real data

How good are short read assemblies?How good are short read assemblies?

Number of scaffolds

Assemblies using only short insert sequencing libraries are acceptable (<100 scaffolds)

(7 replicons)

How good are short read assemblies?How good are short read assemblies?

Short + long insert

libraries

Assemblies using short and long insert sequencing libraries are very good (often a single fragment!)

Number of scaffolds

Assemblies from real data

(7 replicons)

Average number of fragments

Average % known genes

identified

85 97.3%

3 97.4%

Short + long insert libraries

Comparison of assembly results to reference genome annotation

Near complete and accurate assemblies are now possible using only short read data.

But problems remain:But problems remain:

Contigs

An assembly

Reads‘De Bruijn’ assembly

“Captured” gaps caused by repeats. Represented by “NNN” in

assembly

Join contigs using evidence from paired end data

Align reads to DeBruijn contigs

Contents

Long reads from “3rd generation” Pacific Biosciences sequencer hold promise for improving short-read based assemblies

Maximum read length >4kb

1 32 4 5 6 7Read length (Kb)

10,00012,000

14,000Mean read length =

1080bp

Pacific Biosciences Sequencer

“Early” data

Low coverage biasReduced sensitivity to G+C rich regions compared to illumina chemistry

High error rateUp to 15% error rate In-del errors

Late April 2012

Mean read length = 2743 bp

GC% of 100 base windowsFra

For each of 20 Microbes:

2 x 150bp270bp insert

“1000”bp

PacBio pilot study

2 x 100bp8kb insert

(Illumina HiSeq, V3 chemistry, 2x150b)

Genomic DNA

(PacBio)

Assembler: AllPaths V39750 (PacBio enabled)

Assembly 1

Assembly 2

3rd Generation sequence data improves genome assembly

Least improved genomes (.. but started out in good shape)

Assembled using:Most improved genome: 53 / 71 (75%) gaps closed Illumina only

Illumina + PacBio

PacBio assembly improvement is greatest for genomes with high GC content

MOST improved genomesaverage 66% G+C

LEAST improved genomesaverage 56% G+C

Conclusion:

3rd generation sequencing is a promising tool for better assemblies, especially for G+C rich genomes.

SummarySummary

• High quality genome sequencing using only short-reads is within reach

• Short-read microbial genomes assemblies are minimally fragmented and contain the vast majority known genes

• Third-generation sequencing may provide an inexpensive path to finished genomes

ENDEND

Assembly Q+AAssembly Q+A

Alex Copeland(Assembly group lead)

Alicia Clum(Analyst,

Genome assembly)

James Han(QC group lead)

Metagenome AssemblyMetagenome Assembly

1 10 100 1000 10000

Acid mine Soil

Species per metagenome

Cow rumen

-All challenges of isolate genome assembly remain-Extra challenges from diversity and different abundance of constituent genomes- The same strategies as isolate assembly can be used, but many heuristics fail for metagenomes

Road map for Long Mate Pair (LMP) library Improvement & Development

Fragmentation

Circularization

Paired tag

generation

Amplification

Purification

Clean up

efficiency

Hydroshear Covaris Transposon

Cre-Lox Hybridization Ligation

RE enzyme Nick translation shearing

PCR cycle #, specificity, choice of enzyme ---

CLIP LFPE Illumina

column

Electro-elution

Electrophoresis

The final assembly (k=3)

wor times itwasthe foolishness

incredulity age epoch be

st wisdom

of belief

A better assembly (k=20)

itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis…

Repeat with a longer “kmer” length

Why not always use longest ‘k’ possible?

Sequencing errors:

sthebentof

sth theheb

ebeben

entnto

sthebentof

k=10100% wrong kmer

Mostly unaffected

Short read assemblies are Short read assemblies are improving over timeimproving over time

Sequenced genome properties remain constant

But illumina sequence quantity and quality is increasing…

…resulting in better microbial genome assemblies

Average results from sub-optimal “QC” assemblies

Q1. What fraction of kmers are unique Q1. What fraction of kmers are unique in a typical microbial genome?in a typical microbial genome?

Kmer = 30, most of the genome is in unique kmers

Kmer = 150A small fraction of the genome remains non-unique

(6 known microbe genomes, 15 kmer lengths)

Majority of genome contained in unique fragments

Microbe

Q2. How do non-unique kmers fragment the assembly?

Assemblies will be fragmented!How do we improve this?

Metagenome assembly is an Metagenome assembly is an ongoing challengeongoing challenge

-All challenges of isolate genome assembly remain-Extra challenges from diversity and different abundance of constituent genomes- The same strategies as isolate assembly can be used, but many heuristics fail for metagenomes

Useful ReviewsUseful Reviews

• Miller JR, Koren S, Sutton G. , Assembly algorithms for next-generation sequencing data. Genomics. 2010 Jun;95(6):315-27.

• Mihai Pop, Genome assembly reborn: recent computational challenges. Brief Bioinform (2009) 10 (4):354-366.

Illumina data qualityIllumina data qualitySyntrophorhabdus aromaticivoransSyntrophorhabdus aromaticivorans

Read Quality

Genome Properties

Library Quality

Run Quality

Illumina data qualityIllumina data quality Opitutaceae bacterium TAV2

Genome Properties

Library Quality

Run Quality

Read Quality

Metagenomes are harder to assembleMetagenomes are harder to assemble

velvet gap size distribution of aligned contig shreds

< 100 100-999 1000-1999 2000-2999 3000-3999 > 4000

Gap size (bp)

4085750 std (trimmed toQ20)4085750 jumping(trimmed to Q20)4085750 std trimmed +jumping trimmed4086221 std

4086221 jumping

4086221 std + jumping

Velvet gaps Velvet gaps Fibrobacter succinogenes & Ignisphaera aggregansFibrobacter succinogenes & Ignisphaera aggregans

Features of AssemblersFeatures of Assemblers

Algorithm Feature OLC Assemblers DBG Assemblers

Read features Base substitutions Euler, AllPaths, SOAPHomopolymer miscount CABOGConcentrated error in 3′ end EulerFlow space Newbler

Removal of erroneous reads Based on K-mer frequencies Euler, Velvet, AllPathsBased on K-mer freq and QV AllPathsFor multiple values of K AllPathsBy alignment to other reads CABOG

Base error correction Based on K-mer frequencies Euler, SOAPBased on Kmer freq and QV AllPathsBased on alignments CABOG

Graph construction Reads as graph nodes CABOG, Newbler, EdenaK-mers as graph nodes Euler, Velvet, ABySS, SOAPSimple paths as graph nodes AllPaths

Graph reduction Collapse simple paths CABOG, Newbler Euler, Velvet, SOAPErosion of spurs CABOG, Edena Euler, Velvet, AllPaths, SOAPBubble smoothing Edena Euler, Velvet, SOAPBubble detection AllPathsReads separate tangled paths Euler, SOAPBreak at low coverage Velvet, SOAPBreak at high coverage CABOG EulerHigh coverage indicates repeat CABOG Velvet

Graph partitions Partition by K-mers ABySSPartition by scaffolds AllPaths

Mate pairs Constrain path searches Euler, Velvet, AllPathsGuide path selection Euler, AllpathsMerge contigs or fill gaps CABOG, Shorty Velvet, ABySS, SOAPTransitive link reduction CABOG SOAPDetect, avoid repeat contigs CABOG Velvet, SOAPCreate scaffolds CABOG, Shorty Euler, Velvet, AllPaths, SOAP

J.R.Miller et al. Genomics 95 (2010)

Pop M Brief Bioinform 2009;10:354-366

(A) Overlap between two read (agreement within overlapping region need not be perfect); (B) Correct assembly of a genome with two repeats (boxes) using four reads A–D; (C) Incorrect assembly produced by the greedy approach.(D) Disagreement between two reads (thin lines) that could extend a contig (thick line), indicating a potential repeat boundary. Contig extension must be terminated in order to avoid misassembly.

Greedy overlapGreedy overlap

CORRECT

INCORRECT

Overlap graph of a genome containing a two-copy repeat (B).

Overlap-Layout-Consensus (OLC)Overlap-Layout-Consensus (OLC)

Comparing Overlap and Comparing Overlap and de Bruijn Graphsde Bruijn Graphs

Schatz et al., Genome Res. (2010)

Iterative kmer evaluation: Iterative kmer evaluation: IDBAIDBA

Y Peng, et al. IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler (2010)

Jeremy Leipzig (jerdemo.blogspot.com/2009/11/using-vmatch-to-combine-assemblies.html

N50N50

The N50 size of a set of entities (e.g., contigs or scaffolds) represents the largest entity E such that at least half of the total size of the entities is contained in entities larger than E.

For example, given a collection of contigs with sizes 7, 4, 3, 2, 2, 1, and 1 kb (total size = 20kbp), the N50 length is 4 because we can cover 10 kb with contigs bigger than 4kb. (http://www.cbcb.umd.edu/research/castats.shtml)

N50 length is the length ‘x’ such that 50% of the sequence is contained in contigs of length x or greater.

(Waterston http://www.pnas.org/cgi/reprint/100/6/3022.pdf)

Theoretical performanceTheoretical performance

Cahill et al., PLoS ONE (2010)

Assessing performance of a range of read lengths

Repeat-induced gaps

Supplement: Assembler Supplement: Assembler flowchartsflowcharts

PhrapPhrap

Bamidele-Abegunde T. 2010 http://library2.usask.ca

CAP3 & PCAPCAP3 & PCAP

MIRAMIRA

VelvetVelvet

Supplement: Genome Supplement: Genome ImprovementImprovement

Typical Microbial project

FINISHING

Annotation

Public release

Sequencing

Draftassembly

Goals:

Completely restore genome

Produce high quality consensus

Metagenomic assembly and Metagenomic assembly and FinishingFinishing

• Typically size of metagenomic sequencing project is very large

• Different organisms have different coverage. Non-uniform sequence coverage results in significant under- and over-representation of certain community members

• Low coverage for the majority of organisms in highly complex communities leads to poor (if any) assemblies

• Chimerical contigs produced by co-assembly of sequencing reads originating from different species.

• Genome rearrangements and the presence of mobile genetic elements (phages, transposons) in closely related organisms further complicate assembly.

• No assemblers developed for metagenomic data sets

The whole-genome shotgun sequencing approach was used for a number of

microbial community projects, however useful quality control and assembly

of these data require reassessing methods developed to handle relatively

uniform sequences derived from isolate microbes.

QC: Annotation of poor quality QC: Annotation of poor quality sequencesequence

To avoid this:

make sure you use high quality sequence

choose proper assembler

De Bruijn graph for binary code 0000110010111101

Follow the blue numbered edges to resolve the graphCompeau et al., Nature Biotechnology (2011)

Hamiltonian vs Eulerian cycle

Compeau et al., Nature Biotechnology (2011)

Evaluating Assemblers

• When evaluating assemblers we use reference microbial datasets and assess• Number of scaffolds and contigs• Performance across a range of size, GC

content and repeat content• Accuracy of sequence produced • Gene content captured

ReferenceQ

Node: A point, read or kmer

Edge: a line connecting two nodes

Graph: a network of nodes connected by edges

Vocabulary

Compeau et al., Nature Biotechnology (2011)

VocabularyVocabulary

Schatz et al., Genome Res. (2010)

•Kmer: a substring of defined length. For the purposes of this talk a substring of the sequence read•de Bruijn graph: a graph representing overlaps between kmers

node edge

mgm workshop assembly tutorial alicia clum doe joint genome institute, walnut creek, ca

assembly basicsshort

bruijn graph exampleshort

assembly scaffoldingcontents1

bruijn approachbreak

debruijn graph

short reads

short insert

datasetde bruijn examplestep

Documents

mgm portfolio

mgm resorts international (nyse: mgm) - sec.gov · mgm...

scanned by...

materialgruppenmanagement (mgm)

exhibit 99.1 mgm resorts international reports ......or 9%...

munich/hq aachen bamberg berlin Đà nẵng …...2017/10/05...

hamburg münchen berlin köln leipzig mgm consulting...

homophobic discourse in t w, clum

mgm annualreport2003

mgm resorts international investor presentation · mgm...

mgm workshop assembly tutorial matthew blow doe joint genome...

genome assembly at...

· synth drums bass trp 1+2 14 — 26 es i sol sol sol sol...

mgm | porto08

amazing walnut nutrition, benefits of walnut

mgm grands

steve merry - mgm polled herefords2 generations mgm polled...

walnut hill church – walnut hill church

mgm holdings inc. - amazon...

mgm history