MGM Workshop Assembly Tutorial. Alicia Clum, DOE Joint Genome Institute, Walnut Creek, CA. May 14, 2012.
![Page 1: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/1.jpg)
MGM Workshop
Assembly Tutorial
Alicia Clum
DOE Joint Genome Institute, Walnut Creek, CA
May 14, 2012
![Page 2: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/2.jpg)
Contents
1. Vocabulary introduction
2. Introduction to short-read genome sequencing and assembly
3. Practical experience of short read genome assembly
4. Improving genome assembly using 3rd generation sequencing
![Page 4: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/4.jpg)
Vocabulary
• Fragment library: a short insert (270 bp) library with overlapping ends. Also known as a standard (std) library.
• Long insert library: a 4-8 kb library where only 100 bp on each end are sequenced. Also known as a CLIP or mate pair library.
• Contig: a contiguous sequence of DNA.
• Scaffold: one or more contigs linked together by unknown sequence.
• Captured gap: a gap within a scaffold. The order and orientation of the contigs flanking the gap are known.
A B C D E
![Page 5: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/5.jpg)
Contents
2. Introduction to short-read genome sequencing and assembly
• Short read sequencing and assembly basics
• Short read assembly - De Bruijn graph example
• Short read assembly - Scaffolding
![Page 6: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/6.jpg)
Why sequence genomes using short reads?
[Table: traditional genome sequencing technology vs. short-read technology. Extracted values, column assignment not recoverable: 150bp, 450bp, 650bp; 1, 1,000, 20,000; $400, $15, $0.1]
$$! - We have to figure out how to sequence microbial genomes using only Illumina data
![Page 7: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/7.jpg)
Short read genome sequencing
Genomic DNA
-> Random fragmentation -> 270 bp fragments; 4-8 kb fragments
-> Molecular biology, sequencing (Illumina) -> paired-end short insert reads (10's of millions); paired-end long insert reads (10's of millions)
How do we assemble this data back into a genome?
![Page 8: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/8.jpg)
Assembly outline
Reads -> Assembly algorithms (e.g. Allpaths, Velvet, Meraculous) -> Contigs -> Scaffolds
![Page 9: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/9.jpg)
Assembly outline
Reads -> 'De Bruijn' assembly -> Contigs -> Scaffolds
![Page 10: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/10.jpg)
Contents
2. Introduction to short-read genome sequencing and assembly
• Short read sequencing and assembly basics
• Short read assembly - De Bruijn graph example
• Short read assembly - Scaffolding
![Page 11: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/11.jpg)
De Bruijn example
“It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness, it was the
epoch of belief, it was the epoch of incredulity,.... “
Dickens, Charles. A Tale of Two Cities. 1859. London: Chapman Hall
Example courtesy of J. Leipzig 2010
![Page 12: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/12.jpg)
De Bruijn example
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness…
Generate random ‘reads’ How do we assemble?
fincreduli geoffoolis Itwasthebe Itwasthebe geofwisdom itwastheep epochofinc timesitwas stheepocho nessitwast wastheageo theepochof
stheepocho hofincredu estoftimes eoffoolish lishnessit hofbeliefi pochofincr itwasthewo twastheage toftimesit domitwasth ochofbelie
eepochofbe eepochofbe astheworst chofincred theageofwi iefitwasth ssitwasthe astheepoch efitwasthe wisdomitwa ageoffooli twasthewor
ochofbelie sdomitwast sitwasthea eepochofbe ffoolishne eofwisdomi hebestofti stheageoff twastheepo eworstofti stoftimesi theepochof
esitwasthe heepochofi theepochof sdomitwast astheworst rstoftimes worstoftim stheepocho geoffoolis ffoolishne timesitwas lishnessit
stheageoff eworstofti orstoftime fwisdomitw wastheageo heageofwis incredulit ishnessitw twastheepo wasthewors astheepoch heworstoft
ofbeliefit wastheageo heepochofi pochofincr heageofwis stheageofw fincreduli astheageof wisdomitwa wastheageo astheepoch olishnessi
astheepoch itwastheep twastheage wisdomitwa fbeliefitw bestoftime epochofbel theepochof sthebestof lishnessit hofbeliefi Itwasthebe
ishnessitw sitwasthew ageofwisdo twastheage esitwasthe twastheage shnessitwa fincreduli fbeliefitw theepochof mesitwasth domitwasth
ochofbelie heageofwis oftimesitw stheepocho bestoftime twastheage foolishnes ftimesitwa thebestoft itwastheag theepochof itwasthewo
ofbeliefit bestoftime mitwasthea imesitwast timesitwas orstoftime estoftimes twasthebes stoftimesi sdomitwast wisdomitwa theworstof
astheworst sitwasthew theageoffo eepochofbe
…etc. to 10’s of millions of reads
Traditional all-vs-all assemblers fail due to immense computational resources (scales with the square of the number of reads). A million (10^6) reads requires a trillion (10^12) pairwise alignments.
De Bruijn solution: represent the data as a graph (scales with genome size).
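The scaling argument on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the genome-size bound assumes an error-free, single-strand genome (real data adds error kmers and reverse complements).

```python
def pairwise_alignments(n_reads: int, ordered: bool = True) -> int:
    """Count read-vs-read comparisons in an all-vs-all overlap step.

    ordered=True counts every (a, b) combination, which is the
    "number of reads squared" figure used on the slide.
    ordered=False counts each unordered pair once.
    """
    if ordered:
        return n_reads * n_reads
    return n_reads * (n_reads - 1) // 2


def max_distinct_kmers(genome_length: int, k: int) -> int:
    """Upper bound on distinct kmers in an error-free linear genome.

    This bound depends on genome length, not on how many reads
    were sequenced, which is why the graph approach scales better.
    """
    return max(genome_length - k + 1, 0)


if __name__ == "__main__":
    print(pairwise_alignments(10**6))         # all-vs-all comparisons
    print(max_distinct_kmers(5_000_000, 31))  # graph nodes for a 5 Mb genome
```

Doubling the number of reads quadruples the first number but leaves the second unchanged.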
![Page 13: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/13.jpg)
De Bruijn example
Step 1: Convert reads into “Kmers”
Kmer: a substring of defined length
Reads -> Kmers (k=3):
theageofwi -> the hea eag age geo eof ofw fwi
sthebestof -> sth the heb ebe bes est sto tof
astheageof -> ast sth the hea eag age geo eof
worstoftim -> wor ors rst sto tof oft fti tim
imesitwast -> ime mes esi sit itw twa was ast
…..etc for all reads in the dataset
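Step 1 can be sketched in a few lines. This is a minimal illustration using the example reads from the slide; the function name `kmers` is an assumption, not part of any assembler.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return every substring of length k, in order of position.

    A read of length L yields L - k + 1 kmers; reads shorter
    than k yield none.
    """
    return [read[i:i + k] for i in range(len(read) - k + 1)]


if __name__ == "__main__":
    for read in ["theageofwi", "sthebestof", "astheageof"]:
        print(read, "->", " ".join(kmers(read, 3)))
```

Each 10-letter read produces eight 3-mers, matching the lists on the slide.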
![Page 14: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/14.jpg)
De Bruijn example
Step 2: Build a De Bruijn graph from the kmers
the -> hea -> eag -> age -> geo -> eof -> ofw -> fwi
sth -> the -> heb -> ebe -> bes -> est -> sto -> tof
ast -> sth -> the -> hea -> eag -> age -> geo -> eof
wor -> ors -> rst -> sto -> tof -> oft -> fti -> tim
ime -> mes -> esi -> sit -> itw -> twa -> was -> ast
…..etc for all ‘kmers’ in the dataset
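One way to picture Step 2 is a dictionary that maps each kmer to the kmers that follow it in at least one read. The sketch below follows the slide's convention of kmers as nodes; the function name is hypothetical, and real assemblers also track coverage and reverse complements, which are omitted here.

```python
def build_kmer_graph(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each kmer to the set of kmers that directly follow it in some read.

    Identical kmers from different reads collapse into one node,
    which is how overlapping reads become connected.
    """
    graph: dict[str, set[str]] = {}
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in read_kmers:
            graph.setdefault(kmer, set())  # make sure every kmer is a node
        for left, right in zip(read_kmers, read_kmers[1:]):
            graph[left].add(right)         # consecutive kmers overlap by k-1
    return graph


if __name__ == "__main__":
    graph = build_kmer_graph(["theageofwi", "sthebestof"], 3)
    for node in sorted(graph):
        print(node, "->", sorted(graph[node]))
```

In this small example the node `the` has two successors (`hea` and `heb`), a branch caused by the repeated word.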
![Page 15: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/15.jpg)
De Bruijn example
Step 3: Simplify the graph as much as possible:
A De Bruijn Graph
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of
foolishness, it was the epoch of belief, it was the epoch of incredulity,.... “
De Bruijn assemblies are ‘broken’ by repeats longer than the kmer length
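Graph simplification can be sketched as merging every unbranched chain of kmers into one sequence and stopping at branch points, which is where repeats break the assembly. This is a simplified illustration under stated assumptions: function names are hypothetical, coverage and sequencing errors are ignored, and isolated cycles are skipped.

```python
def build_kmer_graph(reads, k):
    """Map each kmer to the set of kmers that directly follow it in some read."""
    graph = {}
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in read_kmers:
            graph.setdefault(kmer, set())
        for left, right in zip(read_kmers, read_kmers[1:]):
            graph[left].add(right)
    return graph


def collapse_unbranched_paths(graph):
    """Merge chains of one-in, one-out kmers into longer sequences.

    Every successor must also be a key of `graph`, as produced by
    build_kmer_graph. Branch nodes start and end paths, so they
    appear in more than one output sequence.
    """
    in_degree = {node: 0 for node in graph}
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1

    def is_chain_node(node):
        return in_degree[node] == 1 and len(graph[node]) == 1

    contigs = []
    for node in graph:
        if is_chain_node(node):
            continue  # interior of a chain; reached from the chain's start
        if not graph[node] and in_degree[node] == 0:
            contigs.append(node)  # isolated kmer with no neighbours
        for successor in graph[node]:
            path, current = node, successor
            while is_chain_node(current):
                path += current[-1]  # extend by the one new character
                current = next(iter(graph[current]))
            path += current[-1]
            contigs.append(path)
    return sorted(contigs)


if __name__ == "__main__":
    graph = build_kmer_graph(["abcdx", "abcdy"], 3)
    print(collapse_unbranched_paths(graph))
```

With the two toy reads above, the shared prefix collapses into one sequence and the graph splits at the branch, giving three pieces instead of one.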
![Page 16: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/16.jpg)
Drawback of De Bruijn approach: no single solution!
Break graph to produce final assembly
Step 4: Dump graph into consensus (fasta)
![Page 17: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/17.jpg)
Kmer size is an important parameter in De Bruijn assembly
The final assembly (k=3)
wor times itwasthe foolishness
incredulity age epoch be
st wisdom
of belief
Repeat with a longer “kmer” length
A better assembly (k=20)
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis…
Why not always use longest ‘k’ possible?
Sequencing errors:
sthebentof
k=3: sth the heb ebe ben ent nto tof (mostly unaffected kmers)
k=10: sthebentof (100% wrong kmer)
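The trade-off can be made concrete by counting how many kmers a single substitution error corrupts at different values of k. The sketch below uses the slide's read (`sthebestof` misread as `sthebentof`); the function names are hypothetical and the comparison assumes a substitution error, so both reads have the same length.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return every substring of length k, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def fraction_wrong_kmers(true_read: str, observed_read: str, k: int) -> float:
    """Fraction of kmers in the observed read that differ from the true read.

    Assumes the two reads have equal length (substitution errors only).
    """
    if len(true_read) != len(observed_read):
        raise ValueError("reads must have the same length")
    true_kmers = kmers(true_read, k)
    observed_kmers = kmers(observed_read, k)
    if not observed_kmers:
        raise ValueError("k is longer than the read")
    wrong = sum(1 for t, o in zip(true_kmers, observed_kmers) if t != o)
    return wrong / len(observed_kmers)


if __name__ == "__main__":
    for k in (3, 10):
        print(k, fraction_wrong_kmers("sthebestof", "sthebentof", k))
```

At k=3 only the three kmers that span the error are wrong; at k=10 the single kmer is the whole read, so all of it is wrong.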
![Page 18: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/18.jpg)
Contents
2. Introduction to short-read genome sequencing and assembly
• Short read sequencing and assembly basics
• Short read assembly - De Bruijn graph example
• Short read assembly - Scaffolding
![Page 19: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/19.jpg)
Scaffolding
Reads -> ‘De Bruijn’ assembly -> Contigs
Align reads to De Bruijn contigs
Join contigs using evidence from paired end data
-> Scaffolds (an assembly)
“Captured” gaps caused by repeats. Represented by “NNN” in assembly
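The idea of joining contigs with paired-end evidence can be sketched as follows. This is a deliberately simplified illustration, not how any particular scaffolder works: it assumes a single read pair, both contigs in forward orientation, 0-based coordinates, and a known insert size. All function and parameter names are hypothetical.

```python
def estimate_gap(insert_size: int, left_contig_len: int,
                 left_read_start: int, right_read_end: int) -> int:
    """Estimate the gap between two contigs from one read pair.

    left_read_start: where the first read begins on the left contig.
    right_read_end:  where its mate ends on the right contig (exclusive).
    The insert covers the tail of the left contig, the gap, and the
    head of the right contig, so the gap is what remains.
    """
    bases_in_left = left_contig_len - left_read_start
    bases_in_right = right_read_end
    return insert_size - bases_in_left - bases_in_right


def join_with_gap(left: str, right: str, gap: int, min_gap: int = 1) -> str:
    """Join two contigs into a scaffold, writing the captured gap as N's."""
    return left + "N" * max(gap, min_gap) + right


if __name__ == "__main__":
    gap = estimate_gap(insert_size=4000, left_contig_len=3000,
                       left_read_start=1000, right_read_end=1500)
    print(gap)
    print(join_with_gap("ACGT", "TTGA", 3))
```

In practice many read pairs vote on each join and insert sizes vary, so the gap length is an estimate rather than an exact value.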
![Page 20: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/20.jpg)
Contents
1. Vocabulary introduction
2. Introduction to short-read genome sequencing and assembly
3. Practical experience of short read genome assembly
4. Improving genome assembly using 3rd generation sequencing
![Page 21: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/21.jpg)
Real life assembly is messy!
Assembly in theory: uniform coverage, no errors, no contamination
Assembly in reality:
Biased coverage (-> gaps)
Contaminant reads (-> incorrect + inflated assembly)
Chimeric reads (-> mis-joins)
Sequencing errors (-> fragmented assembly)
Worse than predicted assemblies!
![Page 22: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/22.jpg)
Real life assembly is messy!
[Figure: fraction of normalized coverage vs. GC% of 100 base windows, with the theoretical curve; coverage (x) vs. reference position (bp)]
![Page 23: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/23.jpg)
Genome properties can also make assembly difficult
Biased sequence composition
ACTGTCTAGTCAGCGCGCGCGCGCGCGCCCGCGCGCGCGGGCGGCGGCGCGGGCGGGCGCATGTAGTGATC
RESULT: incomplete / fragmented assembly
High repeat content
RESULT: misassemblies / collapsed assemblies
Polyploidy
RESULT: fragmented assembly
Biased sequence abundance
RESULT: incomplete / fragmented assembly
![Page 24: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/24.jpg)
How do we get a good assembly?
![Page 25: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/25.jpg)
Key features of the JGI microbe sequencing and assembly pipeline
Genomic DNA
Fragment: 270 bp and 4-8 kb libraries. Longer inserts: improved scaffolding.
Sequence: 2 x 150 bp and 2 x 100 bp. Overlapping paired-end reads for short insert library: allows for long ‘kmer’. Illumina V3 sequencing chemistry: reduced GC bias.
QC: extensive data QC (remove artifacts, remove contaminants, reorder libraries).
Assembler: AllPaths LG assembler, the most complete + most accurate assembler in our hands. Internal error correction.
-> Best Assemblies
![Page 26: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/26.jpg)
Benefits of ALLPATHS
• Internal error correction of all data types
• Simplifies the graph by removing kmers at low coverage caused by errors
• Given the correct input data, a good assembly can be produced with default options
• Overlaps the fragment data to use a longer kmer size
• Produces accurate and highly contiguous assemblies
[Figure: velvet assembly results for error corrected standard data. Number of contigs (0-4000) by error correction method (allpaths-lg, control-no error correction, msr-ca). Series: 4084251, 4092796]
![Page 27: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/27.jpg)
What fraction of a genome should we be able to assemble, that is, can be represented in unique kmers?
Simulated De Bruijn assembly for six ‘known’ microbial genomes
Kmer = 30: most of the genome CAN be assembled
97%
Kmer = 150: a small fraction of the genome remains ‘unassemblable’
Matt Blow
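The question on this slide can be explored on any sequence by counting how many kmer positions hold a kmer that occurs exactly once. The sketch below is an assumption about how such a measurement could be made, not the method used for the simulation on the slide; it treats the genome as a single linear strand and ignores reverse complements.

```python
from collections import Counter


def unique_kmer_fraction(genome: str, k: int) -> float:
    """Fraction of kmer positions whose kmer occurs exactly once.

    Repeats longer than k produce kmers with count > 1, which are
    the positions a De Bruijn assembler cannot resolve on its own.
    """
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    unique = sum(1 for count in counts.values() if count == 1)
    return unique / total


if __name__ == "__main__":
    text = "itwasthebestoftimesitwastheworstoftimes"
    for k in (3, 5, 10, 20):
        print(k, round(unique_kmer_fraction(text, k), 3))
```

Running it on the Dickens text with increasing k shows how the repeated phrases affect the unique fraction at each kmer length.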
![Page 28: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/28.jpg)
How fragmented are short read assemblies?
[Figure: number of scaffolds (0-150) for assemblies from simulated data, short insert library only: B. murdochii, C. flavigena, C. woesei, H. turkmenica, S. smaragdinae]
Assemblies using only short insert sequencing libraries are acceptable (<100 scaffolds)
Goal: genome in 1 fragment per replicon (7 replicons)
![Page 29: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/29.jpg)
How good are short read assemblies?
[Figure: number of scaffolds, short insert library only, for assemblies from simulated data and assemblies from real data]
Assemblies using only short insert sequencing libraries are acceptable (<100 scaffolds)
Goal: genome in 1 fragment per replicon (7 replicons)
![Page 30: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/30.jpg)
How good are short read assemblies?
[Figure: number of scaffolds, short + long insert libraries, for assemblies from simulated data and assemblies from real data]
Assemblies using short and long insert sequencing libraries are very good (often a single fragment!)
Goal: genome in 1 fragment per replicon (7 replicons)
![Page 31: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/31.jpg)
Comparison of assembly results to reference genome annotation

| Libraries | Average number of fragments | Average % known genes identified |
| --- | --- | --- |
| Short insert library only | 85 | 97.3% |
| Short + long insert libraries | 3 | 97.4% |

Near complete and accurate assemblies are now possible using only short read data.
![Page 32: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/32.jpg)
But problems remain:
Reads -> ‘De Bruijn’ assembly -> Contigs
Align reads to De Bruijn contigs
Join contigs using evidence from paired end data
-> An assembly
“Captured” gaps caused by repeats. Represented by “NNN” in assembly
![Page 33: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/33.jpg)
Contents
1. Vocabulary introduction
2. Introduction to short-read genome sequencing and assembly
3. Practical experience of short read genome assembly
4. Improving genome assembly using 3rd generation sequencing
![Page 34: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/34.jpg)
Long reads from “3rd generation” Pacific Biosciences sequencer hold promise for improving short-read based assemblies
Pacific Biosciences Sequencer
“Early” data: mean read length = 1080 bp
Late April 2012: mean read length = 2743 bp
Maximum read length >4 kb
[Figure: number of reads (2,000-14,000) vs. read length (1-7 kb)]
Low coverage bias: reduced sensitivity to G+C rich regions compared to Illumina chemistry
[Figure: fraction of normalized coverage vs. GC% of 100 base windows]
High error rate: up to 15% error rate, in-del errors
![Page 35: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/35.jpg)
PacBio pilot study
For each of 20 microbes:
Genomic DNA
Illumina: 2 x 150 bp, 270 bp insert + 2 x 100 bp, 8 kb insert (Illumina HiSeq, V3 chemistry, 2x150b)
+ PacBio: “1000” bp
Assembler: AllPaths V39750 (PacBio enabled)
Assembly 1
Assembly 2
![Page 36: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/36.jpg)
3rd Generation sequence data improves genome assembly
Assembled using: Illumina only vs. Illumina + PacBio
Most improved genome: 53 / 71 (75%) gaps closed
Least improved genomes (.. but started out in good shape)
![Page 37: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/37.jpg)
PacBio assembly improvement is greatest for genomes with high GC content
MOST improved genomes: average 66% G+C
LEAST improved genomes: average 56% G+C
Conclusion:
3rd generation sequencing is a promising tool for better assemblies, especially for G+C rich genomes.
![Page 38: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/38.jpg)
Summary
• High quality genome sequencing using only short-reads is within reach
• Short-read microbial genome assemblies are minimally fragmented and contain the vast majority of known genes
• Third-generation sequencing may provide an inexpensive path to finished genomes
![Page 39: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/39.jpg)
END
![Page 40: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/40.jpg)
Assembly Q+A
Alex Copeland (Assembly group lead)
Alicia Clum (Analyst, Genome assembly)
James Han (QC group lead)
![Page 41: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/41.jpg)
Metagenome Assembly
[Figure: species per metagenome, log scale from 1 to 10,000: acid mine, soil, cow rumen]
- All challenges of isolate genome assembly remain
- Extra challenges from diversity and different abundance of constituent genomes
- The same strategies as isolate assembly can be used, but many heuristics fail for metagenomes
![Page 42: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/42.jpg)
Road map for Long Mate Pair (LMP) library Improvement & Development
Fragmentation
Circularization
Paired tag generation
Amplification
Purification
Clean up efficiency
Hydroshear, Covaris, Transposon
Cre-Lox, Hybridization, Ligation
RE enzyme, Nick translation, shearing
PCR cycle #, specificity, choice of enzyme
CLIP, LFPE, Illumina
column, Electro-elution, 2-D Electrophoresis
![Page 44: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/44.jpg)
Short read assemblies are improving over time
Sequenced genome properties remain constant
But Illumina sequence quantity and quality are increasing…
…resulting in better microbial genome assemblies
Average results from sub-optimal “QC” assemblies
![Page 45: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/45.jpg)
Q1. What fraction of kmers are unique in a typical microbial genome?
(6 known microbe genomes, 15 kmer lengths)
Kmer = 30: most of the genome is in unique kmers
97%
Kmer = 150: a small fraction of the genome remains non-unique
Majority of genome contained in unique fragments
![Page 46: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/46.jpg)
Q2. How do non-unique kmers fragment the assembly?
[Figure: results per microbe]
Assemblies will be fragmented! How do we improve this?
![Page 47: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/47.jpg)
Metagenome assembly is an ongoing challenge
- All challenges of isolate genome assembly remain
- Extra challenges from diversity and different abundance of constituent genomes
- The same strategies as isolate assembly can be used, but many heuristics fail for metagenomes
![Page 48: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/48.jpg)
Useful Reviews
• Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics. 2010 Jun;95(6):315-27.
• Pop M. Genome assembly reborn: recent computational challenges. Brief Bioinform. 2009;10(4):354-366.
![Page 49: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/49.jpg)
Illumina data quality: Syntrophorhabdus aromaticivorans
PASS
Read Quality
Genome Properties
Library Quality
Run Quality
![Page 50: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/50.jpg)
Illumina data quality: Opitutaceae bacterium TAV2
Genome Properties
Library Quality
Run Quality
Read Quality
FAIL
![Page 51: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/51.jpg)
Metagenomes are harder to assemble
![Page 52: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/52.jpg)
Velvet gaps: Fibrobacter succinogenes & Ignisphaera aggregans
[Figure: velvet gap size distribution of aligned contig shreds. Number of gaps (0-120) by gap size (bp): < 100, 100-999, 1000-1999, 2000-2999, 3000-3999, > 4000. Series: 4085750 std (trimmed to Q20); 4085750 jumping (trimmed to Q20); 4085750 std trimmed + jumping trimmed; 4086221 std; 4086221 jumping; 4086221 std + jumping]
![Page 53: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/53.jpg)
Features of Assemblers

| Algorithm | Feature | OLC Assemblers | DBG Assemblers |
| --- | --- | --- | --- |
| Read features | Base substitutions | | Euler, AllPaths, SOAP |
| | Homopolymer miscount | CABOG | |
| | Concentrated error in 3′ end | | Euler |
| | Flow space | Newbler | |
| Removal of erroneous reads | Based on K-mer frequencies | | Euler, Velvet, AllPaths |
| | Based on K-mer freq and QV | | AllPaths |
| | For multiple values of K | | AllPaths |
| | By alignment to other reads | CABOG | |
| Base error correction | Based on K-mer frequencies | | Euler, SOAP |
| | Based on K-mer freq and QV | | AllPaths |
| | Based on alignments | CABOG | |
| Graph construction | Reads as graph nodes | CABOG, Newbler, Edena | |
| | K-mers as graph nodes | | Euler, Velvet, ABySS, SOAP |
| | Simple paths as graph nodes | | AllPaths |
| Graph reduction | Collapse simple paths | CABOG, Newbler | Euler, Velvet, SOAP |
| | Erosion of spurs | CABOG, Edena | Euler, Velvet, AllPaths, SOAP |
| | Bubble smoothing | Edena | Euler, Velvet, SOAP |
| | Bubble detection | | AllPaths |
| | Reads separate tangled paths | | Euler, SOAP |
| | Break at low coverage | | Velvet, SOAP |
| | Break at high coverage | CABOG | Euler |
| | High coverage indicates repeat | CABOG | Velvet |
| Graph partitions | Partition by K-mers | | ABySS |
| | Partition by scaffolds | | AllPaths |
| Mate pairs | Constrain path searches | | Euler, Velvet, AllPaths |
| | Guide path selection | | Euler, AllPaths |
| | Merge contigs or fill gaps | CABOG, Shorty | Velvet, ABySS, SOAP |
| | Transitive link reduction | CABOG | SOAP |
| | Detect, avoid repeat contigs | CABOG | Velvet, SOAP |
| | Create scaffolds | CABOG, Shorty | Euler, Velvet, AllPaths, SOAP |

J.R. Miller et al. Genomics 95 (2010)
![Page 54: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/54.jpg)
Greedy overlap
(A) Overlap between two reads (agreement within the overlapping region need not be perfect); (B) Correct assembly of a genome with two repeats (boxes) using four reads A–D; (C) Incorrect assembly produced by the greedy approach; (D) Disagreement between two reads (thin lines) that could extend a contig (thick line), indicating a potential repeat boundary. Contig extension must be terminated in order to avoid misassembly.
Pop M. Brief Bioinform 2009;10:354-366
![Page 55: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/55.jpg)
Overlap-Layout-Consensus (OLC)
Overlap graph of a genome containing a two-copy repeat (B).
![Page 56: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/56.jpg)
Comparing Overlap and de Bruijn Graphs
Schatz et al., Genome Res. (2010)
![Page 57: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/57.jpg)
Iterative kmer evaluation: IDBA
Y Peng, et al. IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler (2010)
![Page 58: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/58.jpg)
Jeremy Leipzig (jerdemo.blogspot.com/2009/11/using-vmatch-to-combine-assemblies.html)
![Page 59: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/59.jpg)
N50
The N50 size of a set of entities (e.g., contigs or scaffolds) is the size of the largest entity E such that at least half of the total size of the entities is contained in entities at least as large as E.
For example, given a collection of contigs with sizes 7, 4, 3, 2, 2, 1, and 1 kb (total size = 20 kb), the N50 length is 4 kb because contigs of 4 kb or larger cover at least 10 kb (7 + 4 = 11 kb). (http://www.cbcb.umd.edu/research/castats.shtml)
N50 length is the length ‘x’ such that 50% of the sequence is contained in contigs of length x or greater.
(Waterston http://www.pnas.org/cgi/reprint/100/6/3022.pdf)
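The N50 definition translates directly into a short calculation: sort the lengths from largest to smallest and stop at the first length where the running total reaches half of the overall total. A minimal sketch (the function name `n50` is illustrative):

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a list of contig or scaffold lengths.

    N50 is the length x such that contigs of length x or greater
    contain at least half of the total sequence.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached at least half of the total
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # Contig sizes in kb from the example: total 20, half is 10.
    print(n50([7, 4, 3, 2, 2, 1, 1]))  # 4
```

For the example, 7 kb alone is below 10 kb, while 7 + 4 = 11 kb reaches it, so the N50 is 4 kb.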
![Page 60: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/60.jpg)
Theoretical performance
Cahill et al., PLoS ONE (2010)
Assessing performance of a range of read lengths
Repeat-induced gaps
![Page 61: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/61.jpg)
Supplement: Assembler flowcharts
![Page 62: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/62.jpg)
Phrap
Bamidele-Abegunde T. 2010 http://library2.usask.ca
![Page 63: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/63.jpg)
CAP3 & PCAP
Bamidele-Abegunde T. 2010 http://library2.usask.ca
![Page 64: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/64.jpg)
MIRA
Bamidele-Abegunde T. 2010 http://library2.usask.ca
![Page 65: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/65.jpg)
Velvet
Bamidele-Abegunde T. 2010 http://library2.usask.ca
![Page 66: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/66.jpg)
Supplement: Genome Improvement
![Page 67: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/67.jpg)
Typical Microbial project
FINISHING
Annotation
Public release
Sequencing
Draft assembly
Goals:
Completely restore genome
Produce high quality consensus
![Page 68: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/68.jpg)
Metagenomic assembly and Finishing
• Metagenomic sequencing projects are typically very large.
• Different organisms have different coverage. Non-uniform sequence coverage results in significant under- and over-representation of certain community members.
• Low coverage for the majority of organisms in highly complex communities leads to poor (if any) assemblies.
• Chimeric contigs are produced by co-assembly of sequencing reads originating from different species.
• Genome rearrangements and the presence of mobile genetic elements (phages, transposons) in closely related organisms further complicate assembly.
• No assemblers have been developed for metagenomic data sets.
The whole-genome shotgun sequencing approach has been used for a number of microbial community projects; however, useful quality control and assembly of these data require reassessing methods developed to handle relatively uniform sequences derived from isolate microbes.
![Page 69: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/69.jpg)
QC: Annotation of poor quality sequence
To avoid this:
make sure you use high-quality sequence
choose an appropriate assembler
![Page 70: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/70.jpg)
De Bruijn graph for binary code 0000110010111101
K=4
Follow the blue numbered edges to resolve the graph
Compeau et al., Nature Biotechnology (2011)
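Following the numbered edges on this slide amounts to walking an Eulerian cycle: every 4-mer is an edge between two 3-mer nodes, and each edge is used exactly once. The sketch below illustrates that idea with Hierholzer's algorithm, which is chosen here for illustration and is not named on the slide. All function names are hypothetical. The cycle it finds may differ from the one drawn on the slide, since a graph can have several Eulerian cycles.

```python
from itertools import product


def de_bruijn_graph(k, alphabet="01"):
    """Nodes are (k-1)-mers; each k-mer is an edge from its prefix to its suffix."""
    graph = {}
    for kmer in ("".join(p) for p in product(alphabet, repeat=k)):
        graph.setdefault(kmer[:-1], []).append(kmer[1:])
    return graph


def eulerian_cycle(graph, start):
    """Hierholzer's algorithm: follow unused edges, backing up when stuck."""
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, cycle = [start], []
    while stack:
        node = stack[-1]
        if remaining[node]:
            stack.append(remaining[node].pop())
        else:
            cycle.append(stack.pop())
    return cycle[::-1]


def cyclic_kmers(sequence, k):
    """All length-k windows of a circular sequence."""
    wrapped = sequence + sequence[: k - 1]
    return [wrapped[i : i + k] for i in range(len(sequence))]


binary_graph = de_bruijn_graph(4)
node_walk = eulerian_cycle(binary_graph, "000")
# Each edge traversed contributes one new character to the circular sequence.
resolved = "".join(node[-1] for node in node_walk[1:])
print(resolved, len(resolved))
```

Both the reconstructed sequence and the slide's string 0000110010111101, read circularly, contain each of the 16 possible 4-mers exactly once, which can be checked with `cyclic_kmers`.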
![Page 71: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/71.jpg)
Hamiltonian vs Eulerian cycle
Compeau et al., Nature Biotechnology (2011)
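In the Hamiltonian formulation each kmer is a node that must be visited exactly once, whereas in the Eulerian formulation each kmer is an edge that must be traversed exactly once. As a rough illustration of the Hamiltonian side, the sketch below builds an overlap graph over all binary 3-mers and searches for a cycle by backtracking. The alphabet, the choice of k=3, and the function names are assumptions made for this example. Backtracking is fine for 8 nodes, but no efficient general algorithm for Hamiltonian cycles is known, which is why the Eulerian view is preferred for assembly.

```python
from itertools import product


def overlap_graph(k, alphabet="01"):
    """Nodes are kmers; an edge a -> b exists when a's suffix equals b's prefix."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return {a: [b for b in kmers if a[1:] == b[:-1]] for a in kmers}


def hamiltonian_cycle(graph, start):
    """Backtracking search for a cycle that visits every node exactly once."""
    path, visited = [start], {start}

    def extend():
        if len(path) == len(graph):
            # All nodes used: succeed only if the last node links back to start.
            return start in graph[path[-1]]
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                if extend():
                    return True
                visited.remove(nxt)
                path.pop()
        return False

    return path if extend() else None


kmer_graph = overlap_graph(3)
kmer_cycle = hamiltonian_cycle(kmer_graph, "000")
# Spell the circular sequence from the first character of each visited kmer.
spelled = "".join(kmer[0] for kmer in kmer_cycle)
print(kmer_cycle, spelled)
```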
![Page 72: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/72.jpg)
Evaluating Assemblers
When evaluating assemblers we use reference microbial datasets and assess:
• Number of scaffolds and contigs
• Performance across a range of size, GC content and repeat content
• Accuracy of sequence produced
• Gene content captured
[Figure axes: Reference, Query]
![Page 73: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/73.jpg)
Vocabulary
Node: a point, read or kmer
Edge: a line connecting two nodes
Graph: a network of nodes connected by edges
Compeau et al., Nature Biotechnology (2011)
[Figure labels: Graph, Node, Edge]
![Page 74: MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56814544550346895db2100b/html5/thumbnails/74.jpg)
Vocabulary
Schatz et al., Genome Res. (2010)
• Kmer: a substring of defined length. For the purposes of this talk, a substring of the sequence read
• de Bruijn graph: a graph representing overlaps between kmers
[Figure labels: Kmer (k=3), node, edge]
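To make the two terms concrete, here is a minimal sketch that splits a read into kmers with k=3 and links kmers that are adjacent in the read, so each edge represents an overlap of k-1 bases. Treating kmers as nodes follows the vocabulary on this slide. The read `GATTACA` and the function names are made up for illustration.

```python
from collections import defaultdict


def kmers(read, k):
    """All substrings of length k, in the order they occur in the read."""
    return [read[i : i + k] for i in range(len(read) - k + 1)]


def de_bruijn_edges(reads, k):
    """Map each kmer (node) to the set of kmers that directly follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        read_kmers = kmers(read, k)
        for left, right in zip(read_kmers, read_kmers[1:]):
            # Adjacent kmers overlap by k-1 bases: left[1:] == right[:-1].
            graph[left].add(right)
    return graph


example_read = "GATTACA"  # hypothetical read
print(kmers(example_read, 3))  # ['GAT', 'ATT', 'TTA', 'TAC', 'ACA']

read_graph = de_bruijn_edges([example_read], 3)
for node in kmers(example_read, 3)[:-1]:
    print(node, "->", sorted(read_graph[node]))
```

With many overlapping reads, the same kmer appears in several reads and collapses to a single node, which is how the graph joins reads without comparing them pairwise.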