Robert Arthur, Kevin Lee, Xing Liu, Pushkar Pande, Gena Tang, Racchit Thapliyal, Tianjun Ye
Sequencing Methods
Experimental comparison of De Bruijn graph and Overlap graph assemblers
Preliminary Results
Lab Exercise
Sanger Sequencing
◦ Cycle sequencing reaction
◦ ddNTP-terminated, dye-labeled products
◦ High-resolution electrophoretic separation
◦ Parallelized in 96 or 384 capillaries
◦ Read lengths up to 1 kbp
◦ Raw accuracy up to 99.999%
◦ Costs 50¢ per kb
Sequencing Methods
Second Gen. Sequencing
◦ Cyclical array methods: 454, Illumina, AB SOLiD, Polonator, HeliScope
◦ Platforms vary in biochemistry and array generation, yet are conceptually similar in workflow
Sequencing Methods
Illumina
Illumina continued
AB SOLiD
Create a DNA library
◦ Ligate adaptors to fragments
Emulsion PCR
◦ Agarose beads
◦ Oil, water, PCR reagents
◦ Results in 1 million copies per fragment for each bead
454 Pyrosequencing
Beads arrayed into picotiter plate
◦ Immobilized via addition of enzyme-containing beads
◦ Each cell contains exactly 1 bead
Bst polymerase, luciferase, apyrase, and ATP sulfurylase used
More 454
Even more 454
Example of Output
[Figure: example 454 output. Flow order TACG; signal levels 1-mer, 2-mer, 3-mer, 4-mer; key (TCAG)]
Measures the presence or absence of each nucleotide at any given position
Videos (454 Workflow)
Videos (Pyrosequencing)
note: we did not choose the music
Comparison of 2nd Gen Platforms
Experimental comparison of De Bruijn graph and Overlap graph assemblers
De Bruijn Graph assemblers and Overlap Graph assemblers
De Bruijn Graph assemblers
◦ Velvet, ABySS, Euler
Overlap Graph assemblers
◦ Newbler, Edena, SSAKE, VCAKE
Write a C program to simulate reads from a reference genome with a specific read length, coverage, and base error rate
◦ Human chr 22, ~33.5M bases
◦ Streptococcus suis, NC_012925.1, ~2M bases
◦ Helicobacter acinonychis Sheeba, ~1.5M bases
Write another C program to measure the quality of assemblers
◦ N50 length
◦ No. of contigs
◦ Max contig length
◦ No. of mis-assembled contigs
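The read simulator described above can be sketched roughly as follows. This is a minimal Python illustration, not the C program used for the experiments: the function name `simulate_reads`, the uniform choice of start positions, forward-strand-only sampling, and substitution-only errors are all assumptions made for the example.

```python
import random

def simulate_reads(genome, read_length, coverage, error_rate, seed=0):
    """Sample fixed-length reads from `genome` and inject substitution errors.

    Assumptions (not from the slides): reads start at uniformly random
    positions, come from the forward strand only, and errors are
    single-base substitutions.
    """
    rng = random.Random(seed)
    # Choose the read count so that reads * read_length ~= coverage * genome length.
    num_reads = round(coverage * len(genome) / read_length)
    reads = []
    for _ in range(num_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        bases = list(genome[start:start + read_length])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Replace the base with one of the three other bases.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(bases))
    return reads
```

The read count follows directly from the three parameters: coverage times genome length, divided by read length.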
Synthetic Data used for Experiments
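For reference, the contig statistics listed for the quality-measurement program can be computed along these lines. This is a sketch in Python rather than the C program from the experiments; it assumes N50 is taken relative to the total assembled length (not the reference genome length), and the names `n50` and `summarize` are made up for the example. Counting mis-assembled contigs needs an alignment against the reference and is left out.

```python
def n50(contig_lengths):
    """Return the N50 length: the largest L such that contigs of length >= L
    together cover at least half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def summarize(contig_lengths):
    """Collect the simple contig statistics reported in the result tables."""
    return {
        "num_contigs": len(contig_lengths),
        "total_length": sum(contig_lengths),
        "max_contig": max(contig_lengths, default=0),
        "n50": n50(contig_lengths),
    }
```

For contig lengths 8, 8, 4, 3, 3, 2, 2, 2 (32 bases in total), the two longest contigs already cover 16 bases, so the N50 is 8.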
De Bruijn graph assemblers are only suitable for short-read data
K limitation
◦ Use a hash table or sorting to index K-mers
◦ Need to use a unique key (value) to represent each K-mer
  K = 16: 4^16 = 2^32 <-> 32-bit integer (unsigned int)
  K = 32: 4^32 = 2^64 <-> 64-bit integer (unsigned long long)
  K > 32: multiple integers are needed to represent the hash table key
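The key sizes above follow from packing each base into 2 bits. A small Python sketch of that encoding is shown here; the A=0, C=1, G=2, T=3 mapping and the function names are assumptions for illustration, and a real assembler may pack K-mers differently.

```python
# Assumed 2-bit code per base; any fixed assignment of 0..3 works.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_kmer(kmer):
    """Pack a K-mer into an integer, 2 bits per base."""
    key = 0
    for base in kmer:
        key = (key << 2) | CODE[base]
    return key

def kmer_keys(read, k):
    """Yield the integer key of every K-mer in `read` with a rolling update."""
    mask = (1 << (2 * k)) - 1  # keep only the lowest 2*k bits
    key = 0
    for i, base in enumerate(read):
        key = ((key << 2) | CODE[base]) & mask
        if i >= k - 1:
            yield key
```

Python integers grow as needed, so the K > 32 case is hidden here. In C the same key needs `unsigned int` for K <= 16, `unsigned long long` for K <= 32, and an array of integers beyond that.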
Read Length
Simulate reads from Streptococcus suis: 300 bp read length, 50X coverage, error rate 0.1%
Velvet default: K <= 31, so we use 31

Assembler  # of contigs (total length)  N50 length  # of misassembled contigs (total length)
Velvet     46515 (1716053 bp)           115 bp      5 (1346 bp)

Recompile Velvet, K = 99

Assembler  # of contigs (total length)  N50 length  # of misassembled contigs (total length)
Velvet     441 (1974382 bp)             15328 bp    1 (34 bp)
It is stated in some of the literature that “De Bruijn based approach prone to false positives” and “Overlap graph has better quality”
Quality and Accuracy
Simulate reads from Helicobacter acinonychis Sheeba: 35 bp read length, 50X coverage, error rate 0.1%

Assemblers  # of contigs (total length)  N50 length  # of misassembled contigs (total length)
Velvet      336 (1525746 bp)             10.4 kbp    17 (156637 bp)
Edena       340 (1513259 bp)             9.8 kbp     0 (0 bp)

Simulate reads from Streptococcus suis: 35 bp read length, 50X coverage, error rate 0.1%

Assemblers  # of contigs (total length)  N50 length  # of misassembled contigs (total length)
Velvet      1106 (1969617 bp)            5266 bp     12 (255594 bp)
Edena       1003 (1970342 bp)            6416 bp     0 (0 bp)
Overlap graph based assemblers are computationally expensive and use more memory
◦ All-to-all alignment of reads, O(n^2)
◦ Use more memory to store the overlap graph
Typically, the number of reads is much larger than the number of K-mers
◦ Especially for short-read data
  With the same coverage and genome length, shorter reads mean more reads
◦ It is stated that De Bruijn graphs are more suitable for NGS data
  Shorter reads and high throughput
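To make the O(n^2) cost concrete, here is a naive all-against-all overlap computation in Python. It is a teaching sketch only: it uses exact suffix-prefix matching instead of error-tolerant alignment, and it is not how Edena, SSAKE, or VCAKE are implemented. The names `overlap_length` and `overlap_graph` are made up for the example.

```python
def overlap_length(a, b, min_overlap):
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_overlap` (>= 1) bases exists.
    """
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def overlap_graph(reads, min_overlap):
    """Compare every ordered pair of reads: n * (n - 1) comparisons for n reads."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                length = overlap_length(a, b, min_overlap)
                if length:
                    edges[(i, j)] = length  # edge i -> j weighted by overlap
    return edges
```

Both the number of comparisons and the number of stored edges grow with the read count, which is why shorter reads at the same coverage make this approach more expensive.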
Runtime and Memory Usage
Assemblers Time Memory
Velvet 33 secs ~220 M
SSAKE 26 mins ~900 M
VCAKE 107 mins ~1.1 G
Simulate reads from Streptococcus suis: 802995 reads, 50 bp read length, 20X coverage, error rate 0.1%
Xeon E5530 2.4 GHz
Recent advances in pattern matching algorithms and techniques enable the use of overlap graphs
◦ Suffix tree, suffix array, prefix array, compressed suffix array
Suffix array
◦ Able to find overlaps between reads in linear time
◦ Use of a compressed suffix array can significantly reduce the memory requirements of overlap graph assemblers
Examples
◦ D. Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel, De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Research 18:802-809, 2008.
◦ Jared T. Simpson and Richard Durbin, Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12):i367-i373, 2010.
◦ Pasqual: Pushkar and I have developed a parallel sequence assembler based on the overlap graph in our research project
However!
Assemblers Time Memory
Velvet 292 mins ~17 GB
Edena 37 mins ~7 GB
Pasqual 43 mins ~8 GB
Parallel Pasqual 9 mins ~8 GB
Simulate reads from Human chr22: 6978908 reads, 50 bp read length, 20X coverage, error rate 0.1%
Xeon E5530 2.4 GHz with 4 cores/8 threads
H. influenzae
◦ Read lengths of 30 to 300
Velvet does not work
◦ K is fixed
◦ If we use a big K, reads shorter than K cannot be assembled
◦ If we use a small K, it is difficult to assemble the long reads
Overlap graph assemblers do not have this issue
◦ Newbler
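The fixed-K problem can be seen in a textbook De Bruijn construction, sketched below in Python. This is not Velvet's actual data structure; the function name `de_bruijn_edges` and the choice of (K-1)-mers as nodes are assumptions for illustration. Any read shorter than K contains no K-mer and so contributes nothing to the graph.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Build a simple De Bruijn graph: each K-mer links its (K-1)-base prefix
    to its (K-1)-base suffix. Returns the graph and the number of reads used."""
    graph = defaultdict(set)
    used = 0
    for read in reads:
        if len(read) < k:
            continue  # read is shorter than K: no K-mer fits, read is dropped
        used += 1
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph, used
```

With the reads `ACGTACGTAC` and `ACGT`, K = 5 uses only the first read, while K = 3 uses both but produces shorter, more ambiguous nodes.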
Mixed Length Reads
Controversial
◦ The relation between De Bruijn graph and Overlap graph is still unclear
We can still conclude from the experiments
◦ Regarding quality and accuracy, Overlap graph assemblers are thought to be better than De Bruijn graph assemblers
◦ De Bruijn graph assemblers do not work for long reads
◦ De Bruijn graph assemblers do not work for mixed-length reads (K is fixed)
◦ Traditional overlap graph assemblers are slower and use more memory, but the latest assemblers are better than De Bruijn graph assemblers
Conclusion
Preliminary Results
Quality score and length distribution
Id      Mean length  Median length  Std dev
M19107  577.5849     569            83.9605
M19501  624.7172     621            78.4074
M21127  618.7576     616            81.5678
M21621  620.6305     621            83.978
M21639  573.384      564            66.5525
M21709  626.2459     624            78.2447
Velvet
Id      K   No. of contigs  N50  Max length  Total length  % reads used
M19107  19  217160          16   665         2905543       97.3535
M19107  29  176741          26   655         3315033       88.7319
M19501  19  618036          13   429         4716286       78.9177
M19501  29  537077          18   490         5725530       35.5981
M21127  19  319999          15   483         3498613       91.4239
M21127  29  259942          24   416         3998418       73.0187
M21621  19  218872          16   640         3052522       93.7490
M21621  29  157853          26   838         3256837       87.5425
M21639  19  770867          13   628         5818868       85.0236
M21639  29  680339          19   601         7348599       46.1671
M21709  19  291156          16   768         3425632       95.7695
M21709  29  207736          25   816         3637419       83.8704
$> velveth <output_dir> <k-mer length> -fasta -long <reads.fasta>
$> velvetg <output_dir>
Input: Fasta/Fastq
Output: Fasta
WGS assembler (Celera)
Id      No. of contigs  N50    Max length  Total length  % reads used
M19107  236             11881  32038       1766060       96.3570
M19501  214             1230   4519        278112        98.6032
M21127  345             8349   26765       1947955       97.9181
M21621  356             7791   30668       1892633       98.1710
M21639  326             2092   9912        610813        98.3939
M21709  520             4393   15002       1700040       98.5221
$> sffToCA -trim soft -libraryname ${Id}-trimsoft -output ${Id}-trimsoft ${Id}.sff
$> runCA -p ${Id} -d ${Id} ovlConcurrency=4 ${Id}-trimsoft.frg
Input: frg format
Output: Fasta
• >50 separate programs make up the Celera Assembler pipeline
• runCA script helps manage them all
Newbler
De Novo Assembly
Id      No. of contigs  N50     Max length  Total length
M19107  217             15659   38000       25112606
M19501  75              157459  343196      106836011
M21127  59              121256  316274      40693944
M21621  50              138437  339424      50432798
M21639  175             43023   182797      158028027
M21709  52              140128  319869      69503256
Reference Assembly (Haemophilus-influenzae-refseq.fasta)
Id      No. of contigs  N50    Max length  Total length
M19107  1260            2496   10409       1224223
M19501  988             3503   18724       1380153
M21127  -               -      -           -
M21621  -               -      -           -
M21639  1272            2701   13712       1416318
M21709  313             13836  70298       1607841
Input: .sff
Output: Fasta
$> runAssembly <reads.sff> // de novo assembly
MIRA
Id No.of Contigs N50 Max length Total length % reads used
M19107 208 18379 51687 1795134 95.7478
M19501 181 185484 321569 1901198 97.7347
M21127 89 81157 305626 1951240 97.4776
M21621 67 90877 253924 1887484 97.5015
M21639 175 90800 152373 2378888 98.1330
M21709 83 62871 197745 1840248 97.6776
MIRA stands for Mimicking Intelligent Read Assembly
$> sff_extract -s ${Id}_in.454.fasta -q ${Id}_in.454.fasta.qual -x ${Id}_traceinfo_in.454.xml ${Id}.sff
$> mira --project=${Id} --job=denovo,genome,normal,454 -GE:not=4 >& ${Id}_assembly.log
Input: Fasta + qual + trace info
Output: Fasta, Ace
Eagle view - M19107.ace
Eagle view - M19501.ace
“Next-generation DNA sequencing”, Shendure and Ji, http://compgenomics2011.biology.gatech.edu/images/f/f9/Shendure-NatureBiotechnology-2008.pdf
“Next-generation DNA sequencing methods”, Mardis, http://compgenomics2011.biology.gatech.edu/images/5/59/Mardis-AnnuRevGenet-2008.pdf
Works Cited
Download the Lab Exercise file from the Genome Assembly wiki page
Lab Exercise