Assembling NGS Data - IMB Winter School - 3 July 2012
![Page 1: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/1.jpg)
Assembling NGS data
Dr Torsten Seemann
IMB Winter School - Brisbane – Tue 3 July
![Page 2: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/2.jpg)
Ideal world
I would not need to give this talk!
AGTCTAGGATTCGCTATAGATTCAGGCTCTGATATATTTCGCGGGATTAGCTAGATCGCTATGCTATGATCTAGATCTCGAGATTCGTATAAGTCTAGGATTCGCTATAGATTCAGGCTCTGATATATTTCGCGGGATTAGCTA
Human DNA → Non-existent USB3 device → 46 complete haplotype chromosome sequences
![Page 3: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/3.jpg)
Real world
• Can’t sequence full-length native DNA
  – no instrument exists (yet)
• But we can sequence short fragments
  – 100 at a time (Sanger)
  – 100,000 at a time (Roche 454)
  – 1,000,000 at a time (Ion Torrent)
  – 100,000,000 at a time (HiSeq 2000)
![Page 4: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/4.jpg)
De novo assembly
• De novo assembly is the process of reconstructing the original DNA sequences using only the fragment read sequences
• Instinctively
  – like a jigsaw puzzle
  – involves finding overlaps between reads
  – sequencing errors will confuse matters
![Page 5: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/5.jpg)
Shakespearomics
• Reads
  – ds, Romans, count
  – ns, countrymen, le
  – Friends, Rom
  – send me your ears;
  – crymen, lend me
• Overlaps
    Friends, Rom
         ds, Romans, count
                 ns, countrymen, le
                         crymen, lend me
                                 send me your ears;
• Majority rule
    Friends, Romans, countrymen, lend me your ears;
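The majority-rule step can be sketched in a few lines of Python. This is an illustration only: the offsets below were worked out by hand from the overlaps on the slide, and `majority_consensus` is a made-up helper name, not part of any assembler.

```python
from collections import Counter

# Reads from the slide, each paired with the position its overlaps imply.
# The offsets are an assumption, derived by hand for this toy example.
placed_reads = [
    (0, "Friends, Rom"),
    (5, "ds, Romans, count"),
    (13, "ns, countrymen, le"),
    (21, "crymen, lend me"),      # contains an error: "c" instead of "t"
    (29, "send me your ears;"),   # contains an error: "s" instead of "l"
]

def majority_consensus(placed):
    """Vote column by column; the most common character wins."""
    columns = {}
    for offset, read in placed:
        for i, ch in enumerate(read):
            columns.setdefault(offset + i, Counter())[ch] += 1
    return "".join(columns[pos].most_common(1)[0][0] for pos in sorted(columns))

consensus = majority_consensus(placed_reads)
print(consensus)
```

The two read errors are outvoted because each erroneous column is covered by two correct reads.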
![Page 6: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/6.jpg)
The awful truth
“Genome assembly is impossible.”
A/Prof. Mihai Pop
World leader in de novo assembly research.
He wears glasses so he must be smart
![Page 7: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/7.jpg)
Approaches
• greedy assembly
• overlap :: layout :: consensus
• de Bruijn graphs
• string graphs
• seed and extend
… all essentially doing the same thing, but taking different short cuts.
![Page 8: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/8.jpg)
Assembly recipe
• Find all overlaps between reads
  – hmm, sounds like a lot of work…
• Build a graph
  – a picture of read connections
• Simplify the graph
  – sequencing errors will mess it up a lot
• Traverse the graph
  – trace a sensible path to produce a consensus
![Page 9: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/9.jpg)
![Page 10: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/10.jpg)
Find read overlaps
• If we have N reads of length L
  – we have to do ½N(N-1) ~ O(N²) comparisons
  – each comparison is an ~ O(L²) alignment
  – use special tricks/heuristics to reduce these!
• What counts as “overlapping”?
  – minimum overlap length, e.g. 20 bp
  – minimum % identity across overlap, e.g. 95%
  – choice depends on L and expected error rate
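A minimal sketch of the two overlap criteria, assuming an ungapped suffix-to-prefix comparison (a simplification: a real assembler would use a proper alignment that allows indels). The function names and default thresholds are illustrative, taken from the examples on the slide.

```python
def best_overlap(a, b, min_len=20, min_identity=0.95):
    """Length of the longest suffix of `a` that matches a prefix of `b`.

    Ungapped comparison only. Returns 0 if no overlap satisfies both the
    minimum length and the minimum identity.
    """
    shortest = max(min_len, 1)
    for k in range(min(len(a), len(b)), shortest - 1, -1):
        matches = sum(x == y for x, y in zip(a[-k:], b[:k]))
        if matches / k >= min_identity:
            return k
    return 0

def pair_count(n):
    """Number of unordered read pairs to compare: N(N-1)/2."""
    return n * (n - 1) // 2

# Exact overlap "ds, Rom" between two of the Shakespeare reads.
print(best_overlap("Friends, Rom", "ds, Romans, count", min_len=5, min_identity=1.0))
# One mismatch in ten is tolerated at 85% identity.
print(best_overlap("ns, countrymen, le", "crymen, lend me", min_len=5, min_identity=0.85))
print(pair_count(6))
```

Trying the longest candidate first means the function stops at the first overlap that passes both thresholds.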
![Page 11: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/11.jpg)
N=6 → 15 alignment scores
Read# 1 2 3 4 5 6
1 - - - - - -
2 80 - - - - -
3 95 85 - - - -
4 0 30 20 - - -
5 0 0 25 70 - -
6 0 35 25 60 50 -
![Page 12: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/12.jpg)
Graph construction
Thicker lines mean stronger evidence for overlap
Node/Vertex
Edge/Arc
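As a rough sketch, the N=6 score table can be turned into a graph by keeping only the pairs whose score clears a threshold. The cut-off of 50 is a hypothetical choice for illustration, and `build_graph` is a made-up helper.

```python
# Pairwise scores copied from the N=6 table (lower triangle).
scores = {
    (2, 1): 80,
    (3, 1): 95, (3, 2): 85,
    (4, 1): 0, (4, 2): 30, (4, 3): 20,
    (5, 1): 0, (5, 2): 0, (5, 3): 25, (5, 4): 70,
    (6, 1): 0, (6, 2): 35, (6, 3): 25, (6, 4): 60, (6, 5): 50,
}

def build_graph(scores, min_score):
    """Nodes are reads; an edge joins two reads whose score is high enough.
    The score is kept as the edge weight (the "line thickness")."""
    graph = {}
    for (a, b), score in scores.items():
        graph.setdefault(a, {})
        graph.setdefault(b, {})
        if score >= min_score:
            graph[a][b] = score
            graph[b][a] = score
    return graph

graph = build_graph(scores, min_score=50)  # threshold is an assumption
for read, neighbours in sorted(graph.items()):
    print(read, neighbours)
```

With this cut-off the six reads fall into two groups, reads 1 to 3 and reads 4 to 6.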
![Page 13: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/13.jpg)
A more realistic graph
![Page 14: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/14.jpg)
What ruins the graph?
• Read errors
  – introduce false edges and nodes
• Non-haploid organisms
  – heterozygosity causes lots of detours
• Repeats
  – if longer than read length
  – causes nodes to be shared, locality confusion
![Page 15: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/15.jpg)
Graph simplification
• Squash small bubbles
  – collapse small errors (or minor heterozygosity)
• Remove spurs
  – short “dead end” hairs on the graph
• Join unambiguously connected nodes
  – reliable stretches of unique DNA
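Spur removal, one of the simplification steps above, might look roughly like this. It is a toy sketch on an undirected graph: the node names, lengths, and the 50 bp cut-off are all invented, and real assemblers apply more careful rules (coverage, direction, repeated passes).

```python
def remove_spurs(adj, length, max_spur_len):
    """Drop short dead-end nodes that hang off a branching node.

    adj:    node -> set of neighbouring nodes (undirected)
    length: node -> sequence length of that node
    """
    spurs = {
        node for node, nbrs in adj.items()
        if len(nbrs) == 1                          # dead end
        and length[node] <= max_spur_len           # short
        and len(adj[next(iter(nbrs))]) >= 3        # attached at a branch point
    }
    return {
        node: nbrs - spurs
        for node, nbrs in adj.items()
        if node not in spurs
    }

# A chain A-B-C-D with a short hair X sticking out of B.
adj = {"A": {"B"}, "B": {"A", "C", "X"}, "C": {"B", "D"}, "D": {"C"}, "X": {"B"}}
length = {"A": 500, "B": 400, "C": 600, "D": 450, "X": 12}
pruned = remove_spurs(adj, length, max_spur_len=50)
print(pruned)
```

The ends of the main chain (A and D) are also dead ends, but they are long and so are kept.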
![Page 16: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/16.jpg)
Graph traversal
• For each unconnected graph
  – at least one per replicon in original sample
• Find a path which visits each node once
  – the Hamiltonian path (or cycle)
  – provably NP-hard (this is bad)
  – unlikely to be single path due to repeat nodes
  – solution will be a set of paths which terminate at decision points
• Form a consensus sequence from path
  – use all the overlap alignments
  – each of these is a CONTIG
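The idea of "paths which terminate at decision points" can be sketched as follows. This is not a Hamiltonian path search; it only walks the stretches of a directed graph where there is no choice to make. The graph and the name `unbranched_paths` are invented for illustration.

```python
from collections import defaultdict

def unbranched_paths(edges):
    """Split a directed graph into maximal paths with no decision points
    in their interior. Isolated cycles are ignored in this sketch."""
    succ = defaultdict(list)
    pred = defaultdict(list)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    def is_simple(node):
        # Exactly one way in and one way out: nothing to decide here.
        return len(pred[node]) == 1 and len(succ[node]) == 1

    paths = []
    for node in sorted(nodes):
        if is_simple(node):
            continue
        for nxt in succ[node]:
            path = [node, nxt]
            while is_simple(path[-1]):
                path.append(succ[path[-1]][0])
            paths.append(path)
    return paths

# A repeat node R shared by two routes: a1-a2-R-c1-c2 and b1-R-d1.
edges = [("a1", "a2"), ("a2", "R"), ("b1", "R"),
         ("R", "c1"), ("c1", "c2"), ("R", "d1")]
paths = unbranched_paths(edges)
for p in paths:
    print(" -> ".join(p))
```

Every path stops at R because the graph alone cannot say which way in goes with which way out.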
![Page 17: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/17.jpg)
Graph traversal
![Page 18: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/18.jpg)
What happens with repeats?
The repeated element is collapsed into a single contig
![Page 19: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/19.jpg)
Mis-assembled repeats
[Figure: three ways a repeat can be mis-assembled: collapsed tandem, excision, rearrangement]
![Page 20: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/20.jpg)
The law of repeats
• It is impossible to resolve repeats of length S unless you have reads longer than S.
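A toy demonstration of why short reads cannot resolve a repeat, under idealised assumptions (error-free reads, every position sampled). The sequences are invented: two genomes that differ only in the order of the unique segments between three copies of a repeat.

```python
def all_reads(genome, read_len):
    """Every substring of the given length: an idealised, error-free read set."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}

REPEAT = "GATTACA"                       # repeat of length S = 7
a, b, c, d = "CCT", "TTG", "AAG", "CC"   # made-up unique sequence
genome_1 = a + REPEAT + b + REPEAT + c + REPEAT + d
genome_2 = a + REPEAT + c + REPEAT + b + REPEAT + d   # b and c swapped

S = len(REPEAT)
# Reads no longer than the repeat: both genomes yield exactly the same reads.
same_when_short = all_reads(genome_1, S) == all_reads(genome_2, S)
# Reads that span the repeat plus a base on each side tell the genomes apart.
differ_when_long = all_reads(genome_1, S + 2) != all_reads(genome_2, S + 2)
print(same_when_short, differ_when_long)
```

If two different genomes produce identical read sets, no assembler can choose between them from the reads alone.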
![Page 21: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/21.jpg)
Types of reads
• Example fragment
  – atcgtatgatcttgagattctctcttcccttatagctgctata
• “Single-end” read
  – atcgtatgatcttgagattctctcttcccttatagctgctata
  – Sequence one end of the fragment
• “Paired-end” read
  – atcgtatgatcttgagattctctcttcccttatagctgctata
  – Sequence both ends of same fragment
  – we can exploit this information!
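A small sketch of the difference, using the example fragment. The read length of 10 is arbitrary, and reporting read 2 as a reverse complement reflects the usual Illumina convention (the second read comes off the opposite strand); treat that as an assumption about the platform rather than something stated on the slide.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def single_end(fragment, read_len):
    """One read from the left end of the fragment."""
    return fragment[:read_len]

def paired_end(fragment, read_len):
    """Read 1 from the left end; read 2 from the right end, reported as a
    reverse complement because it is read off the opposite strand."""
    return fragment[:read_len], reverse_complement(fragment[-read_len:])

fragment = "atcgtatgatcttgagattctctcttcccttatagctgctata"
r1, r2 = paired_end(fragment, read_len=10)   # read_len=10 is illustrative
print(single_end(fragment, 10))
print(r1, r2)
```

The middle of the fragment is never read, but the pair still tells us that both ends came from the same piece of DNA.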
![Page 22: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/22.jpg)
Scaffolding
• Paired-end reads
  – known sequences at either end
  – roughly known distance between ends
  – unknown sequence between ends
• Most ends will occur in same contig
  – if our contigs are longer than pair distance
• Some ends will be in different contigs
  – evidence that these contigs are linked!
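The linking evidence can be tallied with a simple count. This sketch assumes each read pair has already been mapped back to contigs; the contig names, the toy data, and the `min_pairs` cut-off are hypothetical.

```python
from collections import Counter

def contig_links(pair_hits, min_pairs=2):
    """Count read pairs whose two ends land in different contigs.

    pair_hits: one (contig of read 1, contig of read 2) tuple per read pair.
    Links supported by fewer than `min_pairs` pairs are discarded as noise.
    """
    links = Counter()
    for c1, c2 in pair_hits:
        if c1 != c2:
            links[tuple(sorted((c1, c2)))] += 1
    return {pair: n for pair, n in links.items() if n >= min_pairs}

pair_hits = [
    ("contig1", "contig1"),   # both ends in the same contig: no link
    ("contig1", "contig2"),
    ("contig2", "contig1"),
    ("contig2", "contig3"),   # only one supporting pair
    ("contig3", "contig3"),
    ("contig1", "contig2"),
]
links = contig_links(pair_hits)
print(links)
```

A real scaffolder would also use read orientation and the expected pair distance to order the contigs and size the gaps.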
![Page 23: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/23.jpg)
Contigs to Scaffolds
[Figure: paired-end reads link contigs into a scaffold, with gaps between the contigs]
![Page 24: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/24.jpg)
![Page 25: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/25.jpg)
What can we assemble?
• Genomes
  – A single organism, e.g. its chromosomal DNA
• Meta-genomes
  – gDNA from mixtures of organisms
• Transcriptomes
  – A single organism’s RNA inc. mRNA, ncRNA
• Meta-transcriptomes
  – RNA from a mixture of organisms
![Page 26: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/26.jpg)
Genomes
• Expect uniformity
  – Each part of genome represented by roughly equal number of reads
• Average depth of coverage
  – Genome: 4 Mbp
  – Yield: 4 million x 50 bp reads = 200 Mbp
  – Coverage: 200 ÷ 4 = 50x (reads per bp)
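The coverage arithmetic above as a one-line function (the function name is ours, not from any tool):

```python
def depth_of_coverage(n_reads, read_len, genome_size):
    """Average number of reads covering each base: total yield / genome size."""
    return n_reads * read_len / genome_size

# 4 million reads of 50 bp over a 4 Mbp genome, as on the slide.
depth = depth_of_coverage(n_reads=4_000_000, read_len=50, genome_size=4_000_000)
print(f"{depth:.0f}x")
```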
![Page 27: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/27.jpg)
Meta-genomes
• Expect proportionality & uniformity
  – Each genome represented by proportion of reads similar to their proportion in mixture
• Example
  – Mix of 3 species: ¼ Staph, ¼ Clost, ½ Ecoli
  – Say we get 4M reads
  – Then we expect about: 1M from Staph, 1M from Clost, 2M from Ecoli
![Page 28: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/28.jpg)
Meta-genome issues
• Closely related species
  – will have very similar reads
  – lots of shared nodes in the graph
• Conserved sequence
  – bits of DNA common to lots of organisms
  – “hub” nodes in the graph
• Untangling is difficult
  – need longer reads
![Page 29: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/29.jpg)
Transcriptomes
• RNA-Seq
  – first convert it into DNA (cDNA)
  – represents a snapshot of RNA activity
• Expect proportionality
  – the expression level of a gene is proportional to the number of reads from that gene’s cDNA
![Page 30: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/30.jpg)
Transcriptome issues
• Huge dynamic range
  – some genes get lots of reads, some get none
• Splice variation
  – very similar, subtly different transcripts
  – lots of shared nodes in graph
![Page 31: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/31.jpg)
Meta-transcriptomes
• RNA-Seq
  – on multiple transcriptomes at once
• Expect proportional proportionality
  – proportion of that organism in mixture
  – proportions due to expression levels
• Meta x transcriptome issues combined!
![Page 32: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/32.jpg)
Assessing assemblies
• Genome assembly
  – Total length similar to genome size
  – Fewer, larger contigs
  – Correctness of contigs
• Metrics
  – Maximum contig length
  – N50 (next slide)
![Page 33: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/33.jpg)
The “N50”
• “The length of the contig for which it, together with all shorter contigs, contains 50% of the bases”
• Imagine we got 7 contigs with lengths:
  – 1,1,3,5,8,12,20
• Total
  – 1+1+3+5+8+12+20 = 50
• The “halfway sum” is 25
  – 1+1+3+5+8+12 = 30 (≥ 25) so N50 is 12
![Page 34: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/34.jpg)
N50 concerns
• Optimizing for N50
  – encourages mis-assemblies!
• An aggressive assembler may over-join:
  – 1,1,3,5,8,12,20 (previous)
  – 1,1,3,5,20,20 (now)
  – 1+1+3+5+20+20 = 50 (unchanged)
• The “halfway sum” is still 25
  – 1+1+3+5+20 = 30 (≥ 25) so N50 is 20
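A short sketch of the N50 calculation exactly as worked on these slides: sort the contigs, add up lengths from the shortest, and stop when the running total reaches half. Note that many tools accumulate from the longest contig instead; for the two contig sets here both directions give the same answer, but they can differ on other data.

```python
def n50(contig_lengths):
    """N50 as worked on the slide: running sum from the shortest contig up,
    returning the contig at which the sum reaches half the total."""
    half = sum(contig_lengths) / 2
    running = 0
    for length in sorted(contig_lengths):
        running += length
        if running >= half:
            return length
    return None  # empty input

before = n50([1, 1, 3, 5, 8, 12, 20])   # original assembly
after = n50([1, 1, 3, 5, 20, 20])       # after an aggressive over-join
print(before, after)
```

Joining the 8 and 12 contigs raises N50 without adding a single base, whether or not the join was correct.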
![Page 35: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/35.jpg)
Assembly tools
• Genome
  – Velvet, Abyss, Mira, Newbler, SGA, AllPaths, Ray, Euler, SOAPdenovo, Edena, Arachne
• Meta-genome
  – MetaVelvet, SGA, custom scripts + above
• Transcriptome
  – Trans-Abyss, Oases, Trinity
• Meta-Transcriptome
  – custom scripts + above
![Page 36: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/36.jpg)
Example
• Culture your bacterium
• Extract your genomic DNA
• Send it to AGRF for Illumina sequencing
  – 100bp paired end
• Get back two files:
  – MRSA_R1.fastq.gz
  – MRSA_R2.fastq.gz
• Now what?
![Page 37: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/37.jpg)
Velvet: hash reads
velveth Dir 31 -fmtAuto -separate MRSA_R1.fastq.gz MRSA_R2.fastq.gz
New options
No interleaving required
![Page 38: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/38.jpg)
Velvet: assembly
velvetg Dir -exp_cov auto -cov_cutoff auto
-exp_cov: “Signal” level
-cov_cutoff: “Noise” level
![Page 39: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/39.jpg)
Velvet: examine results
less Dir/contigs.fa
>NODE_1_length_43211_cov_27.36569
AGTCGATGCTTAGAGAGTATGACCTTCTATACAAAAATCTTATATTAGCGCTAGTCTGATAGCTCCCTAGATCTGATCTGATATGATCTTAGAGTATCGGCTATTGCTAGTCTCGCGTATAATAAATAATATATTTTTCTAATGATCTTATATTAGCGCTAGTCTGATAGCTCCCTAGATCTGATCTGATATGATCTTAGAGTATCGGCTATTGCTAGTCTCGCGTATAATAAATAATATATTTAGTAGTCT …
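The contig header packs the node number, length, and coverage into one string. A minimal sketch for pulling them apart, assuming every header follows the pattern shown above (`parse_header` is a made-up helper, not part of Velvet):

```python
import re

# Pattern inferred from the example header; other assemblers name contigs differently.
HEADER = re.compile(r"^>NODE_(\d+)_length_(\d+)_cov_([\d.]+)$")

def parse_header(line):
    """Return (node number, length, coverage), or None if the line does not match."""
    m = HEADER.match(line.strip())
    if m is None:
        return None
    node, length, cov = m.groups()
    return int(node), int(length), float(cov)

info = parse_header(">NODE_1_length_43211_cov_27.36569")
print(info)
```

Collecting the length field from every header gives the input needed for metrics such as maximum contig length and N50.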
![Page 40: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/40.jpg)
Velvet: GUI
Where to save
Click run
Add your reads
Velvet Assembler Graphical User Environment
![Page 41: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/41.jpg)
![Page 42: Assembling NGS Data - IMB Winter School - 3 July 2012](https://reader030.vdocuments.mx/reader030/viewer/2022020123/55a8e1b71a28ab3f6a8b45ba/html5/thumbnails/42.jpg)
Contact
• Email
  – [email protected]
• Web
  – http://vicbioinformatics.com/
  – http://vlsci.org.au/
• Blog
  – http://TheGenomeFactory.blogspot.com