de novo genome assembly - imb winter school - 7 july 2015
De novo genome assembly
A/Prof Torsten Seemann
IMB Winter School – Brisbane, Australia – Tue 7 July 2015
Ideal world
I would not need to give this talk!
AGTCTAGGATTCGCTACAGATTCAGGCTCTGAAGCTAGATCGCTATGCTATGATCTAGATCTCGAGATTCGTATAAGTCTAGGATTCGCTATAGATTCAGGCTCTGATATAT
Human DNA iSequencer™
46 complete haplotype chromosome sequences
Real world
• Can’t sequence full-length native DNA
– no instrument yet
• But we can sequence short fragments
– 100 at a time (Sanger)
– 100,000 at a time (Roche 454)
– 1,000,000 at a time (PGM)
– 10,000,000 at a time (Proton, MiSeq)
– 100,000,000 at a time (HiSeq)
De novo assembly
The process of reconstructing the original DNA sequence from the fragment reads alone.
Instinctively like a jigsaw puzzle
– Find reads which “fit together” (overlap)
– Could be missing pieces (sequencing bias)
– Some pieces will be dirty (sequencing errors)
A small “genome”
Friends, Romans, countrymen, lend me your ears;
I’ll return them tomorrow!
Shakespearomics
• Reads
ds, Romans, count
ns, countrymen, le
Friends, Rom
send me your ears;
crymen, lend me
Whoops! I dropped them.
• Overlaps
Friends, Rom
     ds, Romans, count
             ns, countrymen, le
                     crymen, lend me
                             send me your ears;
I am good with words.
• Majority consensus
Friends, Romans, countrymen, lend me your ears;
We have a consensus!
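The majority vote can be sketched in a few lines of Python. This is a minimal illustration, not what a real assembler does: the start offsets in `layout` are assumed here (an assembler would derive them from the overlap alignments), and `majority_consensus` is a hypothetical helper name.

```python
from collections import Counter

# Reads from the slide, paired with their start offsets in the layout.
# The offsets are an assumption of this sketch; a real assembler would
# derive them from the overlap alignments.
layout = [
    (0, "Friends, Rom"),
    (5, "ds, Romans, count"),
    (13, "ns, countrymen, le"),
    (21, "crymen, lend me"),     # contains an error: "c" instead of "t"
    (29, "send me your ears;"),  # contains an error: "s" instead of "l"
]

def majority_consensus(layout):
    """Vote on each column of the layout and keep the most common character."""
    length = max(start + len(read) for start, read in layout)
    columns = [Counter() for _ in range(length)]
    for start, read in layout:
        for i, char in enumerate(read):
            columns[start + i][char] += 1
    return "".join(column.most_common(1)[0][0] for column in columns)

consensus = majority_consensus(layout)
print(consensus)
```

The two erroneous characters are each outvoted two to one by the reads that overlap them, so the original line is recovered.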
Approaches
• greedy assembly
• seed and extend
• de Bruijn graphs
• string graphs
… all essentially doing the same thing, but taking different shortcuts.
Find read overlaps
• If we have N reads of length L
– we have to do ½N(N-1) ~ O(N²) comparisons
– each comparison is an ~ O(L²) alignment
– use special tricks/heuristics to reduce these!
• What counts as “overlapping”?
– minimum overlap length, e.g. 20 bp
– minimum %identity across overlap, e.g. 95%
– choice depends on L and expected error rate
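A brute-force version of this search might look like the sketch below. It is deliberately naive: it compares bases position by position without gaps, which is a simplification standing in for the O(L²) alignment, and it uses none of the heuristics real assemblers rely on. The function names are hypothetical; the default thresholds are the 20 bp and 95% examples from the slide.

```python
from itertools import combinations

def suffix_prefix_overlap(a, b, min_len=20, min_identity=0.95):
    """Longest overlap between a suffix of `a` and a prefix of `b` that is
    at least `min_len` long with at least `min_identity` matching positions.
    Returns 0 if there is none. Ungapped comparison only (an assumption)."""
    shortest_allowed = max(min_len, 1)
    for length in range(min(len(a), len(b)), shortest_allowed - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        matches = sum(x == y for x, y in zip(suffix, prefix))
        if matches / length >= min_identity:
            return length
    return 0

def all_overlaps(reads, **thresholds):
    """Test every unordered pair of reads, in both orientations."""
    found = {}
    for i, j in combinations(range(len(reads)), 2):  # N(N-1)/2 pairs
        for x, y in ((i, j), (j, i)):
            length = suffix_prefix_overlap(reads[x], reads[y], **thresholds)
            if length:
                found[(x, y)] = length
    return found

reads = ["Friends, Rom", "ds, Romans, count", "ns, countrymen, le"]
overlaps = all_overlaps(reads, min_len=3, min_identity=1.0)
print(overlaps)
```

With the three error-free Shakespeare reads, a 3-character minimum and 100% identity, this reports that read 0 overlaps read 1 by 7 characters and read 1 overlaps read 2 by 9 characters.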
What ruins the graph?
• Read errors
– introduce false edges and nodes
• Heterozygosity
– causes lots of detours from homozygous areas
• Repeats
– cause nodes to be shared, confusing locality
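To see how a single read error adds false edges and nodes, here is a small sketch using a de Bruijn graph, one of the approaches listed earlier. The toy reads, the k-mer size `K = 4` and the helper names are all assumptions made for illustration.

```python
def de_bruijn_edges(reads, k):
    """Each k-mer in a read becomes an edge from its (k-1)-mer prefix
    to its (k-1)-mer suffix."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))
    return edges

def nodes_of(edges):
    """All nodes that appear at either end of an edge."""
    return {node for edge in edges for node in edge}

K = 4                                   # hypothetical k-mer size
clean_reads = ["ACGTTGCA", "TGCATGGA"]  # two error-free reads from a toy genome
bad_read = "ACGTAGCA"                   # first read with one substitution (T -> A)

clean = de_bruijn_edges(clean_reads, K)
noisy = de_bruijn_edges(clean_reads + [bad_read], K)

false_edges = noisy - clean
false_nodes = nodes_of(noisy) - nodes_of(clean)
print(len(false_edges), sorted(false_nodes))
# prints: 4 ['AGC', 'GTA', 'TAG']
```

One wrong base creates K false edges and K-1 false nodes. The false path leaves the true path at CGT and rejoins it at GCA, which is the kind of small bubble that graph simplification later collapses.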
Graph simplification
• Squash bubbles
– collapse small errors (or minor heterozygosity)
• Remove spurs
– short “dead end” hairs on the graph
• Join unambiguously connected nodes
– reliable stretches of unique DNA
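The last step, joining unambiguously connected nodes, can be sketched as follows. This is an illustrative version only: the node labels are hypothetical, the graph is a plain set of directed edges, and isolated cycles with no fork or dead end are not handled.

```python
from collections import defaultdict

def compact_paths(edges):
    """Merge chains of nodes that have exactly one way in and one way out.
    Returns the maximal unambiguous paths as lists of nodes.
    Isolated cycles with no fork or end are skipped (a known gap)."""
    succ, pred = defaultdict(set), defaultdict(set)
    nodes = set()
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
        nodes.update((u, v))

    def is_inner(node):
        return len(pred[node]) == 1 and len(succ[node]) == 1

    paths = []
    for start in nodes:
        if is_inner(start):
            continue  # paths only begin at forks, joins, or ends
        for nxt in succ[start]:
            path = [start, nxt]
            while is_inner(path[-1]):
                path.append(next(iter(succ[path[-1]])))
            paths.append(path)
    return paths

# Hypothetical graph: A -> B -> C, then C forks to D and E.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]
paths = sorted(compact_paths(edges))
print(paths)
# prints: [['A', 'B', 'C'], ['C', 'D'], ['C', 'E']]
```

Node B has one way in and one way out, so A, B and C merge into a single stretch. The fork at C cannot be resolved from the graph alone, so each path stops there.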
Graph traversal
• For each disconnected subgraph (connected component)
– at least one per replicon in original sample
• Find a path which visits each node once
– Hamiltonian path/cycle is NP-hard (this is bad)
– solution will be a set of paths which terminate at decision points
• Form consensus sequences from paths
– use all the overlap alignments
– each of these collapsed paths is a contig
Contigs
Contiguous, unambiguous stretches of assembled DNA sequence
• Contig ends correspond to
– Real ends (for linear DNA molecules)
– Dead ends (missing sequence)
– Decision points (forks in the road)
What is a repeat?
A segment of DNA which occurs more than once in the genome sequence
• Very common
– Transposons (self-replicating genes)
– Satellites (repetitive adjacent patterns)
– Gene duplications (paralogs)
The law of repeats
• It is impossible to resolve repeats of length S unless you have reads longer than S.
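A toy demonstration of why this holds: the two made-up genomes below contain the same 4 bp repeat three times, with the unique segments between the copies swapped. The sequences and the helper `read_spectrum` are assumptions for illustration, and the "reads" are idealised (every substring of a given length, no errors).

```python
from collections import Counter

def read_spectrum(genome, read_len):
    """Every substring of length `read_len`, with multiplicity:
    an idealised, error-free set of reads."""
    return Counter(genome[i:i + read_len]
                   for i in range(len(genome) - read_len + 1))

REPEAT = "GCGC"  # hypothetical repeat, S = 4
genome_1 = "AA" + REPEAT + "A" + REPEAT + "T" + REPEAT + "TT"
genome_2 = "AA" + REPEAT + "T" + REPEAT + "A" + REPEAT + "TT"

# Reads as long as the repeat: both genomes give exactly the same reads.
same_at_4 = read_spectrum(genome_1, 4) == read_spectrum(genome_2, 4)

# Reads that span the repeat plus a unique base on each side: now they differ.
same_at_6 = read_spectrum(genome_1, 6) == read_spectrum(genome_2, 6)

print(same_at_4, same_at_6)
# prints: True False
```

With reads of length 4 no assembler could tell the two genomes apart, because the data are identical. Only reads that reach through a repeat copy into unique sequence on both sides carry the information needed to order the segments.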
Beyond contigs
Contig sizes are limited by:
• the length of repeats in your genome
– can’t change this!
• the length (or “span”) of the reads
– wait for new technology
– use “tricks” with existing technology
Paired reads
• DNA fragment (200-800 bp)
==============================
• Single end
-------->=====================
• Paired end (up to 800 bp span)
----->==================<-----
• Mate pair (up to 40 kbp span)
---->========/+/=========<----
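The paired-end diagram can be mimicked in code. This sketch assumes the usual inward-facing orientation, where the second read is the reverse complement of the far end of the fragment. The fragment, the read length and the function names are hypothetical.

```python
def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def paired_end(fragment, read_len):
    """Read inward from both ends of a fragment.
    Returns (forward read, reverse read, span of the pair)."""
    forward = fragment[:read_len]
    reverse = reverse_complement(fragment[-read_len:])
    return forward, reverse, len(fragment)

pair = paired_end("ACGTTGCATGGA", 4)
print(pair)
# prints: ('ACGT', 'TCCA', 12)
```

The middle of the fragment is never sequenced, but the assembler knows the two reads sit roughly one fragment length apart. That known span is the "trick" that lets a pair bridge a repeat longer than either read.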
Assessing assemblies
• We desire
– Total length similar to genome size
– Fewer, larger contigs
– No mistakes (mis-assemblies)
• Metrics
– No generally useful objective measure
– Longest, total bp, genes recovered, …
The “N50”
The length of the contig at which the running total of bases, counted from the shortest contig upward, first reaches 50% of the assembly
• Imagine we got 7 contigs with lengths:
– 1,1,3,5,8,12,20
• Total
– 1+1+3+5+8+12+20 = 50
• The “halfway sum” is 50 / 2 = 25
– 1+1+3+5+8+12 = 30 (≥ 25) so N50 is 12
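A short sketch of the calculation follows. Note one assumption: it uses the widely used convention of adding contigs from the largest down until half the total is reached, whereas the slide adds from the smallest up. Both give 12 for this example. The function name `n50` is chosen here for illustration.

```python
def n50(lengths):
    """Length of the contig at which the running sum, taken from the
    largest contig downward, first reaches half of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly

contigs = [1, 1, 3, 5, 8, 12, 20]
print(sum(contigs), n50(contigs))
# prints: 50 12
```

From the largest down, 20 alone is below 25, while 20 + 12 = 32 passes it, so the N50 is 12.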
N50 concerns
• Optimizing for N50
– encourages mis-assemblies!
• An aggressive assembler may over-join:
– 1,1,3,5,8,12,20 (previous)
– 1,1,3,5,20,20 (now)
– 1+1+3+5+20+20 = 50 (unchanged)
• The “halfway sum” is still 25
– 1+1+3+5+20 = 30 (≥ 25) so N50 is 20 (was 12)
Validation
• Self consistency
– Align reads back to contigs
– Check for errors or discordant pairs
• Second opinion
– Use two complementary sequencing methods
– Target troublesome areas for PCR
– Use a genome-wide “optical map”
Considerations
• Size of genome
– virus, bacteria, eukaryote, meta-genome
• Hardware
– phone, laptop, desktop, server, cloud
– RAM is more limiting than CPU
• Operating system
– Linux, Mac, Windows
• Software budget
– commercial, free, open-source
Recommendations
• SPAdes
– Unix command-line (Mac, Linux)
• VAGUE (Velvet)
– Unix GUI (Mac, Linux)
• CLC Genomics Workbench
– Java GUI (Windows, Mac, Linux)
– Commercial product
Online tutorial
• The GVL
– Genomics Virtual Laboratory
– http://genome.edu.au
• Protocols
– Microbial de novo assembly for Illumina data
– Written by Simon Gladman (VLSCI)
– https://genome.edu.au/wiki/Protocols