cs 394c march 19, 2012
DESCRIPTION
CS 394C March 19, 2012. Tandy Warnow. 2. Shotgun DNA Sequencing. DNA target sample. SHEAR & SIZE. End Reads / Mate Pairs. 550bp. Not all sequencing technologies produce mate-pairs. Different error models Different read lengths. 10,000bp. Basic Challenge. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/1.jpg)
CS 394C March 19, 2012
Tandy Warnow
![Page 2: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/2.jpg)
2
![Page 3: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/3.jpg)
DNA target sampleDNA target sampleSHEAR & SIZESHEAR & SIZE
End Reads / Mate PairsEnd Reads / Mate Pairs
550bp
10,000bp
Shotgun DNA Sequencing
Not all sequencing technologies produce mate-pairs. Different error modelsDifferent read lengths
![Page 4: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/4.jpg)
Basic Challenge
• Given many (millions or billions) of reads, produce a linear (or perhaps circular) genome
• Issues: – Coverage– Errors in reads– Reads vary from very short (35bp) to quite long
(800bp), and are double-stranded– Non-uniqueness of solution– Running time and memory
![Page 5: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/5.jpg)
Simplest scenario
• Reads have no error• Read are long enough that each
appears exactly once in the genome• Each read given in the same orientation
(all 5’ to 3’, for example)
![Page 6: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/6.jpg)
De novo vs. comparative assembly
• De novo assembly means you do everything from scratch
• Comparative assembly means you have a “reference” genome. For example, you want to sequence your own genome, and you have Craig Venter’s genome already sequenced. Or you want to sequence a chimp genome and you have a human already sequenced.
![Page 7: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/7.jpg)
Comparative Assembly
Much easier than de novo!Basic idea: • Take the reads and map them onto the
reference genome (allowing for some small mismatch)
• Collect all overlapping reads, produce a multiple sequence alignment, and produce consensus sequence
![Page 8: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/8.jpg)
Comparative Assembly
FastShort reads can map to several places
(especially if they have errors)Needs close reference genomeRepeats are problematicCan be highly accurate even when reads have
errors
![Page 9: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/9.jpg)
De Novo assembly
• Much easier to do with long reads• Need very good coverage• Generally produces fragmented
assemblies• Necessary when you don’t have a
closely related (and correctly assembled) reference genome
![Page 10: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/10.jpg)
10
De Novo Assembly paradigms
• Overlap Graph:overlap-layout-consensus methods– greedy (TIGR Assembler, phrap, CAP3...)– graph-based (Celera Assembler, Arachne)
• k-mer graph (especially useful for assembly from short reads)
![Page 11: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/11.jpg)
Overlap Graph
• Each read is a node• There is a directed edge from u to v if
the two reads have sufficient overlap• Objective: Find a Hamiltonian Path (for
linear genomes) or a Hamiltonian Circuit (for circular genomes)
![Page 12: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/12.jpg)
12
Paths through graphs and assembly
• Hamiltonian circuit: visit each node (city) exactly once, returning to the start
A
B D C
E
H G
I
F
A
B
C
D H
I
F
G E
Genome
![Page 13: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/13.jpg)
Hamiltonian Path Approach
• Hamiltonian Path is NP-hard (but good heuristics exist), and can have multiple solutions
• Dependency on detecting overlap (errors in reads, overlap length)
• Running time (all-pairs overlap calculation)• Repeats• Tends to produce fragmented assemblies
(contigs)
![Page 14: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/14.jpg)
Example Reads: • TAATACTTAGG• TAGGCCA• GCCAGGAAT • GAATAAGCCAA• GCCAATTT• AATTTGGAAT• GGAATTAAGGCAC• AGGCACGTTTA• CACGTTAGGACCATT• GGACCATTTAATACGGAT
If minimum overlap is 3, what graph do we get?If minimum overlap is 4, what graph do we get?If minimum overlap is 5, what graph do we get?
![Page 15: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/15.jpg)
15
Overlap-layout-consensusMain entity: readRelationship between reads: overlap
12
3
45
6
78
9
1 2 3 4 5 6 7 8 9
1 2 3
1 2 3
1 2 3 12
3
1 3
2
13
2
ACCTGAACCTGAAGCTGAACCAGA
![Page 16: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/16.jpg)
16
TIGR Assembler/phrap
Greedy
• Build a rough map of fragment overlaps
• Pick the largest scoring overlap
• Merge the two fragments• Repeat until no more merges
can be done
![Page 17: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/17.jpg)
Challenges for Overlap-Layout-Consensus
• Computing all-pairs overlap is computationally expensive, especially for NGS datasets, which can have millions of short reads.
• (The Hamiltonian Path part doesn’t help either)
• This computational challenge is increased for large genomes (e.g., human)
• Need something faster!
![Page 18: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/18.jpg)
k-mer graph (aka de Bruijn graph)
• Vertices are k-mers that appear in some read, and edges defined by overlap of k-1 nucleotides
• Small values of k produce small graphs
• Does not require all-pairs overlap calculation!• But: loss of information about reads can lead to
“chimeric” contigs, and incorrect assemblies
• Also produces fragmented assemblies (even shorter contigs)
![Page 19: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/19.jpg)
Eulerian Paths• An Eulerian path is one that goes through every edge exactly
once
• It is easy to see that if a graph has an Eulerian path, then all but 2 nodes have even degree. The converse is also true, but a bit harder to prove.
• For directed graphs, the cycle will need to follow the direction of the edges (also called “arcs”). In this case, a graph has an Eulerian path if and only if the indegree(v)=outdegree(v) for all but 2 nodes (x and y), where indegree(x)=outdegree(x)+1, and indegree(y)=outdegree(y)-1.
![Page 20: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/20.jpg)
de Bruijn Graphs are Eulerian
• If the k-mer set comes from a sequence and every k-mer appears exactly once in the sequence,
• then the de Bruijn graph has an Eulerian path!
![Page 21: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/21.jpg)
de Bruijn Graph
• Create the de Bruijn graph for the following string, using k=5– ACATAGGATTCAC
• Find the Eulerian path• Is the Eulerian path unique?• Reconstruct the sequence from this path
![Page 22: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/22.jpg)
Using de Bruijn Graphs
Given: set of k-mers from a DNA sequence
Algorithm: • Construct the de Bruijn graph• Find an Eulerian path in the graph• The path defines a sequence with the
same set of k-mers as the original
![Page 23: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/23.jpg)
Using de Bruijn Graphs
Given: set of k-mers from a set of reads for a sequence
Algorithm: • Construct the de Bruijn graph• Try to find an Eulerian path in the graph
![Page 24: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/24.jpg)
No matter what
• Because of – Errors in reads– Repeats– Insufficient coveragethe overlap graphs and de Bruijn graphs generally
don’t have Hamiltonian paths/circuits or Eulerian paths/circuits
• This means the first step doesn’t completely assemble the genome
![Page 25: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/25.jpg)
Reads, Contigs, and Scaffolds
• Reads are what you start with (35bp-800bp)
• Fragmented assemblies produce contigs that can be kilobases in length
• Putting contigs together into scaffolds is the next step
![Page 26: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/26.jpg)
Consensus (15-30Kbp)Consensus (15-30Kbp)
ReadsReads
ContigContigAssembly without pairs results Assembly without pairs results in contigs whose order and in contigs whose order and orientation are not known.orientation are not known.
??Pairs, especially groups of corroborating Pairs, especially groups of corroborating ones, link the contigs into scaffolds where ones, link the contigs into scaffolds where the size of gaps is well characterized.the size of gaps is well characterized.
2-pair2-pair
Mean & Std.Dev.Mean & Std.Dev.is knownis known
ScaffoldScaffold
Mate Pairs Give Order & Orientation
![Page 27: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/27.jpg)
27
Handling repeats1. Repeat detection
– pre-assembly: find fragments that belong to repeats• statistically (most existing assemblers)• repeat database (RepeatMasker)
– during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
– post-assembly: find repetitive regions and potential mis-assemblies. • Reputer, RepeatMasker• "unhappy" mate-pairs (too close, too far, mis-oriented)
2. Repeat resolution– find DNA fragments belonging to the repeat– determine correct tiling across the repeat
![Page 28: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/28.jpg)
Repeat Rez I, IIRepeat Rez I, II
Assembly Pipeline
OverlapperOverlapper
UnitigerUnitiger
ScaffolderScaffolder
AA
BBimpliesimplies
AA
BB
TRUE
OROR
AA BB
REPEAT-INDUCED
Find all overlaps Find all overlaps 40bp allowing 6% mismatch. 40bp allowing 6% mismatch. Trim & ScreenTrim & Screen
![Page 29: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/29.jpg)
Repeat Rez I, IIRepeat Rez I, II
Assembly PipelineCompute all overlap consistent sub-assemblies:Compute all overlap consistent sub-assemblies:
Unitigs (Uniquely Assembled Contig)
OverlapperOverlapper
UnitigerUnitiger
ScaffolderScaffolder
Trim & ScreenTrim & Screen
![Page 30: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/30.jpg)
Repeat Rez I, IIRepeat Rez I, II
Assembly Pipeline
OverlapperOverlapper
UnitigerUnitiger
ScaffolderScaffolder
Scaffold U-unitigs with confirmed pairsScaffold U-unitigs with confirmed pairs
Mated readsTrim & ScreenTrim & Screen
![Page 31: CS 394C March 19, 2012](https://reader034.vdocuments.mx/reader034/viewer/2022051003/5681682d550346895dddc6ee/html5/thumbnails/31.jpg)
Repeat Rez I, IIRepeat Rez I, II
Assembly Pipeline
OverlapperOverlapper
UnitigerUnitiger
ScaffolderScaffolder
Fill repeat gaps with doubly anchored positive unitigsFill repeat gaps with doubly anchored positive unitigs
Unitig>0Unitig>0
Trim & ScreenTrim & Screen