read mapping - carnegie mellon school of computer...
TRANSCRIPT
![Page 1: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/1.jpg)
Read MappingSlides by Carl Kingsford
![Page 2: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/2.jpg)
BowtieUltrafast and memory-efficient alignment of short DNA sequences to the human genome Ben Langmead, Cole Trapnell, Mihai Pop and Steven L Salzberg, Genome Biology 2009, 10:R25
![Page 3: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/3.jpg)
Bowtie Features
• Extends basic BWT search to handle mismatches
• Will find an exact match if it exists.
• Might not find the best inexact match
• Two innovations:
• quality-aware backtracking
• double indexing
![Page 4: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/4.jpg)
SOAP-like Policy (-v N)Don’t allow more than N mismatches.
Sequence (end to start) Size of range
in BWT search
A
A
C
T
G ∅
G
Repeat this until you hit 175 backtracks, or you find K matches (K is a parameter).
N ∈ {0,1,2,3}
![Page 5: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/5.jpg)
SOAP-like Policy (-v N)Don’t allow more than N mismatches.
Sequence (end to start) Size of range
in BWT search
A
A
C
T
G ∅
GC
T
G
G
Make a substitution on a path with < N
mismatches !
X → Y must generate a non-empty BWT
range. !
If several candidates, pick one with lowest
quality score.
Repeat this until you hit 175 backtracks, or you find K matches (K is a parameter).
N ∈ {0,1,2,3}
![Page 6: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/6.jpg)
Using Forward and Mirror Index
X
X
Use a BWT of both forward and reverse of string:
If mutation in first half of read, then the forward index can walk at least |read| / 2 before backtracking.
If mutation in second half of read, then the second index can walk at least |read| / 2 before backtracking.
Try both the forward and reverse indexes; will avoid a lot of backtracking because you will have narrowed the BWT range a lot by the time you start backtracking.
![Page 7: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/7.jpg)
Maq-like (default) Approach
5’ 3’“seed”
(default -l 28nt)
-n N mismatches allowed in seed
Total quality score of mismatched bases
must be < -e Q
N ∈ {0,1,2,3}
Up to N mismatches allowed in “seed”, which is prefix of the read
![Page 8: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/8.jpg)
3 Phases
≥ 00-20
seed
phase 1 (using mirror index)
01-2
phase 2 (using forward index)
phase 3a (using mirror index), extending phase 2 alignments outside of seed
≥ 011
phase 3b (using mirror index)
phase 2 ≥ 0
1-20
01-2
11
00
4 possible cases for distribution of 0-2 mutations within the seed:
Same backtracking scheme used to handle mismatches
![Page 9: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/9.jpg)
Bowtie Performance
![Page 10: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/10.jpg)
Bowtie 2
![Page 11: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/11.jpg)
Features
• Supports gapped alignment
• Uses bi-directional BWT instead of two separate BWTs
• Supports paired-end alignment
• Align one end as normal
• Find the window where the other end could go
• Do dynamic programming alignment step to align the other end of the pair within this window
![Page 12: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/12.jpg)
![Page 13: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/13.jpg)
priority of row i = 1/ ri2, where ri is the # of sequences in the range containing row i.
![Page 14: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/14.jpg)
Bi-directional BWT
P c QStandard “forward” BWT search:
Given range for Q, can find in constant time the range for Qc.
(Use LF mapping, etc.)
P cQ
BW matrix ranges
Thm. Given range(Q), reverseRange(Q), can find in O(|∑|)-time: • range(Qc) • range(cQ) • reverseRange(cQ) • reverseRange(Qc)
easy (this is the standard BWT search)
easy (this is the standard BWT search applied to the reverse string)
![Page 15: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/15.jpg)
Computing Range(Qc)
range(Q)
s
e
BWT matrixx = # of prefixes
Qd with d < c
range(Qc) = [s + x, s+ x + y - 1]
y = # of prefixes Qc
![Page 16: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/16.jpg)
Computing Range(Qc)
range(Q)
s
e
BWT matrixx = # of prefixes
Qd with d < c
range(Qc) = [s + x, s+ x + y - 1]
y = # of prefixes Qc
x = ∑d reverseRange((Qd)R) = reverseRange(dQR)
because the size of the range of Qc in T = the size of the range of (Qc)R in TR
![Page 17: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/17.jpg)
Computing Range(Qc)
range(Q)
s
e
BWT matrixx = # of prefixes
Qd with d < c
range(Qc) = [s + x, s+ x + y - 1]
y = # of prefixes Qc
y = reverseRange((Qc)R)
x = ∑d reverseRange((Qd)R) = reverseRange(dQR)
because the size of the range of Qc in T = the size of the range of (Qc)R in TR
![Page 18: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/18.jpg)
Computing Range(Qc)
range(Q)
s
e
BWT matrixx = # of prefixes
Qd with d < c
range(Qc) = [s + x, s+ x + y - 1]
y = # of prefixes Qc
y = reverseRange((Qc)R)
x = ∑d reverseRange((Qd)R) = reverseRange(dQR)
because the size of the range of Qc in T = the size of the range of (Qc)R in TR
Can extend reverseRange(Q) to reverseRange(dQ) with 1 BWT LF step
![Page 19: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/19.jpg)
Accuracy Results
![Page 20: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/20.jpg)
TopHatCole Trapnell, Lior Pachter and Steven L. Salzberg. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (2009) 25 (9): 1105-1111.
![Page 21: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/21.jpg)
TopHat
• Find Exons:
• Map reads
• Assemble exons (using “Maq”)
• Extend exons by 45bp
• Join exons closer than 6bp
• Find splices:
• Find every pair of donor/acceptor sites that are close enough
• Look at 2kmers spanning that junction (where k appears on each side; k=5)
• Use a 2k-mer lookup table to find those reads
![Page 22: Read Mapping - Carnegie Mellon School of Computer Scienceckingsf/class/02-714/Lec24-readalign.pdfSOAP-like Policy (-v N) Don’t allow more than N mismatches. Sequence (end to start)](https://reader034.vdocuments.mx/reader034/viewer/2022052019/60336541a5508a3086672744/html5/thumbnails/22.jpg)
TopHat Approach
(Trapnell et al., 2009)