precomputing edit-distance specificity of short oligonucleotides
DESCRIPTION
Precomputing Edit-Distance Specificity of Short Oligonucleotides. Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park. Polymerase Chain Reaction. Polymerase Chain Reaction. Primer Specificity. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/1.jpg)
Precomputing Edit-Distance
Specificity of Short
Oligonucleotides
Precomputing Edit-Distance
Specificity of Short
OligonucleotidesNathan EdwardsCenter for Bioinformatics and Computational BiologyUniversity of Maryland, College Park
![Page 2: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/2.jpg)
2
Polymerase Chain Reaction
![Page 3: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/3.jpg)
3
Polymerase Chain Reaction
![Page 4: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/4.jpg)
4
Primer Specificity
• Need to ensure that primers hybridize to a specific (specified) locus only• Require exactly one occurrence of specified
sequence• Require no (potential) mis-hybridization loci
• Bottleneck computation in primer-design• Design / check iteration is problematic
![Page 5: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/5.jpg)
5
k-unique 20-mers
• Edit-distance as a surrogate for mis-hybridization potential
• k-unique loci:• All non-self genomic loci are require more
than k edits in (global) alignment• Closest non-self genomic loci requires
(k+1) edits in (global) alignment
![Page 6: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/6.jpg)
6
Find all k-unique 20-mers
• Naïve algorithm: O(n2km)• Quadratic in size of genome.
• 0-unique (exact match) 20-mers• (Expected) linear time algorithm
• Achieve expected linear time using a hybrid approach (blastn):• Use partial exact match to “seed” expensive
dynamic programming alignment• Large chunks ) Fast, but miss occurrences• Small chunks ) Slow, but correct
![Page 7: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/7.jpg)
7
Baeza-Yates Perleberg: • Correct and O(n) for small k
• At least 1 chunk is observed with no error.• Small k → Large chunks → Fast and correct• Largest correct chunk: floor(m/(k+1))
Inexact sequence match
≠ = ≠q
g
![Page 8: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/8.jpg)
8
Example worst case alignments
TCCCGC-TAGATTGAGATCT||||||v||||||*||||||TCCCGCCTAGATTTAGATCT
ACTTGTCCACAGTGCTTAAG||||||*||||||*||||||ACTTGTGCACAGTCCTTAAG
![Page 9: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/9.jpg)
9
Brute-force approach
ACTTGTGCACAGTCCTTAAG
AA:18AC:1,9AG:11,19CA:8,10CC:14CT:2,15
GC:7GT:5,12TA:17TC:13TG:4,6 TT:3,16
2-mer position table
![Page 10: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/10.jpg)
10
Brute-force approach
ACTTGTGCACAGTCCTTAAG
ACTTGTGCACAGTCCTTAAG
![Page 11: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/11.jpg)
11
Brute-force approach
ACTTGTGCACAGTCCTTAAG
ACTTGTGCACAGTCCTTAAG
![Page 12: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/12.jpg)
12
Brute-force approach
ACTTGTGCACAGTCCTTAAG
ACTTGTGCACAGTCCTTAAG
![Page 13: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/13.jpg)
13
Brute-force approach
ACTTGTGCACAGTCCTTAAG
ACTTGTGCACAGTCCTTAAG
![Page 14: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/14.jpg)
14
Brute-force approach
ACTTGTGCACAGTCCTTAAG
ACTTGTGCACAGTCCTTAAG
![Page 15: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/15.jpg)
15
Brute-force approach
ACTTGTGCACAGTCCTTAAG
ACTTGTGCACAGTCCTTAAG
![Page 16: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/16.jpg)
16
Brute-force approach
Divide the genome into 10 Mb blocksFor all pairs of blocks:
For all l-mer matches:Do all pair-wise DPs containing matchIf ≤ k edits, mark position non-unique
300 x 300 pairs of blocksFor 20-mers:
k=1 ) l=10; k=2 ) l=6; k=3 ) l=5 ; k=4 ) l=4.
![Page 17: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/17.jpg)
17
Brute-force approach
Things are looking really, really, bad:• Seeds are too short• 90,000 pair-wise block comparisons
Actually quite good (seed size 12):• Non-uniqueness certificates are dense• Almost all positions eliminated early• Behaves more like linear time than
quadratic
![Page 18: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/18.jpg)
18
In practice (edit-dist 4)
![Page 19: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/19.jpg)
19
In practice (edit-dist 4)
![Page 20: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/20.jpg)
20
In practice (edit-dist 4)
![Page 21: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/21.jpg)
21
In practice (edit-dist 3)
![Page 22: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/22.jpg)
22
In practice (edit-dist 3)
![Page 23: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/23.jpg)
23
In practice (edit-dist 4,3,2)
![Page 24: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/24.jpg)
24
In practice (edit-dist 4,3,2)
![Page 25: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/25.jpg)
25
Edit distance 2
• After seed size 12• ~ 27K (0.288%) positions have no match
• After seed size 8• ~ 3K (0.029%) positions have no match
• Using seed size 6 is still too slow• Need a more sophisticated hashing strategy• 6-mers match in too many places!
![Page 26: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/26.jpg)
26
Spaced seed-set design problem
• Given:• mer-size: m ( = 20 )• # errors: k ( = 1,2,3)• # cares: l ( = 10,12,14 )
• Find the smallest set of spaced seeds that will find all alignments.
![Page 27: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/27.jpg)
27
Solution for (20,2,8)
• 11111111, 111101111
TCCCGCGTAGATTGAGATCT ||||||*||||||*|||||| TCCCGCCTAGATTTAGATCT
• How can we find these spaced seed set solutions?• One/two table? 2 x false positives
![Page 28: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/28.jpg)
28
Spaced seed set design set-cover formulation
• Set cover instance:• Ground set: all possible placements of the
k errors (alignments)• Covering sets: all possible placements of
the l care positions
• For (m=20,k=2,l=10),• 190 elements, 184,756 sets!• Need to reduce the number of sets!
![Page 29: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/29.jpg)
29
Dirty secret of spaced seeds
• Spaced seeds take O(# cares) to update!• Contiguous seeds are O(1) to update
• 101010101010101 vs 11111111• 8 steps to update vs 1 step to update
• Constant time update for spaced seeds?• Yes, if they have a certain structure(s)• Restrict spaced seeds to small update cost
![Page 30: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/30.jpg)
30
O(1) spaced seed update
ACGTACGTACGTACGTACGT1: A G A G2: C T C T ...Spaced seed 1010101 can be updated
in 1 step!
![Page 31: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/31.jpg)
31
O(1) spaced seed update
ACGTACGTACGTACGTACGTACGTACG -> ACGACG CGTACGT -> CGTCGT...Spaced seed 1110111 can be updated
in 1 step!
![Page 32: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/32.jpg)
32
O(1) spaced seed update
• “Period” step update• 11001100110011 2 steps• 1000010000100001 1 step
• “Runs” step update• 11100111111 1 step• 11101110111 2 steps
• Minimize the number of update steps• Weighted set-cover instance…
![Page 33: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/33.jpg)
33
Edit-distance SS-SDP
• Position of matching bases might shift!• Need 11111111 ↓ to get CCGCTAGA
• Need 111101111 ↑ to get CCGCTAGA
• Set cover formulation no longer works
TCCCGC-TAGATTGAGATCT||||||v||||||*||||||TCCCGCCTAGATTTAGATCT
![Page 34: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/34.jpg)
34
Edit-Distance SS-SDP
• Use a variation on set cover:• q:111101111,r:11111111 covers:
• Pay for query & reference update costs separately
• Control size of problem by only enumerating templates with small update cost
r:TCCCGC-TAGATTGAGATCT ||||||v||||||*||||||q:TCCCGCCTAGATTTAGATCT
![Page 35: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/35.jpg)
35
Correct solutions for 1-unique 20-mers107 random sequence x 107 random sequence
• Seed: 1111111111 (Best single seed solution, weight 10)• ~ 9.5 expensive dynamic programs per locus
• Seed set: 11111111111;111110111111 (weight 11)• ~ 8.9 DP/locus
• Seed set (weight 11)
• ~ 7.8 DP/locus
• Seed set: 111111111111;1111101111111 (weight 12)• ~ 2.2 DP/locus
11111111111 ~ 111101111111111101111111 ~ 11111111111111101111111 ~ 111101111111
![Page 36: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/36.jpg)
36
Correct solutions for 1-unique 20-mers
107 random sequence x 107 random sequence
• Seed set: 111111111111;1111101111111 (weight 12)• ~ 2.5 DP/locus
• Seed set (weight 12)
• ~ 1.8 DP/locus
• Seed set: 1111111111111;11111101111111 (weight 13)• ~ 0.56 DP/locus (same specificity as contiguous seed weight 12)
• Seed set (weight 14)• ~ 0.20
1111110111111 x 111111111111 1111110111111
111111001111111
11111111111111 ~ 11111111111111 111110111111111 ~ 11111111111111 111111111110111 ~ 11111111111111 111111111110111 ~ 111111111110111 11111111111111 ~ 111111101111111111110111111111 ~ 111110111111111
![Page 37: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/37.jpg)
Correct solutions for 2-unique 20-mers
• Seed: 111111 (Best single seed solution, weight 6)• ~ 2439 DP/locus
• Weight 10 – 73 DP/locus (specificity of 8/9 contig seed)
• Weight 12 – 10 DP/locus (specificity of 10 contig seed)
37
1111111111 ~ 1111111111 11111011111 ~ 1111111111 11111011111 ~ 11111011111 1111100000011111 ~ 1111100000011111 11111000000011111 ~ 1111100000011111 111110000000011111 ~ 1111100000011111 111110000011111 ~ 1111100000011111 111110000000011111 ~ 11111000000000011111
![Page 38: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/38.jpg)
38
k-unique human 20-mers
• No 4-unique 20-mers• No 3-unique 20-mers
• 0. 038% of (forward) human 20-mers are 2-unique• 1088322 in total• about 1 every 2638 bases• Fast 2-uniquness oracle
![Page 39: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/39.jpg)
Genome Browser Track
Edit Distance:
•UCSC Track: 1-unique 20-mers
•UCSC Track: 2-unique 20-mers
39
![Page 40: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/40.jpg)
40
Integration with High Performance Computing
• Break sequence into chunks of size 107.• Remember which positions have been eliminated.
• Integrated with (UMIACS) Condor• Too unreliable for very large sequences• NFS filesystem is unreliable• Simultaneous jobs capped at ~ 300
• Integrated with hadoop/map-reduce on 80 nodes (Edwards Lab)• Reliability improved, DFS solves (some) filesystem
woes• Much better scalability (in theory, yet to be tested)• Explicit synchronization of reduce step is undesirable.
![Page 41: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/41.jpg)
41
Other improvements
• Other groupings are possible:• Species designation on FASTA defline can be
any suitable partition• Constraints on the position of edits:
• False positive due to mishybridization at 3’ end is unlikely to be observed with some technologies
• Constraint on valid Tm range:• Computed as in Primer3• Can eliminate undesirable mers early
![Page 42: Precomputing Edit-Distance Specificity of Short Oligonucleotides](https://reader030.vdocuments.mx/reader030/viewer/2022032313/56812b3b550346895d8f4e3f/html5/thumbnails/42.jpg)
42
Conclusions
• Precompute of human k-unique 20-mers is now feasible!• Faster for large edit-distance!• Need spaced seed-set designs
• Constant time update for spaced seeds• Good integer programming formulation of SS-
SDP• Limited template enumeration based on update cost• Work with integer programming experts to solve
effectively