![Page 1: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/1.jpg)
CS5263 Bioinformatics
Lecture 15 & 16
Exact String Matching Algorithms
![Page 2: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/2.jpg)
Definitions
• Text: a longer string T
• Pattern: a shorter string P
• Exact matching: find all occurrences of P in T
T: abayababaxababb (length m)
P: aba (length n)
![Page 3: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/3.jpg)
The naïve algorithm
T: abayababaxababb
   aba
    aba
     aba
      aba
       aba
        aba
         aba
          aba
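The sliding comparison pictured above can be sketched in Python as follows. This is a minimal illustration rather than code from the lecture; the function name `naive_search` and the choice to report 0-based start positions are assumptions made here.

```python
def naive_search(text, pattern):
    """Try every alignment of pattern under text, left to right."""
    m, n = len(text), len(pattern)
    matches = []
    for start in range(m - n + 1):
        k = 0
        # Compare character by character until a mismatch or a full match.
        while k < n and text[start + k] == pattern[k]:
            k += 1
        if k == n:
            matches.append(start)
    return matches


if __name__ == "__main__":
    print(naive_search("abayababaxababb", "aba"))  # [0, 4, 6, 10]
```

Every alignment restarts the comparison from the first character of P, which is where the O(mn) worst case comes from.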
![Page 4: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/4.jpg)
Time complexity
• Worst case: O(mn)
• Best case: O(m)
  – aaaaaaaaaaaaaa vs baaaaaaa
• Average case?
  – Alphabet A, C, G, T
  – Assume both P and T are random
  – Equal probability
  – How many chars do you need to compare before moving to the next position?
![Page 5: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/5.jpg)
Average case time complexity
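P(mismatch at 1st position): $\frac{3}{4}$
P(mismatch at 2nd position): $\frac{1}{4} \cdot \frac{3}{4}$
P(mismatch at 3rd position): $\left(\frac{1}{4}\right)^{2} \cdot \frac{3}{4}$
P(mismatch at kth position): $\left(\frac{1}{4}\right)^{k-1} \cdot \frac{3}{4}$
Expected number of comparisons per position, with $p = 1/4$:
$$\sum_{k} k\,(1-p)\,p^{k-1} = \frac{1-p}{p} \sum_{k} k\,p^{k} = \frac{1}{1-p} = \frac{4}{3}$$
Average complexity: 4m/3
Not as bad as you thought it might be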
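As a quick numerical check of the 4/3 figure, the sum can be evaluated directly. The sketch below assumes a pattern of finite length n, so the sum is cut off after n comparisons (the slide's sum runs over all k); the function name `expected_comparisons` is hypothetical.

```python
def expected_comparisons(n, p=0.25):
    """Expected comparisons at one alignment for a random pattern of length n.

    p is the probability that two random characters match (1/4 for A, C, G, T).
    A mismatch first seen at position k costs k comparisons; if the first
    n - 1 characters all match, exactly n comparisons are made.
    """
    total = 0.0
    for k in range(1, n):
        total += k * (1 - p) * p ** (k - 1)
    total += n * p ** (n - 1)
    return total


if __name__ == "__main__":
    for n in (1, 2, 5, 20):
        print(n, expected_comparisons(n))
```

The truncated sum equals (1 - p^n) / (1 - p), so it approaches 4/3 very quickly as n grows.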
![Page 6: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/6.jpg)
Biological sequences are not random
T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: aaaab
Plus: 4m/3 average case is still bad for long genomic sequences!
Especially if P is not in T…
Smarter algorithms:
O(m + n) in worst case
Sub-linear in practice
![Page 7: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/7.jpg)
String matching scenarios
• One T and one P
  – Search a word in a document
• One T and many P all at once
  – Search a set of words in a document
  – Spell checking
• One fixed T, many P
  – Search a completed genome for a short sequence
• Two (or many) T’s for common patterns
![Page 8: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/8.jpg)
How to speed up?
• Pre-processing T or P
• Why can pre-processing save us time?
  – Uncovers the structure of T or P
  – Determines when we can skip ahead without missing anything
  – Determines when we can infer the result of character comparisons without doing them
T: ACGTAXACXTAXACGXAX
P: ACGTACA
![Page 9: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/9.jpg)
Cost for exact string matching
Total cost = cost(preprocessing)   (overhead)
           + cost(comparison)      (minimize)
           + cost(output)          (constant)
Hope: gain > overhead
![Page 10: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/10.jpg)
Which string to preprocess?
• One T and one P
  – Preprocessing P?
• One T and many P all at once
  – Preprocessing P or T?
• One fixed T, many P (unknown)
  – Preprocessing T?
• Two (or many) T’s for common patterns
  – ???
![Page 11: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/11.jpg)
Pattern pre-processing algs
– Karp – Rabin algorithm
  • Small alphabet and small pattern
– Boyer – Moore algorithm
  • The choice for most cases
  • Typically sub-linear time
– Knuth-Morris-Pratt algorithm (KMP)
  • grep
– Aho-Corasick algorithm
  • fgrep
![Page 12: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/12.jpg)
Karp – Rabin Algorithm
• Let’s say we are dealing with binary numbers
Text: 01010001011001010101001
Pattern: 101100
• Convert pattern to integer
101100 = 2^5 + 2^3 + 2^2 = 44
![Page 13: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/13.jpg)
Karp – Rabin algorithm
Text: 01010001011001010101001
Pattern: 101100 = 44 decimal
The current 6-character window is shown in brackets:
[101110]11001010101001 = 2^5 + 2^3 + 2^2 + 2^1 = 46
1[011101]1001010101001 = 46 * 2 - 64 + 1 = 29
10[111011]001010101001 = 29 * 2 - 0 + 1 = 59
101[110110]01010101001 = 59 * 2 - 64 + 0 = 54
1011[101100]1010101001 = 54 * 2 - 64 + 0 = 44
![Page 14: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/14.jpg)
Karp – Rabin algorithm
What if the pattern is too long to fit into a single integer?
Pattern: 101100. But our machine only has 5 bits
Basic idea: hashing. 44 % 13 = 5
The current 6-character window is shown in brackets:
[101110]11001010101001 = 46 (% 13 = 7)
1[011101]1001010101001 = 46 * 2 - 64 + 1 = 29 (% 13 = 3)
10[111011]001010101001 = 29 * 2 - 0 + 1 = 59 (% 13 = 7)
101[110110]01010101001 = 59 * 2 - 64 + 0 = 54 (% 13 = 2)
1011[101100]1010101001 = 54 * 2 - 64 + 0 = 44 (% 13 = 5)
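The rolling update and the modulo step can be sketched together in Python. This is an illustrative sketch under a few assumptions: the text consists of digit characters, the base is 2 and the modulus 13 as in the slides, and the names `window_hashes` and `karp_rabin` are made up here. Because different windows can share a hash value (46 and 59 both give 7 above), a hash hit is confirmed by a direct string comparison.

```python
def window_hashes(text, n, base=2, modulus=13):
    """Hash of every length-n window of a digit string, computed by rolling."""
    top = pow(base, n, modulus)  # weight of the digit that leaves the window
    h = 0
    for ch in text[:n]:
        h = (h * base + int(ch)) % modulus
    hashes = [h]
    for start in range(len(text) - n):
        # Shift left by one digit, drop the old leading digit, add the new one.
        h = (h * base - int(text[start]) * top + int(text[start + n])) % modulus
        hashes.append(h)
    return hashes


def karp_rabin(text, pattern, base=2, modulus=13):
    """Return 0-based start positions where pattern occurs in text."""
    n = len(pattern)
    if n == 0 or n > len(text):
        return []
    target = window_hashes(pattern, n, base, modulus)[0]
    return [
        start
        for start, h in enumerate(window_hashes(text, n, base, modulus))
        if h == target and text[start:start + n] == pattern  # verify the hit
    ]


if __name__ == "__main__":
    print(window_hashes("10111011001010101001", 6)[:5])  # [7, 3, 7, 2, 5]
    print(karp_rabin("10111011001010101001", "101100"))  # [4]
```

In practice a much larger modulus would be chosen so that false hits are rare; 13 is used only to mirror the slide.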
![Page 15: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/15.jpg)
Boyer – Moore algorithm
• Three ideas:
  – Right-to-left comparison
  – Bad character rule
  – Good suffix rule
![Page 16: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/16.jpg)
Boyer – Moore algorithm
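• Right-to-left comparison
[Figure: P is compared with T from right to left; character x in T mismatches character y in P, and P is shown again at a shifted position.]
Skip some chars without missing any occurrence.
But how?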
![Page 17: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/17.jpg)
Bad character rule
   0        1
   12345678901234567
T: xpbctbxabpqqaabpq
P:   tpabxab
       *^^^^
What would you do now?
![Page 18: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/18.jpg)
Bad character rule
   0        1
   12345678901234567
T: xpbctbxabpqqaabpq
P:   tpabxab
       *^^^^
P:     tpabxab
![Page 19: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/19.jpg)
Bad character rule
   0        1
   123456789012345678
T: xpbctbxabpqqaabpqz
P:   tpabxab
       *^^^^
P:     tpabxab
             *
P:            tpabxab
![Page 20: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/20.jpg)
Basic bad character rule
char Right-most-position in P
a 6
b 7
p 2
t 1
x 5
P: tpabxab
Pre-processing: O(n)
![Page 21: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/21.jpg)
Basic bad character rule
char Right-most-position in P
a 6
b 7
p 2
t 1
x 5
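T: xpbctbxabpqqaabpqz
P:   tpabxab
       *^^^^
P:     tpabxab
Here k is the mismatch position in T and i is the mismatch position in P.
When the rightmost occurrence of T(k) in P is to the left of i, shift P to align T(k) with that rightmost occurrence.
T(k) = t, i = 3. Shift 3 - 1 = 2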
![Page 22: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/22.jpg)
Basic bad character rule
char Right-most-position in P
a 6
b 7
p 2
t 1
x 5
T: xpbctbxabpqqaabpqz
P:     tpabxab
             *
P:            tpabxab
When T(k) is not in P, shift the left end of P to align with T(k+1).
T(k) = q, i = 7. Shift 7 - 0 = 7
![Page 23: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/23.jpg)
Basic bad character rule
char Right-most-position in P
a 6
b 7
p 2
t 1
x 5
T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^
P:          tpabxab
When the rightmost occurrence of T(k) in P is to the right of i, shift P by one position.
T(k) = a, i = 5. 5 - 6 < 0, so shift 1
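The three cases of the basic bad character rule collapse into one formula: shift by i minus the rightmost position of T(k) in P (0 if absent), but never by less than 1. The Python sketch below illustrates this; the function names are hypothetical, positions in the table are 1-based to match the slides, and the search uses only the bad character rule (no good suffix rule).

```python
def rightmost_positions(pattern):
    """Map each character to its rightmost 1-based position in pattern."""
    return {ch: i for i, ch in enumerate(pattern, start=1)}


def bad_character_shift(table, i, text_char):
    """Shift after a mismatch at 1-based pattern position i against text_char."""
    return max(1, i - table.get(text_char, 0))


def search_bad_character(text, pattern):
    """Right-to-left comparison with basic bad character shifts only."""
    table = rightmost_positions(pattern)
    n, m = len(pattern), len(text)
    matches = []
    start = 0  # 0-based position in text of the left end of pattern
    while start + n <= m:
        i = n
        while i > 0 and pattern[i - 1] == text[start + i - 1]:
            i -= 1
        if i == 0:
            matches.append(start)
            start += 1
        else:
            start += bad_character_shift(table, i, text[start + i - 1])
    return matches


if __name__ == "__main__":
    table = rightmost_positions("tpabxab")
    print(bad_character_shift(table, 3, "t"))  # 2
    print(bad_character_shift(table, 7, "q"))  # 7
    print(bad_character_shift(table, 5, "a"))  # 1
    print(search_bad_character("abayababaxababb", "aba"))  # [0, 4, 6, 10]
```

The three printed shifts correspond to the three slide examples with P = tpabxab.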
![Page 24: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/24.jpg)
Extended bad character rule
char Position in P
a 6, 3
b 7, 4
p 2
t 1
x 5
T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^
P:           tpabxab
Find the occurrence of T(k) in P that is nearest to the left of i, and shift P to align T(k) with that position.
T(k) = a, i = 5. 5 - 3 = 2, so shift 2
Preprocessing is still O(n)
![Page 25: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/25.jpg)
Extended bad character rule
• Best possible: m / n comparisons
• Works better for large alphabet size
• In some cases the extended bad character rule is sufficiently good
• Worst-case: O(mn)
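One way the extended rule's lookup might be implemented is to keep, for each character, the sorted list of its positions in P and binary-search for the nearest one to the left of i. This is a sketch under assumptions: the names `position_lists` and `extended_shift` are hypothetical, positions are 1-based as in the slides, and the lists are stored in ascending order (the slide lists them right to left).

```python
from bisect import bisect_left


def position_lists(pattern):
    """Map each character to the ascending list of its 1-based positions."""
    positions = {}
    for i, ch in enumerate(pattern, start=1):
        positions.setdefault(ch, []).append(i)
    return positions


def extended_shift(positions, i, text_char):
    """Shift after a mismatch at 1-based pattern position i against text_char.

    Aligns text_char with its nearest occurrence in the pattern strictly to
    the left of i, or moves the pattern fully past it if there is none.
    """
    occurrences = positions.get(text_char, [])
    idx = bisect_left(occurrences, i)  # count of occurrences left of i
    nearest = occurrences[idx - 1] if idx > 0 else 0
    return i - nearest


if __name__ == "__main__":
    positions = position_lists("tpabxab")
    print(positions["a"], positions["b"])     # [3, 6] [4, 7]
    print(extended_shift(positions, 5, "a"))  # 2
    print(extended_shift(positions, 7, "q"))  # 7
```

Unlike the basic rule, the shift here is always at least 1 without a special case, because the nearest occurrence is strictly to the left of i.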
![Page 26: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/26.jpg)
   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:    qcabdabdab
According to extended bad character rule
![Page 27: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/27.jpg)
(weak) good suffix rule
   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:      qcabdabdab
![Page 28: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/28.jpg)
(Weak) good suffix rule
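[Figure: in T, a character x is followed by the matched substring t. In P, a character y is followed by the suffix t, and an earlier copy t’ of t occurs further left. P is shifted so that t’ lines up with t in T.]
In preprocessing: for every suffix t of P, find the rightmost other copy t’ of t in P (t’ is a different occurrence from the suffix t itself).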
![Page 29: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/29.jpg)
(Strong) good suffix rule
   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
![Page 30: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/30.jpg)
(Strong) good suffix rule
   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:         qcabdabdab
![Page 31: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/31.jpg)
(Strong) good suffix rule
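• Pre-processing can be done in linear time
• If P is in T, the search may take O(mn)
• If P is not in T, worst case is O(m + n)
[Figure: as for the weak rule, with T containing x followed by t and P containing y followed by the suffix t. The copy t’ in P is preceded by a character z that differs from y.]
In preprocessing: for every suffix t of P, find the rightmost other copy t’ of t in P such that the char to the left of t’ differs from the char to the left of t.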
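The difference between the weak and strong rules can be checked on the example pattern with a brute-force sketch. This is not the linear-time preprocessing mentioned above, just a direct reading of the two definitions; the function name `good_suffix_shift` is hypothetical, and the case where no suitable copy t’ exists (not covered on these slides) simply returns `None`.

```python
def good_suffix_shift(pattern, i, strong=True):
    """Shift given a mismatch at 1-based position i, with pattern[i:] matched.

    Scans right to left for another copy t' of the matched suffix t. With
    strong=True the character before t' must differ from the character
    before t (the one that just mismatched).
    """
    n = len(pattern)
    t = pattern[i:]  # matched suffix, P[i+1..n] in 1-based terms
    if not t:
        return None
    suffix_start = n - len(t)
    for start in range(suffix_start - 1, -1, -1):
        if pattern[start:start + len(t)] != t:
            continue
        if strong and start > 0 and pattern[start - 1] == pattern[i - 1]:
            continue  # same preceding char would mismatch again
        return suffix_start - start
    return None  # no copy found; the full rule handles this case separately


if __name__ == "__main__":
    print(good_suffix_shift("qcabdabdab", 8, strong=False))  # 3
    print(good_suffix_shift("qcabdabdab", 8, strong=True))   # 6
```

For P = qcabdabdab with the mismatch at position 8, the weak rule picks the copy of ab at positions 6 to 7 (shift 3), while the strong rule rejects it because it is also preceded by d, and picks the copy at positions 3 to 4 (shift 6).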
![Page 32: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/32.jpg)
Lessons From B-M
• Sub-linear time is possible
  – But we still need to read T from disk!
• Bad cases require periodicity in P or T
  – Matching random P with T is easy!
• Large alphabets mean large shifts
• Small alphabets make complicated shift data-structures possible
• B-M better for “English” and amino-acids than for DNA.
![Page 33: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/33.jpg)
Algorithm KMP
• Not the fastest
• Best known
• Good for multiple pattern matching and real-time matching
• Idea
  – Left-to-right comparison
  – Shift P more chars when possible
![Page 34: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/34.jpg)
Basic idea
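[Figure: P is aligned under T and matches up to position i, where the next characters x in T and y in P mismatch. The matched part of P ends with t, and P also starts with a prefix t’ equal to t, followed by a character z. P is shifted so that t’ lines up with t in T.]
In pre-processing: for any position i in P, find the longest proper suffix of P[1..i], t = P[j+1..i], such that t matches a prefix of P, t’, and the char following t differs from the char following t’, i.e., P[i+1] != P[i-j+1]. Sp’(i) = length(t)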
![Page 35: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/35.jpg)
Example
P: aataac
a a t a a c
Sp’(i) 0 1 0 0 2 0
aaat
aataac
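The Sp’ values in the table can be reproduced by applying the definition literally. The sketch below is a brute-force check, not the efficient preprocessing; the function name `sp_prime` is hypothetical, and for the last position (where there is no next character in P) it assumes the mismatch condition holds trivially.

```python
def sp_prime(pattern):
    """Sp'(i) for i = 1..n, computed directly from the definition."""
    n = len(pattern)
    values = []
    for i in range(1, n + 1):  # 1-based position i
        best = 0
        for length in range(i - 1, 0, -1):  # proper suffix lengths, longest first
            suffix = pattern[i - length:i]  # t, the last `length` chars of P[1..i]
            prefix = pattern[:length]       # t', the first `length` chars of P
            # pattern[i] is P[i+1]; pattern[length] is the char following t'.
            next_differs = i == n or pattern[length] != pattern[i]
            if suffix == prefix and next_differs:
                best = length
                break
        values.append(best)
    return values


if __name__ == "__main__":
    print(sp_prime("aataac"))   # [0, 1, 0, 0, 2, 0]
    print(sp_prime("abababc"))  # [0, 0, 0, 0, 0, 4, 0]
```

Note that Sp’(4) is 0 for aataac even though the suffix a matches the prefix a: the characters that follow (both a) are the same, so retrying there would fail again.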
![Page 36: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/36.jpg)
Failure link
P: aataac
a a t a a c
Sp’(i) 0 1 0 0 2 0
aaat
aataac
If a char in T fails to match at pos 6, re-compare it with the char at pos 3
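A search driven by failure links might look like the sketch below. The function name `kmp_search` and the parameter `sp_prime` are hypothetical; the table is passed in by hand, taken from the Sp’(i) row above, since nothing here depends on how it is computed. On a mismatch after j matched characters, the same text character is re-compared with the pattern character at position Sp’(j) + 1.

```python
def kmp_search(text, pattern, sp_prime):
    """Left-to-right search using a precomputed failure table.

    sp_prime[i - 1] holds Sp'(i) for 1-based position i of pattern.
    Returns 0-based start positions of all occurrences.
    """
    n = len(pattern)
    matches = []
    j = 0  # number of pattern chars currently matched
    for pos, ch in enumerate(text):
        # Follow failure links until ch matches or nothing is matched.
        while j > 0 and ch != pattern[j]:
            j = sp_prime[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == n:
            matches.append(pos - n + 1)
            j = sp_prime[n - 1]  # continue after a full match
    return matches


if __name__ == "__main__":
    table = [0, 1, 0, 0, 2, 0]  # Sp'(i) for P = aataac
    print(kmp_search("aacaataaaaataaccttacta", "aataac", table))  # [9]
```

With five characters matched and a mismatch at pos 6, `j` drops to Sp’(5) = 2, so the text character is next compared with pos 3 of P, as stated above. The text pointer never moves backwards.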
![Page 37: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/37.jpg)
FSA
P: aataac
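[Figure: FSA with states 0 to 6. Forward edges labeled a, a, t, a, a, c lead from state 0 to state 6, with additional edges labeled a and t.]
All other input goes to state 0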
Sp’(i) 0 1 0 0 2 0
aaat
aataac
If the next char in T is t, we go to state 3
![Page 38: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/38.jpg)
Another example
P: abababc
a b a b a b c
Sp’(i) 0 0 0 0 0 4 0
abab
abababab
ababaababc
![Page 39: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/39.jpg)
Failure link
P: abababc
a b a b a b c
Sp’(i) 0 0 0 0 0 4 0
ababaababc
If a char in T fails to match at pos 7, re-compare it with the char at pos 5
![Page 40: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/40.jpg)
FSA
P: abababc
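[Figure: FSA with states 0 to 7. Forward edges labeled a, b, a, b, a, b, c lead from state 0 to state 7, with an additional edge labeled a.]
Sp’(i) 0 0 0 0 0 4 0
ababaababc
If the next char in T is a, go to state 5
All other input goes to state 0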
![Page 41: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/41.jpg)
Difference between Failure Link and FSA?
• Failure link
  – Preprocessing time and space are O(n), regardless of alphabet size
  – Comparison time is at most 2m
• FSA
  – Preprocessing time and space are O(n |Σ|)
    • May be a problem for very large alphabet size
  – Comparison time is always m.
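The FSA side of the comparison can be sketched as a transition table with one row per state and one entry per alphabet character, which is where the O(n |Σ|) space comes from. The construction below is a brute-force illustration, not an efficient one; the names `build_fsa` and `fsa_search` and the explicit `alphabet` argument are assumptions made here.

```python
def build_fsa(pattern, alphabet):
    """Transition table: delta[state][ch] = next state.

    State s means the last s characters read equal pattern[:s]. The next state
    is the longest prefix of pattern that is a suffix of pattern[:s] + ch.
    """
    n = len(pattern)
    delta = []
    for state in range(n + 1):
        row = {}
        for ch in alphabet:
            seen = pattern[:state] + ch
            k = min(n, len(seen))
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            row[ch] = k
        delta.append(row)
    return delta


def fsa_search(text, pattern, alphabet):
    """One table lookup per text character; returns 0-based match starts."""
    delta = build_fsa(pattern, alphabet)
    state, matches = 0, []
    for pos, ch in enumerate(text):
        state = delta[state].get(ch, 0)  # chars outside the alphabet reset to 0
        if state == len(pattern):
            matches.append(pos - len(pattern) + 1)
    return matches


if __name__ == "__main__":
    print(build_fsa("aataac", "act")[5]["t"])   # 3
    print(build_fsa("abababc", "abc")[6]["a"])  # 5
    print(fsa_search("aacaataaaaataaccttacta", "aataac", "act"))  # [9]
```

The two printed transitions match the FSA slides: from state 5 of aataac on t the automaton goes to state 3, and from state 6 of abababc on a it goes to state 5.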
![Page 42: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/42.jpg)
Failure link
P: aataac
a a t a a c
Sp’(i) 0 1 0 0 2 0
aaat
aataac
If a char in T fails to match at pos 6, re-compare it with the char at pos 3
![Page 43: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/43.jpg)
Example
P: aataac
T: aacaataaaaataaccttacta
   aataac
   ^^*
    aataac
    .*
      aataac
      ^^^^^*
         aataac
         ..*
            aataac
            .^^^^^
(^ = compared and matched, * = mismatch, . = known to match without comparing)
Each char in T may be compared multiple times. Up to n.
Time complexity: at most 2m comparisons, i.e. O(m).
Comparison phase and shift phase. Comparison is bounded by m, shift is also bounded by m.
![Page 44: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/44.jpg)
Example
T: aacaataaaaataaccttacta
Each char in T will be examined exactly once.
Therefore, exactly m comparisons are needed.
Takes longer to do pre-processing.
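[Figure: FSA for P = aataac with states 0 to 6. Forward edges labeled a, a, t, a, a, c lead from state 0 to state 6, with additional edges labeled a and t.]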
1201234501234560001001
![Page 45: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfec1a28abf838cb8487/html5/thumbnails/45.jpg)
How to do pre-processing?