faster algorithms for string matching with k mismatches
Faster algorithms for string matching with k mismatches. Adviser: R. C. T. Lee. Speaker: C. C. Yen. Journal of Algorithms, Volume 50, Issue 2, February 2004, Pages 257-275. Amihood Amir, Moshe Lewenstein and Ely Porat. String matching with k mismatches.
![Page 1: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/1.jpg)
1
Faster algorithms for string matching with k mismatches
Adviser : R. C. T. Lee
Speaker: C. C. Yen
Journal of Algorithms, Volume 50, Issue 2, February 2004, Pages 257-275
Amihood Amir, Moshe Lewenstein and Ely Porat
![Page 2: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/2.jpg)
2
String matching with k mismatches
Input: A text T of length n, a pattern P of length m, and a mismatch threshold k.
Output: Every substring S of T of length m such that HD(S,P) ≤ k, where HD denotes the Hamming distance.
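As a baseline for the problem statement, a brute-force sketch can check every alignment directly in O(nm) time. The function names `hamming` and `k_mismatch_naive` are illustrative and do not come from the paper.

```python
def hamming(s, p):
    """Number of positions where the equal-length strings s and p differ."""
    assert len(s) == len(p)
    return sum(1 for a, b in zip(s, p) if a != b)


def k_mismatch_naive(text, pattern, k):
    """Start positions i with HD(text[i:i+m], pattern) <= k, by brute force."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if hamming(text[i:i + m], pattern) <= k]


print(k_mismatch_naive("ABCABDCABB", "ABD", 1))  # [0, 3, 7]
```

The algorithms on the following slides aim to beat this O(nm) bound.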
![Page 3: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/3.jpg)
3
The basic idea of the following algorithms
The authors consider the number of distinct symbols in the pattern and design a separate algorithm for each range of that number.
Example:
P = ACAABD
The number of distinct symbols of P is 4.
![Page 4: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/4.jpg)
4
Three cases of the number of distinct symbols in the pattern
The paper discusses the following three cases; k is the maximum number of mismatches allowed.
1. There are at least 2k distinct symbols.
2. There are fewer than $2\sqrt{k}$ distinct symbols.
3. The number of distinct symbols is between $2\sqrt{k}$ and 2k.
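The case split can be sketched as a small classifier. This is only an illustration: the name `classify_pattern` is hypothetical, and treating a count of exactly $2\sqrt{k}$ as Case 3 is an assumption about the boundary, which the slides do not spell out.

```python
import math


def classify_pattern(pattern, k):
    """Return which of the three cases applies, by distinct-symbol count."""
    distinct = len(set(pattern))
    if distinct >= 2 * k:
        return 1          # at least 2k distinct symbols
    if distinct < 2 * math.sqrt(k):
        return 2          # fewer than 2*sqrt(k) distinct symbols
    return 3              # between 2*sqrt(k) and 2k (boundary: assumption)


print(classify_pattern("ACABDAE", 2))      # 5 distinct symbols, 2k = 4 -> 1
print(classify_pattern("ABCAABBDBAA", 4))  # 4 distinct symbols, 2*sqrt(k) = 4 -> 3
```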
![Page 5: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/5.jpg)
5
Case 1: At least 2k distinct symbols
There are two stages in the algorithm.
1. Marking
Identify potential starts of the pattern and do a crude pruning of the potential candidates.
2. Verification
Verify which of the potential candidates are actual occurrences of the pattern with at most k mismatches.
In this case, the algorithm solves the string matching with k mismatches problem in linear time.
![Page 6: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/6.jpg)
6
The basic idea is as follows:
1) Let A = {a1, a2, …, a2k} be a set of 2k distinct symbols appearing in P.
2) Let P’ be the shortest prefix of P containing every symbol of A.
3) Let the length of P’ be C.
4) Let S be a substring of T of length C.
5) Consider the 2k positions of P’ at which the symbols of A first appear, and suppose S matches P’ at d of these positions.
6) Then, among these 2k positions of P’, there are 2k − d mismatches.
7) If d < k, then 2k − d > k, so we may ignore S entirely.
[Figure: S aligned with P’, both of length C, with d matches and 2k − d mismatches among the 2k marked positions.]
![Page 7: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/7.jpg)
7
But, how can we determine d ?
We may use a position table
![Page 8: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/8.jpg)
8
Marking stage of Case 1
Let {a1, …, a2k} be 2k different alphabet symbols appearing in the pattern, and let ij be the smallest index in the pattern where aj appears, j = 1, …, 2k.
Create a position table: s0 … s2k-1 are the chosen distinct symbols of the pattern P, and pos0 … pos2k-1 are the locations of their first appearance in P.
Example
T = ACBBDACTADIKQDABD….
  = T0 … Tn-1
P = ACABDAE (indices 0 1 2 3 4 5 6)
k = 2

| | s0 | s1 | s2 | s3 |
|---|---|---|---|---|
| symbol | A | C | B | D |
| pos | 0 | 1 | 3 | 4 |
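The position table can be built in one pass over the pattern. The sketch below assumes the 2k symbols are taken in order of first appearance, which is one valid way to pick them; `position_table` is a hypothetical name.

```python
def position_table(pattern, k):
    """Map each of the first 2k distinct symbols to its first index in pattern."""
    table = {}
    for idx, sym in enumerate(pattern):
        if sym not in table:
            table[sym] = idx
            if len(table) == 2 * k:
                break  # 2k symbols collected; the scanned prefix is P'
    return table


print(position_table("ACABDAE", 2))  # {'A': 0, 'C': 1, 'B': 3, 'D': 4}
```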
![Page 9: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/9.jpg)
9
We scan the text T. For each ti, $0 \le i < n$, if there is a j, $0 \le j < 2k$, such that ti = sj, we add 1 to location i − posj of an array X. If i − posj is less than 0, we ignore it. X is an array of size n, and all elements of X are initially 0.
Here s0 … s3 = A, C, B, D are the 2k = 4 chosen symbols of P = ACABDAE, and pos0 … pos3 = 0, 1, 3, 4 are the locations of their first appearance in P.
T = ACBBDACTADIKQDABD….
  = T0 … Tn-1
k = 2
X = 00000000000000000….
![Page 10: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/10.jpg)
10
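After the scan is completed, we obtain the following array for the 17 text symbols shown:

| a | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X(a) | 4 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 0 | 1 | 0 | 0 |

If X(a) = b, then b of the 2k chosen symbols match between T(a, a+c-1) and P(0, c-1), where c is the length of P’ (here c = 5), so there are at least 2k − b mismatches. If b < k, then 2k − b > k, and we may ignore T(a, a+c-1).
Here k = 2, so only locations with X(a) ≥ 2 remain. We need to examine only T(0,4), T(5,9) and T(12,16). We ignore all other substrings.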
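The marking stage as described can be sketched as follows. The name `mark` is hypothetical, and picking the first 2k distinct symbols of the pattern is an assumption. Run on the 17 text symbols shown, the sketch gives counts 4 and 3 at locations 0 and 5, and also a count of 2 at location 12, which comes from the B at index 15 and the D at index 16.

```python
def mark(text, pattern, k):
    """Marking stage for Case 1: returns (X, candidate locations)."""
    # First-occurrence table for the first 2k distinct pattern symbols.
    first = {}
    for idx, sym in enumerate(pattern):
        if sym not in first and len(first) < 2 * k:
            first[sym] = idx
    counts = [0] * len(text)
    for i, sym in enumerate(text):
        if sym in first and i - first[sym] >= 0:
            counts[i - first[sym]] += 1
    # Fewer than k matches among the 2k positions means more than k mismatches.
    candidates = [a for a, b in enumerate(counts) if b >= k]
    return counts, candidates


X, cands = mark("ACBBDACTADIKQDABD", "ACABDAE", 2)
print(X)      # [4, 0, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 2, 0, 1, 0, 0]
print(cands)  # [0, 5, 12]
```

Each text symbol triggers at most one increment, so the scan is O(n).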
![Page 11: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/11.jpg)
11
Lemma 1
For Case 1, let n denote the length of the text and k the maximum number of mismatches allowed. There are at most n/k candidate locations.
Proof:
The total number of additions to the X array is at most n, because the algorithm tests each T(i), i = 1, 2, …, n, once and each test adds at most 1.
Let a be the number of locations whose counters are at least k. Then
$$a \cdot k \le n \quad\Rightarrow\quad a \le \frac{n}{k}.$$
![Page 12: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/12.jpg)
12
By Lemma 1, at most n/k candidate locations remain.
But not every candidate location is the start of a match with at most k mismatches.
P = ACABDAE
T = ACBBDACTADIKQDABD….
X = 4 0 0 0 0 3 0 0 1 1 0 0 2 0 1 0 0 ….
Take T(5) as an example: three of the chosen symbols match, but aligning P = ACABDAE with ACTADIK gives four mismatches, so this candidate location is not the start of a match with at most k = 2 mismatches.
![Page 13: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/13.jpg)
13
We must verify which candidate locations are starting points of matches with at most k mismatches.
![Page 14: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/14.jpg)
14
Verification stage of Case 1
The authors use the Kangaroo Method to verify, in O(k) time, whether a candidate location is the start of a match with at most k mismatches.
T = ABCCABDADBDETADBAADFDAAEERDXTDADCT…
P = ETBDBCCDFDC
We shall not elaborate on this method because it was presented before
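For readers who have not seen it, the jump structure of the Kangaroo Method can be sketched as below. In the actual method each jump is a constant-time longest-common-extension query after preprocessing (typically a suffix tree with lowest-common-ancestor queries). Here a naive scan, `lce_naive`, stands in for that query, so the sketch shows the control flow but does not achieve the O(k) bound. Both function names are hypothetical.

```python
def lce_naive(text, i, pattern, j):
    """Length of the longest common prefix of text[i:] and pattern[j:]."""
    length = 0
    while (i + length < len(text) and j + length < len(pattern)
           and text[i + length] == pattern[j + length]):
        length += 1
    return length


def verify(text, pattern, start, k):
    """True if pattern occurs at text[start:] with at most k mismatches."""
    m = len(pattern)
    if start + m > len(text):
        return False
    j = 0            # current offset into the pattern
    mismatches = 0
    while True:
        j += lce_naive(text, start + j, pattern, j)  # jump over a matching run
        if j >= m:
            return True
        mismatches += 1                               # offset j is a mismatch
        if mismatches > k:
            return False
        j += 1                                        # step past the mismatch


print(verify("ACBBDACTADIKQDABD", "ACABDAE", 0, 2))  # True: 2 mismatches
print(verify("ACBBDACTADIKQDABD", "ACABDAE", 5, 2))  # False: more than 2
```

At most k + 1 jumps are made per candidate, which is where the O(k) verification time comes from once each jump costs O(1).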
![Page 15: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/15.jpg)
15
Time complexity of Case 1
The marking stage takes O(n) time, where n is the length of the text.
According to Lemma 1, there are at most n/k candidate locations.
Using the Kangaroo method, verifying one remaining candidate location takes O(k) time.
Thus, the verification stage takes O(n) time in total.
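The bounds for the two stages combine as follows; this is only a restatement of the lines above, not an additional result.

```latex
T_{\text{Case 1}}(n)
  = \underbrace{O(n)}_{\text{marking}}
  + \underbrace{\frac{n}{k}\cdot O(k)}_{\text{verification}}
  = O(n)
```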
![Page 16: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/16.jpg)
16
Case 2: Fewer than $2\sqrt{k}$ distinct symbols
We can use the Boolean convolution method to solve the problem in this case.
![Page 17: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/17.jpg)
17
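The Hamming distance can be found by convolution.
Let A = abac and B = acdc. In this case, HD(A,B) = 2.
Convolution of A with the reverse of B, where each entry is 1 if the two symbols are equal and 0 otherwise:

          a b a c
          c d c a
          -------
          1 0 1 0
        0 0 0 1
      0 0 0 0
    0 0 0 1
    -------------
    0 0 0 2 0 2 0

The middle entry corresponds to the full alignment of A and B: there are 2 matches, so HD(A,B) = 4 − 2 = 2.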
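The same computation can be sketched per symbol: build a 0/1 indicator vector for the text and for the pattern, correlate them, and sum over the symbols. In this sketch a direct O(nm) correlation stands in for the FFT-based convolution, and `match_counts` is a hypothetical name.

```python
def match_counts(text, pattern):
    """For each alignment i, count positions where text and pattern agree,
    summing one 0/1 correlation per pattern symbol."""
    n, m = len(text), len(pattern)
    total = [0] * (n - m + 1)
    for sym in set(pattern):
        t_bits = [1 if c == sym else 0 for c in text]
        p_bits = [1 if c == sym else 0 for c in pattern]
        # Direct correlation; an FFT would compute the same sums faster.
        for i in range(n - m + 1):
            total[i] += sum(t_bits[i + j] * p_bits[j] for j in range(m))
    return total


matches = match_counts("abac", "acdc")
print(matches)                                # [2]
print([len("acdc") - x for x in matches])     # Hamming distances: [2]
```

The outer loop runs once per distinct pattern symbol, which is why the alphabet size multiplies the cost of a single convolution.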
![Page 18: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/18.jpg)
18
Using the Fast Fourier Transform (FFT), one Boolean convolution can be done in O(n log m) time.
The alphabet size is $O(\sqrt{k})$, and one convolution is needed per symbol.
Thus we take $O(n\sqrt{k}\log m)$ time to solve the problem for Case 2.
![Page 19: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/19.jpg)
19
Case 3: The number of distinct symbols is between $2\sqrt{k}$ and 2k
Definition:
frequent symbol: a symbol that appears in the pattern at least $2\sqrt{k}$ times.
Example
k = 2, $2\sqrt{k} \approx 3$, P = baccdbdd
d is a frequent symbol.
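The definition translates directly into a frequency count. The sketch below uses a hypothetical helper name, `frequent_symbols`, and compares against the real-valued threshold $2\sqrt{k}$ without rounding.

```python
import math
from collections import Counter


def frequent_symbols(pattern, k):
    """Symbols occurring at least 2*sqrt(k) times in pattern, sorted."""
    threshold = 2 * math.sqrt(k)
    return sorted(s for s, c in Counter(pattern).items() if c >= threshold)


print(frequent_symbols("baccdbdd", 2))     # ['d']
print(frequent_symbols("ABCAABBDBAA", 4))  # ['A', 'B']
```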
![Page 20: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/20.jpg)
20
Two Sub-cases of Case 3
Case 3-1: There are at least $\sqrt{k}$ frequent symbols in the pattern.
Case 3-2: There are fewer than $\sqrt{k}$ frequent symbols in the pattern.
![Page 21: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/21.jpg)
21
Case 3-1: At least $\sqrt{k}$ frequent symbols
There are two stages in the algorithm for this case.
(1) Marking stage
Identify potential starts of the pattern and do a crude pruning of the potential candidates.
(2) Verification stage
Verify which of the potential candidates are actual occurrences of the pattern.
The verification stage is done by the Kangaroo Method.
![Page 22: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/22.jpg)
22
Marking stage of Case 3-1
We arbitrarily pick $\sqrt{k}$ frequent symbols and convert the problem to a mismatch problem with “don’t care” symbols: $2\sqrt{k}$ occurrences of each chosen symbol are kept in the pattern, and every other pattern position becomes Φ.
Example
Let P = ABCAABBDBAA and k = 4.
There are 4 distinct symbols in P (4 is between $2\sqrt{k} = 4$ and 2k = 8), and ‘A’ and ‘B’ are frequent symbols. There are 2 (= $\sqrt{k}$) frequent symbols.
T = ABCABDCABBCFADDABC
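The conversion of the pattern can be sketched as below. Keeping the first $\lceil 2\sqrt{k}\rceil$ occurrences of each chosen symbol is an assumption made for illustration; it is consistent with the masked pattern shown on the slides that follow, in which the fifth A is also Φ. The name `mask_frequent` is hypothetical.

```python
import math


def mask_frequent(pattern, chosen, k, wildcard="Φ"):
    """Keep the first ceil(2*sqrt(k)) occurrences of each chosen symbol;
    every other pattern position becomes the wildcard."""
    quota = math.ceil(2 * math.sqrt(k))
    seen = {s: 0 for s in chosen}
    out = []
    for c in pattern:
        if c in seen and seen[c] < quota:
            seen[c] += 1
            out.append(c)
        else:
            out.append(wildcard)
    return "".join(out)


print(mask_frequent("ABCAABBDBAA", "AB", 4))  # ABΦAABBΦBAΦ
```

With $\sqrt{k}$ chosen symbols and $2\sqrt{k}$ kept occurrences each, 2k pattern positions remain that are not Φ.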
![Page 23: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/23.jpg)
23
Mismatch problem with “don’t care”
Input: A text T of length n and a pattern P of length m, in which g characters are not “don’t care” symbols and the rest are Φ (“don’t care”).
Output: The number of mismatches between the pattern and each substring of T of length m. Only mismatches at the g non-Φ pattern characters are counted.
T = A B C A B D C A B B C F A D D A B D
P = A B Φ A A B B Φ B A Φ
The number of mismatches at the first four alignments:

| start position | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| mismatches | 4 | 7 | 7 | 2 |
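The mismatch counts for this example can be reproduced with a direct scan. This sketch is a naive O(nm) check of the definition, not the faster algorithm cited on the next slide, and `mismatches_with_dont_care` is a hypothetical name.

```python
def mismatches_with_dont_care(text, pattern, wildcard="Φ"):
    """Mismatch count at every alignment, ignoring wildcard pattern positions."""
    n, m = len(text), len(pattern)
    return [sum(1 for j in range(m)
                if pattern[j] != wildcard and text[i + j] != pattern[j])
            for i in range(n - m + 1)]


counts = mismatches_with_dont_care("ABCABDCABBCFADDABD", "ABΦAABBΦBAΦ")
print(counts)                                        # [4, 7, 7, 2, 6, 8, 7, 6]
print([i for i, c in enumerate(counts) if c <= 4])   # k = 4: [0, 3]
```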
![Page 27: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/27.jpg)
27
The mismatch problem with “don’t care” can be solved in $O(n\sqrt{g}\log m)$ time (Amir et al., 1997), where n is the length of the text T, m is the length of the pattern P, and g is the number of characters in the pattern that are not “don’t care” symbols.
![Page 28: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/28.jpg)
28
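All locations with at most k mismatches of frequent symbols are our candidate locations, where matches with at most k mismatches may start.
Example
k = 4
T = A B C A B D C A B B C F A D D A B D
P = A B Φ A A B B Φ B A Φ
The number of mismatches at each alignment:

| start position | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| mismatches | 4 | 7 | 7 | 2 | 6 | 8 | 7 | 6 |

Locations 0 and 3 have at most k = 4 mismatches, so they are the candidate locations.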
![Page 29: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/29.jpg)
29
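Lemma 2 for Case 3-1
Let $\{a_1, \ldots, a_{\sqrt{k}}\}$ be frequent symbols. Then there exist in the text at most $2n/\sqrt{k}$ locations where there is a pattern occurrence with no more than k errors.
Proof:
Call a match between a text symbol and one of the kept pattern positions a mark. Each chosen symbol keeps $2\sqrt{k}$ positions in the pattern, so each T(i), i = 1, 2, …, n, produces at most $2\sqrt{k}$ marks, and the total number of marks is at most $2\sqrt{k}\,n$.
A location with at most k mismatches among the 2k kept positions has at least k marks. Let a be the number of such locations. Then
$$a \cdot k \le 2\sqrt{k}\,n \quad\Rightarrow\quad a \le \frac{2\sqrt{k}\,n}{k} = \frac{2n}{\sqrt{k}}.$$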
![Page 30: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/30.jpg)
30
We convert the marking stage to a mismatch problem with “don’t care”, which takes $O(n\sqrt{k}\log m)$ time to solve.
According to Lemma 2 for Case 3-1, there are at most $2n/\sqrt{k}$ candidate locations, and we take O(k) time to verify one candidate location.
The verification stage for Case 3-1 takes $O(n\sqrt{k})$ time.
![Page 31: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/31.jpg)
31
Case 3-2: Fewer than $\sqrt{k}$ frequent symbols
First, we convert all frequent symbols in the pattern to Φ (“don’t care” symbols), so that the mismatches of the remaining symbols can be counted separately.
Example
Let P = ABCAABGDBAA and k = 5.
There are 5 distinct symbols in P (5 is between $2\sqrt{k} \approx 4.47$ and 2k = 10), and ‘A’ is the only frequent symbol. There is 1 (< $\sqrt{k}$) frequent symbol.
T = ABCABDCABBCFADDABC
![Page 32: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/32.jpg)
32
Two cases are discussed after we convert all frequent symbols to Φ.
3-2-1: There are fewer than 2k remaining symbols.
3-2-2: There are at least 2k remaining symbols.
![Page 33: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/33.jpg)
33
Case 3-2-1: There are fewer than 2k remaining symbols
There are fewer than 2k remaining symbols, and the rest are “don’t care” symbols. Counting the mismatches of the remaining symbols can be solved as a mismatch problem with “don’t care” and takes $O(n\sqrt{k}\log m)$ time.
T = A B C A B D C A B B C F A D D A B C
P’ = Φ B C Φ Φ B G D B Φ Φ
Mismatches of remaining symbols at the first five alignments:

| start position | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| mismatches | 3 | 5 | 6 | 4 | 4 |
![Page 38: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/38.jpg)
38
All locations at which the mismatches of the frequent symbols and the mismatches of the remaining symbols together total at most k are the matches we want.
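The split into frequent-symbol mismatches and remaining-symbol mismatches can be illustrated with a direct scan. This is a naive O(nm) sketch of the bookkeeping only, using the Case 3-2 example (P = ABCAABGDBAA, frequent symbol A); the faster algorithm obtains the two counts by convolution and by the “don’t care” method. `split_mismatches` is a hypothetical name.

```python
def split_mismatches(text, pattern, frequent):
    """Per alignment: (mismatches at frequent-symbol pattern positions,
    mismatches at the remaining pattern positions)."""
    n, m = len(text), len(pattern)
    result = []
    for i in range(n - m + 1):
        freq = rest = 0
        for j in range(m):
            if text[i + j] != pattern[j]:
                if pattern[j] in frequent:
                    freq += 1
                else:
                    rest += 1
        result.append((freq, rest))
    return result


pairs = split_mismatches("ABCABDCABBCFADDABC", "ABCAABGDBAA", {"A"})
print([rest for _, rest in pairs][:5])   # [3, 5, 6, 4, 4]
print(pairs[0], pairs[3])                # (3, 3) (2, 4)
```

The two counts at a location add up to its Hamming distance, so a location is reported when their sum is at most k.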
![Page 39: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/39.jpg)
39
• Conclusion:
The problem for Case 3-2-1 can be solved in $O(n\sqrt{k}\log m)$ time.
![Page 40: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/40.jpg)
40
Case3-2-2 There are at least 2k remaining symbols
There are two stages in algorithm for this case.
(1)Marking stage
Identify potential starts of the pattern and do a crude pruning of the potential candidates.
(2)Verification stage
Verify which of the potential candidates are actual occurrences of the pattern.
The verification stage is done by the Kangaroo Method.
![Page 41: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/41.jpg)
41
Marking stage of Case 3-2-2
We arbitrarily pick 2k remaining symbols and convert all symbols to Φ (“don’t care” symbols) except the 2k remaining symbols that we picked.
The marking stage of Case 3-2-2 can be solved as a mismatch problem with “don’t care” in $O(n\sqrt{k}\log m)$ time.
![Page 42: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/42.jpg)
42
• Conclusion:
The problem for Case 3-2-2 can be solved in $O(n\sqrt{k}\log m)$ time.
![Page 43: Faster algorithms for string matching with k mismatches](https://reader036.vdocuments.mx/reader036/viewer/2022062500/56814f3c550346895dbce2be/html5/thumbnails/43.jpg)
43
Thank you