A simple fast hybrid pattern-matching algorithm
Department of Computer Science and Information Engineering National Cheng Kung University, Taiwan R.O.C.
Authors: Frantisek Franek, Christopher G. Jennings, W. F. Smyth
Publisher: Journal of Discrete Algorithms, 2007
Presenter: Chung-Chan Wu
Date: December 11, 2007
Outline
• Introduction
• Algorithm Description
  • KMP (Knuth-Morris-Pratt)
  • Boyer-Moore
  • Sunday shift
• The Hybrid Algorithm (FJS)
• Extension
• Experimental Results
• Conclusions
Introduction
This work makes the following contribution:
• In an effort to reduce processing time, we propose a mixture of Sunday’s variant of BM with KMP.
• Our goal is to combine the best/average-case advantages of Sunday’s algorithm (BMS) with the worst-case guarantees of KMP.
According to the experiments we have conducted, our new algorithm (FJS) is among the fastest in practice for computing all occurrences of a pattern p = p[1..m] in a text string x = x[1..n] on an alphabet Σ of size k.
KMP (Knuth-Morris-Pratt)
Main features:
• Performs the comparisons from left to right.
• Preprocessing phase: O(m) space and time.
• Searching phase: O(m+n) time.
• A precomputed table, the π-table, gives the pattern position to fall back to after a mismatch.
• The π value avoids another immediate mismatch: the character following the chosen prefix of the pattern must differ from the pattern character that just mismatched.
• Its worst-case running time is linear, among the best for software algorithms.
KMP (Knuth-Morris-Pratt)
index       0  1  2  3  4  5  6  7
pattern[i]  G  C  A  G  A  G  A  G
π-value    -1  0  0 -1  1 -1  1 -1

Input string: GCATCGCAGAGAGTATACAGTACG
Pattern:      GCAGAGAG
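The π-table and the left-to-right scan can be sketched in Python. This is a minimal illustration, not the presenter's code: the names `kmp_next` and `kmp_search` are hypothetical, indices are 0-based as in the table above, and the table carries one extra entry at index m that is consulted after a full match.

```python
def kmp_next(p):
    """Strong KMP failure table of length m + 1, with nxt[0] = -1."""
    m = len(p)
    nxt = [0] * (m + 1)
    nxt[0] = -1
    i, j = 0, -1
    while i < m:
        # Fall back until p[i] extends the current border of length j.
        while j > -1 and p[i] != p[j]:
            j = nxt[j]
        i += 1
        j += 1
        # If the border is followed by the same character, reusing it
        # would mismatch again at once, so inherit its fallback instead.
        if i < m and p[i] == p[j]:
            nxt[i] = nxt[j]
        else:
            nxt[i] = j
    return nxt


def kmp_search(p, x):
    """Return the 0-based start positions of all occurrences of p in x."""
    m, n = len(p), len(x)
    if m == 0:
        return []
    nxt = kmp_next(p)
    hits = []
    i = j = 0
    while i < n:
        while j > -1 and x[i] != p[j]:
            j = nxt[j]
        i += 1
        j += 1
        if j == m:
            hits.append(i - m)
            j = nxt[m]  # keep the longest border and continue scanning
    return hits


if __name__ == "__main__":
    print(kmp_next("GCAGAGAG")[:8])
    print(kmp_search("GCAGAGAG", "GCATCGCAGAGAGTATACAGTACG"))
```

The first eight entries of the computed table can be compared against the π-value row of the slide.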
Boyer-Moore
Main features:
• Performs the comparisons from right to left.
• Preprocessing phase: O(m + k) time and space, where k is the alphabet size.
• Searching phase: O(mn) time in the worst case.
• Two precomputed tables, δ1 (bad-character shift) and δ2 (good-suffix shift).
• Performs well in the best and average cases.
BM - Observation 1
If char is known not to occur in pattern, then no occurrence of the pattern can overlap char, so the pattern can slide completely past it.
[Figure: bad-character shift. The input string holds b followed by the matched suffix u; the pattern holds a followed by u. The pattern contains no b, so δ1 is used.]
BM - Observation 2
If the rightmost occurrence of char in pattern is δ1 characters from the right end of pattern, then we know we can slide pattern down δ1 positions without checking for matches.
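[Figure: bad-character shift. The input string holds b followed by the matched suffix u; the pattern holds a followed by u. b occurs further left in the pattern, and the part of the pattern to the right of that occurrence contains no b, so δ1 is used.]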
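The bad-character rule of Observations 1 and 2 can be sketched as a lookup table. This is an illustrative sketch, not the presenter's code: `bad_character_table` and `bad_character_shift` are hypothetical names, and the last pattern position is left out of the table, an assumption chosen because it matches the δ1 values this deck lists for the pattern EXAMPLE.

```python
def bad_character_table(p):
    """Map each character to its distance from the right end of p.

    Only positions 0..m-2 are recorded (assumption: the last position
    is excluded so that every recorded shift is at least 1).
    """
    m = len(p)
    table = {}
    for i in range(m - 1):
        table[p[i]] = m - 1 - i  # later (rightmost) occurrences overwrite
    return table


def bad_character_shift(table, c, m):
    """Observation 1: a character absent from the pattern shifts by m."""
    return table.get(c, m)


if __name__ == "__main__":
    t = bad_character_table("EXAMPLE")
    print(sorted(t.items()))
    print(bad_character_shift(t, "S", len("EXAMPLE")))
```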
BM - Observation 3(a)
The good-suffix shift consists of aligning the matched segment y[i+j+1 … j+m−1] = x[i+1 … m−1] with its rightmost occurrence in x that is preceded by a character different from x[i]. (In this definition y is the text and x is the pattern.)
[Figure: good-suffix shift. The input string holds b followed by u; the pattern holds a followed by u. u reoccurs in the pattern preceded by a character c ≠ a, so δ2 is used.]
BM - Observation 3(b)
If no such segment exists, the shift consists of aligning the longest suffix v of y[i+j+1 … j+m−1] with a matching prefix of x.
[Figure: good-suffix shift. Only a suffix v of u reoccurs in the pattern, as a prefix, so δ2 is used.]
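Cases 3(a) and 3(b) can be combined into one table. The sketch below is an assumption-laden illustration rather than the presenter's code: `good_suffix_shifts` and `delta2` are hypothetical names, the suffix lengths are computed naively in O(m²) for clarity, and `delta2` adds the matched length m − 1 − j to each pattern shift, a convention assumed here because it reproduces the δ2 values this deck lists for the pattern EXAMPLE.

```python
def good_suffix_shifts(p):
    """gs[j] = pattern shift after a mismatch at p[j] (0-based)."""
    m = len(p)
    # suff[i] = length of the longest common suffix of p[:i+1] and p.
    suff = [0] * m
    for i in range(m):
        k = 0
        while k <= i and p[i - k] == p[m - 1 - k]:
            k += 1
        suff[i] = k
    gs = [m] * m
    # Case 3(b): a suffix of the matched part is also a prefix of p.
    j = 0
    for i in range(m - 1, -1, -1):
        if suff[i] == i + 1:
            while j < m - 1 - i:
                if gs[j] == m:
                    gs[j] = m - 1 - i
                j += 1
    # Case 3(a): the matched suffix reoccurs inside p, preceded by a
    # different character.
    for i in range(m - 1):
        gs[m - 1 - suff[i]] = m - 1 - i
    return gs


def delta2(p):
    """Pattern shift plus the number of characters already matched."""
    m = len(p)
    gs = good_suffix_shifts(p)
    return [gs[j] + (m - 1 - j) for j in range(m)]


if __name__ == "__main__":
    print(delta2("EXAMPLE"))
```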
Boyer-Moore Example
δ1     A  E  L  M  P  X  rest
shift  4  6  1  3  2  5  7

δ2     E  X  A  M  P  L  E
shift 12 11 10  9  8  7  1

Input string: HERE IS A SIMPLE EXAMPLE
Pattern:      EXAMPLE
Sunday Shift
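Boyer-Moore bad-character table for the pattern EXAMPLE:
δ1     A  E  L  M  P  X  rest
shift  4  6  1  3  2  5  7

Sunday shift table for the same pattern, keyed on the text character immediately to the right of the current window:
δ1     A  E  L  M  P  X  rest
shift  5  1  2  4  3  6  8

[Figure: the pattern "example" aligned under the input string, shown for a Boyer-Moore shift and for a Sunday shift.]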
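A minimal sketch of the Sunday shift, assuming 0-based indexing and a dictionary in place of an alphabet-sized array; `sunday_table` and `sunday_search` are hypothetical names. Each character maps to m minus the index of its rightmost occurrence, and any character absent from the pattern shifts by m + 1.

```python
def sunday_table(p):
    """Shift for the text character just past the current window."""
    m = len(p)
    # Later occurrences overwrite earlier ones, so the rightmost wins.
    return {c: m - i for i, c in enumerate(p)}


def sunday_search(p, x):
    """Return the 0-based start positions of all occurrences of p in x."""
    m, n = len(p), len(x)
    if m == 0:
        return []
    shift = sunday_table(p)
    hits = []
    i = 0
    while i + m <= n:
        if x[i:i + m] == p:
            hits.append(i)
        if i + m >= n:  # no character to the right of the window
            break
        i += shift.get(x[i + m], m + 1)
    return hits


if __name__ == "__main__":
    print(sorted(sunday_table("EXAMPLE").items()))
    print(sunday_search("EXAMPLE", "HERE IS A SIMPLE EXAMPLE"))
```

Because the shift depends only on the character after the window, the window itself may be compared in any order, which is what lets FJS check the last pattern position first.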
FJS Algorithm
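Definitions:
• Search for p = p[1..m] in x = x[1..n] by shifting p from left to right along x.
• At the start of each step, position j = 1 of p is aligned with a position i ∈ 1..n − m + 1 in x.
• Partial match: if a mismatch occurs at j > 1, we say that a partial match has been determined with p[1..j − 1].
• Position m of p is aligned with position i’ = i + m − j in x.
[Figure: the pattern aligned under the input string, with text positions i and i’ marked above pattern positions j and m.]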
FJS Algorithm
Strategy:
• Whenever no partial match of p with x[i..i + m − 1] has been found, Sunday shifts are performed to determine the next position i’ at which x[i’] = p[m]. When such an i’ has been found, KMP matching is then performed on p[1..m − 1] and x[i’ − m + 1..i’ − 1].
• If a partial match of p with x has been found, KMP matching is continued on p[1..m].
• Once a suitable i’ has been found, the first half of FJS just performs KMP matching in a different order: position m of p is compared first, followed by 1, 2, …, m − 1.
FJS Algorithm
Pre-processing:
• Sunday’s array Δ = Δ[1..k], computable in O(m + k) time.
• KMP array β’ = β’[1..m+1], computable in O(m) time.
FJS Algorithm
index       0  1  2  3  4  5  6  7
pattern[i]  G  C  A  G  A  G  A  G
π-value    -1  0  0 -1  1 -1  1 -1

δ1     A  C  G  rest
shift  2  7  1  9

Input string: GCATCGCAGAGAGTATACAGTACG
Pattern:      GCAGAGAG
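The strategy from the preceding slides can be sketched end to end in Python. This is a hedged reconstruction, not the authors' reference implementation: it uses 0-based indices instead of the 1-based indices of the slides, a dictionary instead of an alphabet-sized array for Δ, explicit bounds checks where an array-based version might rely on a sentinel, and the hypothetical names `kmp_next`, `sunday_table` and `fjs_search`.

```python
def kmp_next(p):
    """Strong KMP table (the β' array), length m + 1, nxt[0] = -1."""
    m = len(p)
    nxt = [0] * (m + 1)
    nxt[0] = -1
    i, j = 0, -1
    while i < m:
        while j > -1 and p[i] != p[j]:
            j = nxt[j]
        i += 1
        j += 1
        nxt[i] = nxt[j] if i < m and p[i] == p[j] else j
    return nxt


def sunday_table(p):
    """Sunday's Δ: m minus the index of each character's last occurrence."""
    m = len(p)
    return {c: m - i for i, c in enumerate(p)}


def fjs_search(p, x):
    """Return the 0-based start positions of all occurrences of p in x."""
    m, n = len(p), len(x)
    if m < 1 or n < m:
        return []
    beta, delta = kmp_next(p), sunday_table(p)
    hits = []
    mp = m - 1          # index of the last pattern character
    i = j = 0           # x[i] is compared with p[j]
    ip = mp             # text index aligned with p[mp]
    while ip < n:
        if j <= 0:
            # No partial match: Sunday shifts until x[ip] == p[mp].
            while x[ip] != p[mp]:
                if ip + 1 >= n:
                    return hits
                ip += delta.get(x[ip + 1], m + 1)
                if ip >= n:
                    return hits
            # The last character matches; KMP-match the first m - 1.
            j, i = 0, ip - mp
            while j < mp and x[i] == p[j]:
                i += 1
                j += 1
            if j == mp:
                hits.append(i - mp)
                i += 1
                j += 1
            if j <= 0:
                i += 1
            else:
                j = beta[j]
        else:
            # Partial match: continue plain KMP matching on all of p.
            while j < m and x[i] == p[j]:
                i += 1
                j += 1
            if j == m:
                hits.append(i - m)
            j = beta[j]
        ip = i + mp - j  # restore the alignment invariant
    return hits


if __name__ == "__main__":
    print(fjs_search("GCAGAGAG", "GCATCGCAGAGAGTATACAGTACG"))
```

On the slide's example the Sunday table holds the shifts 2, 7 and 1 for A, C and G, with every other character shifting by 9.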
Extension
The alphabet-based preprocessing arrays of BM-type algorithms are their most useful feature, but they can be a source of trouble as well.
The ASCII alphabet:
• Text characters were usually 8 bits or fewer.
• The preprocessing time can be regarded as negligible.
The natural language text:
• Wide characters.
• DNA data: {A, T, C, G} is mapped to {00, 01, 10, 11}, with alphabet sizes varying by powers of 2 from 2 to 64.
• An example DNA: ACTG, (2²)⁴
• The preprocessing time is a bottleneck.
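The 2-bit DNA mapping can be illustrated with a small packing routine. This sketch only shows how grouping bases enlarges the effective alphabet, and with it any alphabet-indexed preprocessing array; the names `CODE` and `pack_bases` and the choice of four bases per symbol are assumptions made for illustration, not details taken from the slides.

```python
# 2-bit codes as listed on the slide: A=00, T=01, C=10, G=11.
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}


def pack_bases(seq, group=4):
    """Pack each run of `group` bases into one integer symbol.

    The resulting alphabet has 4 ** group possible symbols.
    """
    if len(seq) % group:
        raise ValueError("sequence length must be a multiple of group")
    symbols = []
    for start in range(0, len(seq), group):
        value = 0
        for base in seq[start:start + group]:
            value = (value << 2) | CODE[base]
        symbols.append(value)
    return symbols


if __name__ == "__main__":
    print(pack_bases("ACTG"))  # one symbol out of 4 ** 4 possible values
```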
Environment
Experiment – Frequency
These patterns occur 3,366,899 times
Experiment – Pattern Length
Experiment – Alphabet Size
Experiment – Pathological Cases
Conclusion
We have tested FJS against four high-profile competitors (BMH, BMS, RC, TBM) over a range of contexts:
• pattern frequency (C1), pattern length (C2), alphabet size (C3), and pathological cases (C4).
FJS was uniformly superior to its competitors, with an advantage of up to 10% over BMS and RC.
For FJS, the pathological cases (C4) are those in which the KMP part is forced to execute on prefixes of patterns where KMP provides no advantage.
We presented a hybrid exact pattern-matching algorithm, FJS, which combines the benefits of KMP and BMS. It requires O(m + k) time and space for preprocessing.