A Unifying Framework for Compressed Pattern Matching
Takuya Kida, Masayuki Takeda, Ayumi Shinohara,
Yusuke Shibata, Setsuo Arikawa
Department of Informatics, Kyushu University, Japan
Contents
• Pattern matching and compressed pattern matching
• Previous results
• Collage system
• Proposed algorithm
• Conclusion
Pattern Matching Problem
We introduce a general framework that captures the essence of compressed pattern matching for various dictionary based compression methods. The goal is to find all occurrences of a pattern in a text without decompressing it, which is one of the most active topics in string matching. Our framework covers compression methods such as the Lempel-Ziv family (LZ77, LZSS, LZ78, LZW), byte-pair encoding, and the static dictionary based method. Technically, our pattern matching algorithm substantially extends the one for LZW compressed text presented by Amir, Benson and Farach.
text:=
pattern:= compress
Compressed Pattern Matching
[Figure: in the conventional approach, the compressed text is decompressed into the original text, which is then scanned by a pattern matching machine. Compressed pattern matching scans the compressed text directly with a new machine.]
Previous Results (1)
year  researcher                          compression
1988  Eilam-Tzoreff and Vishkin           run-length
1992  Amir, Landau, and Vishkin           two-dimensional run-length
1992  Amir and Benson                     two-dimensional run-length
1994  Amir, Benson, and Farach            two-dimensional run-length
1994  Manber                              original compression scheme
1995  Farach and Thorup                   LZ77
1996  Amir, Benson, and Farach            LZW
1996  Gasieniec, et al.                   LZ77
1997  Karpinski, Rytter, and Shinohara    straight-line programs
1997  Miyazaki, Shinohara, and Takeda     straight-line programs
1997  Takeda                              finite state encoding
1998  Shibata                             byte pair encoding
1998  Fukamachi, Shinohara, and Takeda    Huffman encoding
1998  Kida, et al.                        LZW
Previous Results (2)
year  researcher                                    compression
1998  de Moura, Navarro, Ziviani, and Baeza-Yates   word based encoding (faster than Agrep!)
1999  Shibata, Takeda, Shinohara, and Arikawa       antidictionaries
1999  Kida, Takeda, Shinohara, and Arikawa          LZW
1999  Shibata, et al.                               byte pair encoding
1999  Navarro and Raffinot                          LZ family
1999  Kida, et al.                                  dictionary based methods (collage system) (today's talk)
Motivation
Previous:
  Compression A → PM Algorithm A
  Compression B → PM Algorithm B
  Compression C → PM Algorithm C
Ours: a general pattern matching algorithm on the unifying framework
  Compression A, Compression B, Compression C → Collage system
Collage System
Definition and Several Examples
Dictionary Based Compression
[Figure: the original text is factorized into a series of phrases; the phrases are kept in a dictionary structure and encoded to produce the compressed text.]
• How to choose the phrases.
• How to design the data structure of the dictionary.
• How to encode phrases.
Definition of Collage System
A collage system is a pair 〈D, S〉.
D: a sequence of assignments (dictionary structure)
   X_1 = expr_1; X_2 = expr_2; ...; X_n = expr_n;
S: a sequence of variables defined in D (compressed text)
   S = X_{i_1}, X_{i_2}, ..., X_{i_l} (each X_{i_k} is defined in D)
||D|| = n: number of assignments in D
|S| = l: number of variables in S
Definition of Collage System
D: a sequence of assignments (dictionary structure)
   X_1 = expr_1; X_2 = expr_2; ...; X_n = expr_n;
where each expr_k is one of:
• a, for a ∈ Σ ∪ {ε} (primitive assignment)
• X_i · X_j, for i, j < k (concatenation)
• [j]X_i, for i < k and an integer j (prefix truncation)
• X_i[j], for i < k and an integer j (suffix truncation)
• (X_i)^j, for i < k and an integer j (j times repetition)
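The five kinds of assignment can be read as a tiny expression language. The following is a minimal sketch of a decoder for it, not part of the talk. The tuple encoding and the names `expand` and `decode` are hypothetical, and the truncation semantics (dropping j characters from the front or the back) is an assumption chosen because it reproduces the text of the worked example in these slides.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding of one assignment X_k = expr_k:
#   ("prim", a)       a single character or the empty string
#   ("cat", i, j)     X_i · X_j
#   ("rep", i, j)     (X_i)^j
#   ("ptrunc", i, j)  [j]X_i, assumed to drop the first j characters
#   ("strunc", i, j)  X_i[j], assumed to drop the last j characters
Expr = Tuple


def expand(D: Dict[int, Expr], k: int) -> str:
    """Return the string that variable X_k stands for."""
    op = D[k]
    kind = op[0]
    if kind == "prim":
        return op[1]
    if kind == "cat":
        return expand(D, op[1]) + expand(D, op[2])
    if kind == "rep":
        return expand(D, op[1]) * op[2]
    if kind == "ptrunc":
        return expand(D, op[1])[op[2]:]
    if kind == "strunc":
        s = expand(D, op[1])
        return s[: len(s) - op[2]]
    raise ValueError(f"unknown assignment kind: {kind!r}")


def decode(D: Dict[int, Expr], S: List[int]) -> str:
    """Decompress the collage system <D, S> by expanding each variable of S."""
    return "".join(expand(D, k) for k in S)
```

This decoder expands everything explicitly, which is exactly what compressed pattern matching tries to avoid; it only serves to pin down what a collage system denotes.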
Example of Collage System
D: X_1 = a;
   X_2 = b;
   X_3 = X_1 · X_2;
   X_4 = X_2 · X_1;
   X_5 = (X_3)^3;
   X_6 = [3]X_5;
   X_7 = X_6 · X_4;
S: X_3, X_6, X_4, X_7
Text: abbabbababba
[Figure: derivation tree T(X_7). The root X_7 has children X_6 and X_4. X_6 is the prefix truncation [3]X_5, X_5 is the 3 times repetition of X_3, X_3 = X_1 · X_2 and X_4 = X_2 · X_1, so X_7 = [3]((a b)^3) (b a).]
height(X_7) = 4
height(D) = 4
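The heights quoted above can be checked mechanically. This is a small sketch under two assumptions that the slides do not state: a primitive assignment has height 0, and every other assignment is one level higher than its tallest operand. With that convention the values match the slide. The tuple encoding is hypothetical.

```python
from functools import lru_cache

# The example dictionary, as hypothetical (operation, operands...) tuples.
D = {
    1: ("prim", "a"),
    2: ("prim", "b"),
    3: ("cat", 1, 2),      # X_3 = X_1 · X_2
    4: ("cat", 2, 1),      # X_4 = X_2 · X_1
    5: ("rep", 3, 3),      # X_5 = (X_3)^3
    6: ("ptrunc", 5, 3),   # X_6 = [3]X_5
    7: ("cat", 6, 4),      # X_7 = X_6 · X_4
}


@lru_cache(maxsize=None)
def height(k: int) -> int:
    """Height of the derivation tree of X_k (primitives assumed to be 0)."""
    op = D[k]
    if op[0] == "prim":
        return 0
    if op[0] == "cat":
        return 1 + max(height(op[1]), height(op[2]))
    # Repetition and truncation have a single variable operand.
    return 1 + height(op[1])


def dictionary_height() -> int:
    """height(D): the largest height over all variables of D."""
    return max(height(k) for k in D)
```

Here `height(7)` and `dictionary_height()` both evaluate to 4, since the longest chain is X_7, X_6, X_5, X_3, X_1.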
Example of Collage System: Byte Pair Encoding (BPE)
Original text:           a b a b c b a b c c a b c a c b
After replacing ab by D: D D c b D c c D c a c b
After replacing Dc by E: D E b E c E a c b
Substitution table: ab → D, Dc → E
D: X_1 = a;
   X_2 = b;
   X_3 = c;
   X_4 = X_1 · X_2;
   X_5 = X_4 · X_3;
S: X_4, X_5, X_2, X_5, X_3, X_5, X_1, X_3, X_2
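The BPE example can be replayed with a short sketch. It assumes a simplified BPE that, in each round, replaces the most frequent adjacent pair with a fresh symbol, and it stops when no pair occurs twice. The function names `bpe` and `to_collage`, the tuple encoding, and the numbering of variables (alphabet first, then one variable per substitution) are illustrative choices, not the authors' implementation.

```python
from collections import Counter
from typing import List, Tuple

Rule = Tuple[str, str, str]  # (new symbol, left symbol, right symbol)


def bpe(text: str, new_symbols: str) -> Tuple[List[str], List[Rule]]:
    """Simplified BPE: one substitution round per available new symbol."""
    seq = list(text)
    rules: List[Rule] = []
    for sym in new_symbols:
        # Pair counts include overlapping occurrences (a simplification).
        counts = Counter(zip(seq, seq[1:]))
        if not counts:
            break
        (left, right), freq = counts.most_common(1)[0]
        if freq < 2:
            break
        out, i = [], 0
        while i < len(seq):  # replace left to right, without overlaps
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        rules.append((sym, left, right))
    return seq, rules


def to_collage(seq: List[str], rules: List[Rule], alphabet: str):
    """Turn BPE output into a collage system <D, S> with 1-based variables."""
    index, D = {}, []
    for a in alphabet:                      # primitive assignments
        D.append(("prim", a))
        index[a] = len(D)
    for sym, left, right in rules:          # one concatenation per rule
        D.append(("cat", index[left], index[right]))
        index[sym] = len(D)
    S = [index[x] for x in seq]
    return D, S
```

For the text `ababcbabccabcacb` with new symbols `DE` and alphabet `abc`, this produces the compressed sequence D E b E c E a c b and a collage system that uses only primitive assignments and concatenation.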
Example of Collage System (LZSS [gzip])
D: X_1 = a_1; X_2 = a_2; ...; X_q = a_q;
   X_{q+1} = (([i_1]X_{l(1)} · X_{l(1)+1} ··· X_{r(1)})^{m_1})[j_1] b_1;
   X_{q+2} = (([i_2]X_{l(2)} · X_{l(2)+1} ··· X_{r(2)})^{m_2})[j_2] b_2;
   ...
   X_{q+n} = (([i_n]X_{l(n)} · X_{l(n)+1} ··· X_{r(n)})^{m_n})[j_n] b_n;
S: X_{q+1}, X_{q+2}, ..., X_{q+n}
What is ‘Collage’?
This is college!
Collage is ...
an artistic composition technique.
1. Cut or tear up materials.
2. Paste the pieces over a surface.
Our Algorithm
Pattern Matching Algorithm on a Collage System
Compressed pattern matching on a collage system
The problem of compressed pattern matching can be solved in
O((||D|| + |S|) · height(D) + m^2 + r) time using O(||D|| + m^2) space.
If D contains no truncation, it can be solved in O(||D|| + |S| + m^2 + r) time, that is, O(compressed text length + m^2 + r).
m: pattern length
r: number of pattern occurrences
||D||: number of assignments in D
|S|: number of variables in S
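Basic Idea
Pattern π = ababb
[Figure: KMP automaton for π with states 0 to 5. The goto function spells a, b, a, b, b along the states; the failure function links states back.]
original text: abababba
state: 0 1 2 3 4 3 4 5 1 (the initial state, then one state per character read)
S: X_{i_1} X_{i_2} X_{i_3} X_{i_4}
state after each variable: 1 2 4 1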
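The state sequence for this pattern and text can be reproduced with a standard Knuth-Morris-Pratt automaton. The sketch below is a textbook formulation, not code from the talk; the function names are illustrative and a non-empty pattern is assumed.

```python
from typing import List


def failure_function(pattern: str) -> List[int]:
    """fail[q] = length of the longest proper border of pattern[:q]."""
    m = len(pattern)
    fail = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = fail[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        fail[q] = k
    return fail


def step(pattern: str, fail: List[int], q: int, c: str) -> int:
    """One transition of the KMP automaton from state q on character c."""
    m = len(pattern)
    # Follow failure links until c can extend the current prefix.
    while q > 0 and (q == m or pattern[q] != c):
        q = fail[q]
    if pattern[q] == c:
        q += 1
    return q


def state_sequence(pattern: str, text: str) -> List[int]:
    """States visited while reading text, starting from state 0."""
    fail = failure_function(pattern)
    states = [0]
    for c in text:
        states.append(step(pattern, fail, states[-1], c))
    return states
```

For the pattern `ababb` and the text `abababba`, `state_sequence` returns `[0, 1, 2, 3, 4, 3, 4, 5, 1]`; state 5 marks the occurrence ending at position 7. The idea of the talk is to take these transitions one variable at a time instead of one character at a time.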
Jump and Output
The function Jump(j, u) = δ_KMP(j, u)
• The domain is Q × D.
• It simulates the sequence of state transitions for u.
• Reply in O(1) time.
The set Output(j, u) = { 1 ≤ i ≤ |u| | P is a suffix of P[1:j] · u[1:i] }
• This set contains the pattern occurrences.
• Reply in O(l) time.
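Both definitions can be written down directly as brute-force reference code, which is handy for testing a faster implementation. This sketch takes u as an explicit string rather than as a dictionary variable, and it characterises the KMP state after reading u as the longest prefix of P that is a suffix of P[1:j] · u. It makes no attempt at the O(1) and O(l) bounds.

```python
from typing import List


def jump(pattern: str, j: int, u: str) -> int:
    """Jump(j, u): KMP state reached from state j after reading u."""
    w = pattern[:j] + u
    # k = 0 always qualifies, so the maximum is well defined.
    return max(
        k for k in range(min(len(pattern), len(w)) + 1)
        if w.endswith(pattern[:k])
    )


def output(pattern: str, j: int, u: str) -> List[int]:
    """Output(j, u): positions i in u where an occurrence of P ends."""
    return [
        i for i in range(1, len(u) + 1)
        if (pattern[:j] + u[:i]).endswith(pattern)
    ]
```

With P = `ababb`, `jump(P, 4, "abba")` is 1 and `output(P, 4, "abba")` is `[3]`: starting from state 4, the occurrence ends at the third character of u.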
Realization of Jump
For Jump(q, X_k), if X_k is ...
• a: O(1) time
• X_i · X_j: O(1) time, provided the factor concatenation problem for a string of length m can be solved in O(1) time
• [j]X_i or X_i[j]: O(height(X_i)) time
• (X_i)^j: O(1) time
Factor Concatenation Problem
Instance: Two factors x and y of a string P, each represented as a node of the suffix trie of P.
Question: Is the string xy a factor of P? If 'yes', then return its node number.
Example: P = COPACABANA
x = OPA, y = CABAN; concatenating them gives OPACABAN = P[2:9], so the answer is 'Yes'!
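As a baseline, the question can be answered by a plain substring search. This sketch departs from the slide in one respect that should be kept in mind: factors are passed as strings and the answer is a 1-based (start, end) interval of P, whereas the slide represents factors and answers as suffix trie nodes. The function name is illustrative.

```python
from typing import Optional, Tuple


def concat_factor(P: str, x: str, y: str) -> Optional[Tuple[int, int]]:
    """Return the 1-based (start, end) of the leftmost occurrence of xy in P,
    or None if xy is not a factor of P."""
    pos = P.find(x + y)
    if pos < 0:
        return None
    return (pos + 1, pos + len(x) + len(y))
```

For P = `COPACABANA`, `concat_factor(P, "OPA", "CABAN")` returns `(2, 9)`, matching P[2:9] on the slide. The search takes time that grows with m, so it does not meet the O(1) goal.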
Solution to the problem
• Using a suffix trie, it can be solved in O(m) time after O(m^2) time and space preprocessing.
• Using a two-dimensional lookup table, it can be solved in O(1) time, but this needs O(m^4) time and space for preprocessing.
• It can be solved in O(1) time after O(m^2) time and space preprocessing.
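The lookup-table option can be illustrated with a dictionary that precomputes every 'yes' answer. This is a sketch of the table idea only, not of the O(m^2) preprocessing result. It keys the table by pairs of strings for readability, so a lookup still has to hash the strings; the slide's formulation uses node numbers, for which a lookup is a genuine constant-time array access. The name `build_table` is hypothetical.

```python
from typing import Dict, Tuple


def build_table(P: str) -> Dict[Tuple[str, str], Tuple[int, int]]:
    """Map each pair (x, y) with xy a factor of P to the 1-based
    (start, end) interval of the leftmost occurrence of xy."""
    table: Dict[Tuple[str, str], Tuple[int, int]] = {}
    m = len(P)
    for s in range(m):                   # start of the factor P[s:e]
        for e in range(s, m + 1):        # end of the factor (exclusive)
            for k in range(s, e + 1):    # split point between x and y
                # setdefault keeps the first, i.e. leftmost, occurrence.
                table.setdefault((P[s:k], P[k:e]), (s + 1, e))
    return table
```

After `table = build_table("COPACABANA")`, the query of the example is `table.get(("OPA", "CABAN"))`, which yields `(2, 9)`; pairs whose concatenation is not a factor are simply absent.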
Realization of Output
For Output(q, X_k), if X_k is ...
• a: O(1) time
• X_i · X_j: it can be enumerated in O(l) time from the Outputs of X_i and X_j
• [j]X_i or X_i[j]: O(l · height(X_i)) time
• (X_i)^j: O(l) time
Here l is the size of the set Output.
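For concatenation, both functions compose in a simple way that follows from their definitions: Jump(q, X_i · X_j) = Jump(Jump(q, X_i), X_j), and Output(q, X_i · X_j) is Output(q, X_i) together with Output(Jump(q, X_i), X_j) shifted by |X_i|. The sketch below evaluates these identities by memoized recursion for a dictionary that contains only primitive assignments and concatenations. It is not the constant-time realization from the slides, and the tuple encoding and the name `make_matcher` are hypothetical.

```python
from functools import lru_cache


def make_matcher(pattern: str, D: dict):
    """Build Jump and Output for variables of a concatenation-only dictionary.
    D maps k to ("prim", a) or ("cat", i, j)."""
    m = len(pattern)

    def step(q: int, c: str) -> int:
        # Brute-force KMP transition: longest prefix of the pattern
        # that is a suffix of pattern[:q] + c.
        w = pattern[:q] + c
        return max(k for k in range(min(m, len(w)) + 1)
                   if w.endswith(pattern[:k]))

    @lru_cache(maxsize=None)
    def length(k: int) -> int:
        op = D[k]
        if op[0] == "prim":
            return len(op[1])
        return length(op[1]) + length(op[2])

    @lru_cache(maxsize=None)
    def jump(q: int, k: int) -> int:
        op = D[k]
        if op[0] == "prim":
            return step(q, op[1])
        return jump(jump(q, op[1]), op[2])

    @lru_cache(maxsize=None)
    def output(q: int, k: int) -> tuple:
        op = D[k]
        if op[0] == "prim":
            # A single character ends an occurrence iff it leads to state m.
            return (1,) if op[1] and step(q, op[1]) == m else ()
        left = output(q, op[1])
        shift = length(op[1])
        right = output(jump(q, op[1]), op[2])
        return left + tuple(shift + d for d in right)

    return jump, output
```

Each (state, variable) pair is evaluated once, so the recursion never expands a variable into its full string more than the memo table requires.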
Outline of Our Algorithm
Input: a pattern P and a collage system 〈D, S〉 (S = X_{i_1}, X_{i_2}, ..., X_{i_n})
Output: all occurrences of the pattern.
/* preprocess for D and P */
preprocess(D); preprocess(P);
l := 0; q := 0;
for j := 1 to n do begin
    for each d ∈ Output(q, X_{i_j}) do
        report 'pattern occurs at position l + d';
    q := Jump(q, X_{i_j});    /* state transition */
    l := l + |X_{i_j}|;       /* calculate the offset */
end
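The loop above can be run as is once Jump and Output are available. In the sketch below both are replaced by a naive stand-in, `scan`, that feeds the expanded string of each variable through the matcher character by character, so it shows the control flow of the outline but none of its speed. The function names and the representation of S as a list of already expanded phrases are assumptions made for the illustration.

```python
from typing import List, Tuple


def scan(pattern: str, q: int, u: str) -> Tuple[int, List[int]]:
    """Naive stand-in for Jump(q, u) and Output(q, u) on an explicit string u.
    Returns the new state and the positions in u where occurrences end."""
    m = len(pattern)
    hits = []
    for d, c in enumerate(u, start=1):
        w = pattern[:q] + c
        # Longest prefix of the pattern that is a suffix of w.
        q = max(k for k in range(min(m, len(w)) + 1)
                if w.endswith(pattern[:k]))
        if q == m:
            hits.append(d)
    return q, hits


def search(pattern: str, phrases: List[str]) -> List[int]:
    """Report the 1-based end position of every occurrence of the pattern
    in the concatenation of the phrases."""
    occurrences = []
    l = 0  # offset: length of the text covered by the variables read so far
    q = 0  # current KMP state
    for u in phrases:
        q_next, hits = scan(pattern, q, u)
        occurrences.extend(l + d for d in hits)  # Output(q, X_{i_j})
        q = q_next                               # state transition
        l += len(u)                              # calculate the offset
    return occurrences
```

For example, `search("ababb", ["a", "b", "ab", "abba"])` returns `[7]`; the split of the text `abababba` into these four phrases is a hypothetical one chosen for the demonstration.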
Concluding Remarks
Conclusion and Future Works
Our Results
Complexity of our algorithm (with truncation: LZ77, LZSS, etc.):
  O(||D|| + m^2) space
  O((||D|| + |S|) · height(D) + m^2 + r) time
If D contains no truncation (LZ78, LZW, BPE, run-length, etc.):
  O(||D|| + |S| + m^2 + r) time
For comparison, 1998 Kida, et al. (LZW):
  O(n + m^2) space
  O(n + m^2 + r) time
Conclusion
• We introduced a general framework for compressed pattern matching (collage system).
• We proposed a compressed pattern matching algorithm on a collage system and showed its complexity:
  O((||D|| + |S|) · height(D) + m^2 + r) time and O(||D|| + m^2) space;
  if D contains no truncation, O(||D|| + |S| + m^2 + r) time.
Future Works
• Can we reduce the complexity of the preprocessing from O(m^2) to O(m)?
• Improve our algorithm to deal with multiple patterns.
• Develop an approximate pattern matching algorithm on a collage system.
• Develop a new compression method which is suitable for compressed pattern matching.