Hannu Peltola Jorma Tarhio
Aalto University
Finland
Variations of Forward-SBNDM
Aug. 29, 2011
Aims
Tuning algorithms for exact string matching.
Studying the effect of simultaneous 2-byte read.
SBNDM: Simple Backward Nondeterministic DAWG Matching
SBNDM [18] is a simplification of BNDM [17]. Both are bit-parallel algorithms.
Text T = t1...tn, pattern P = p1...pm.
At each alignment window of P in T, scan T from right to left until the suffix of the window is not a factor of P or an occurrence of P is found.
Shift of SBNDM
No factor: m
P found: 1
Else: next alignment starts at the last factor
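The scan and the three shift rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sbndm_search` is hypothetical, the bit layout (bit m-1-k for pattern position k) is one common convention, and Python integers stand in for machine words, so the usual limit of m to the word size does not show up here.

```python
def sbndm_search(pattern, text):
    """Return the start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    full = (1 << m) - 1
    # Occurrence vectors: B[c] has bit (m-1-k) set when pattern[k] == c.
    B = {}
    for k, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - k))

    occurrences = []
    pos = 0                      # start of the current alignment window
    while pos <= n - m:
        D = full                 # state vector
        j = 0                    # length of the suffix known to be a factor
        while True:
            # Read the window from right to left.
            D &= B.get(text[pos + m - 1 - j], 0)
            if D == 0:           # suffix of length j+1 is not a factor
                break
            j += 1
            if j == m:           # whole window matched
                break
            D = (D << 1) & full
        if j == m:
            occurrences.append(pos)
            pos += 1             # "P found: 1"
        else:
            pos += m - j         # j == 0 gives the full shift m ("No factor")
    return occurrences


print(sbndm_search("banana", "antanabadbanana"))  # [9]
```

With j > 0 the next window starts at the last factor, which corresponds to the "Else" rule.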
SBNDM, example
P = banana, T = antanabadbanana...
(brackets mark the current alignment window)
alignment: [antana]badbanana
factors read from the right: a, na, ana
not a factor: tana
next alignment: ant[anabad]banana
not a factor: d
next alignment: antanabad[banana]
SBNDMq
SBNDMq [6] is a tuned version of SBNDM.
Processing of an alignment starts with checking a q-gram.
Let q = 4. Consider an alignment at antana. Instead of testing four suffixes a, na, ana, tana, only tana is tested.
Testing is done in a fast loop.
Forward-SBNDM
Forward-SBNDM (FSB for short) by Faro & Lecroq [7] is a lookahead version of SBNDM2.
Both FSB and SBNDM2 read a 2-gram x1x2 before a factor test.
x1x2 is matched with the end of P in SBNDM2.
Only x1 is matched with the end of P in FSB, and x2 is a lookahead character following the current alignment.
FSB is faster than SBNDM2 for large alphabets.
Generalization of FSB: FSB(q,f)
FSB(q,f) (= Forward-SBNDM(q,f)) is SBNDMq with f lookahead characters, f = 0, 1, ..., q-1.
FSB(2,1) = FSB and FSB(q,0) = SBNDMq.
Motivation: SBNDMq works well on modern processors also for q>2.
FSB(q,f)
Let UV be a q-gram, where |V| = f.
After reading UV there are 3 alternatives:
i. If U is a suffix of P, reading continues leftwards.
ii. Else if UV is a factor of P, reading continues leftwards.
iii. Else the state vector is zero and P is shifted m-q+f+1 positions (f positions more than in SBNDMq).
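One way to read the three alternatives is as SBNDMq run on P extended with f don't-care positions. The Python sketch below follows that reading; it is an assumption-laden illustration, not the authors' code. The name `fsb_search`, the padding of the text with f NUL characters (so that the last alignments still have lookahead characters), and the use of Python integers as state vectors are all choices made here.

```python
def fsb_search(pattern, text, q=4, f=2):
    """Sketch of FSB(q,f): return start positions of pattern in text."""
    m, n = len(pattern), len(text)
    width = m + f                      # pattern plus f lookahead positions
    if not (0 <= f < q <= width):
        raise ValueError("need 0 <= f < q <= m + f")
    low = (1 << f) - 1                 # extra bits: lookahead matches anything
    B = {}
    for k, c in enumerate(pattern):
        B[c] = B.get(c, low) | (1 << (width - 1 - k))

    padded = text + "\0" * f           # assumption: sentinel lookahead at the end
    occurrences = []
    pos = 0
    while pos + width <= len(padded):
        end = pos + width - 1          # last lookahead character
        # Combined suffix/factor test on the q-gram UV, read right to left.
        D = B.get(padded[end], low)
        for j in range(1, q):
            D = (D << 1) & B.get(padded[end - j], low)
        if D == 0:                     # alternative iii
            pos += width - q + 1       # = m - q + f + 1
            continue
        j = q                          # alternatives i and ii: keep reading left
        while j < width:
            D = (D << 1) & B.get(padded[end - j], low)
            if D == 0:
                break
            j += 1
        if j == width:
            occurrences.append(pos)
            pos += 1
        else:
            pos += width - j
    return occurrences


print(fsb_search("banana", "antanabadbanana", q=4, f=2))  # [9]
```

With f = 0 the extra bits vanish and the sketch degenerates to a plain q-gram test, in line with FSB(q,0) = SBNDMq.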
Occurrence vectors in FSB(q,2)
Example: P = banana
SBNDMq: B[n] = 00001010
FSB(q,2): B[n] = 00101011
FSB(q,2): B[a] = 01010111
FSB(q,2): B[x] = 00000011
The two rightmost bits of the FSB(q,2) vectors are the extra bits.
State vectors in FSB(q,2) for q=4
4-gram nanx:
x 00000011
n 00101011
a 01010111
n 00101011
state vector: 00001000

4-gram  State vector  Conclusion
nanx    00001000      na is a suffix of P
xana    00000000      not a factor
anan    01000000      factor of P

nanx is not a factor
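The occurrence vectors and the first two rows of the table can be reproduced with a short Python sketch. The helper names `fsb_vectors` and `qgram_state` are hypothetical, and the construction (shift the SBNDMq vector left by f and set the f extra bits) is inferred from the example vectors on the slides rather than taken from the authors' code.

```python
def fsb_vectors(pattern, f):
    """Occurrence vectors with f extra bits set to 1 at the right end."""
    width = len(pattern) + f
    low = (1 << f) - 1                 # vector of a character not in P
    B = {}
    for k, c in enumerate(pattern):
        B[c] = B.get(c, low) | (1 << (width - 1 - k))
    return B, low


def qgram_state(B, low, gram):
    """State vector after reading the q-gram from right to left."""
    D = B.get(gram[-1], low)
    for c in reversed(gram[:-1]):
        D = (D << 1) & B.get(c, low)
    return D


B, low = fsb_vectors("banana", f=2)
print(format(B["n"], "08b"))                       # 00101011
print(format(B["a"], "08b"))                       # 01010111
print(format(qgram_state(B, low, "nanx"), "08b"))  # 00001000
print(format(qgram_state(B, low, "xana"), "08b"))  # 00000000
```

The nonzero state for nanx comes from the extra bits: n and x sit on lookahead positions, so only na has to match the end of P.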
Benefits / drawbacks of lookahead characters and extra bits
Benefits
• Longer shifts => more speed
• Combined suffix/factor test
Drawback
• More q-grams accepted => less speed
Greedy skip loop for SBNDM2 (GSB2 = Greedy-SBNDM2)
Factor tests of two 2-grams are done in one round.
Let B2[x,y] denote the combined occurrence vector of characters x and y. B2[x,y] = B[x] & (B[y]<<1)
next: D ← B2[ti, ti+1]
if D = 0 then if B2[ti+m-1, ti+m] = 0 then i ← i+2*m-2
goto next
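The pseudocode above is only a fragment, so the following Python sketch is one possible reading of the greedy skip loop, not the authors' implementation. Assumptions made here: i indexes the first character of the 2-gram at the end of the window, a single failed test falls back to the ordinary SBNDM2 shift of m-1 (this else branch does not appear on the slide), and the name `gsb2_search` is hypothetical.

```python
def gsb2_search(pattern, text):
    """Sketch of SBNDM2 with a greedy skip loop; returns match positions."""
    m, n = len(pattern), len(text)
    if m < 2:
        raise ValueError("the 2-gram test needs m >= 2")
    B = {}
    for k, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - k))

    def b2(x, y):
        # Combined occurrence vector: B2[x,y] = B[x] & (B[y] << 1)
        return B.get(x, 0) & (B.get(y, 0) << 1)

    occurrences = []
    i = m - 2                          # text[i], text[i+1] end the window
    while i + 1 < n:
        D = b2(text[i], text[i + 1])
        if D == 0:
            # Greedy step: also test the 2-gram one shift ahead.
            if i + m < n and b2(text[i + m - 1], text[i + m]) == 0:
                i += 2 * m - 2
            else:
                i += m - 1             # assumed fallback: ordinary shift
            continue
        j = 2                          # characters known to form a factor
        while j < m:
            D = (D << 1) & B.get(text[i + 1 - j], 0)
            if D == 0:
                break
            j += 1
        if j == m:
            occurrences.append(i - m + 2)
            i += 1
        else:
            i += m - j
    return occurrences


print(gsb2_search("banana", "antanabadbanana"))  # [9]
```

The double shift is safe because neither tested 2-gram is a factor of P, so no occurrence can cover either of them.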
2-byte read
Read two characters (= 2 bytes = 16 bits) in one instruction (in a skip loop).
Well suited to q-gram algorithms with even q.
For the experiments we made two versions of the algorithms:
• Standard (1-byte read)
• b-version using 2-byte read
2-byte read (cont.)
Advantage: a part of the computation can be moved to the preprocessing phase
• Example: B2[x,y] = B[x] & (B[y]<<1)
Speed-up factor even more than 2
Drawback: extra 0.1 ms for preprocessing.
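The idea of moving B2 into preprocessing can be illustrated with a table indexed by the 16-bit value of a 2-byte read. The Python sketch below only mimics what a 16-bit load would do in C; the helper names `build_b`, `build_b2` and `read2` are hypothetical, and little-endian byte order is an assumption (the first byte x lands in the low 8 bits of the index).

```python
def build_b(pattern: bytes):
    """Occurrence vectors B[c] for all 256 byte values."""
    m = len(pattern)
    B = [0] * 256
    for k, byte in enumerate(pattern):
        B[byte] |= 1 << (m - 1 - k)
    return B


def build_b2(B):
    """Precompute B2[x,y] = B[x] & (B[y] << 1) for every 16-bit index.

    Index layout (assumed little-endian read): key = x + 256 * y.
    """
    return [B[key & 0xFF] & (B[key >> 8] << 1) for key in range(1 << 16)]


def read2(text: bytes, i: int) -> int:
    """Stand-in for a single 16-bit load of text[i], text[i+1]."""
    return int.from_bytes(text[i:i + 2], "little")


B = build_b(b"banana")
B2 = build_b2(B)
text = b"antanabadbanana"
print(format(B2[read2(text, 4)], "06b"))  # 2-gram "na": 001010
print(format(B2[read2(text, 7)], "06b"))  # 2-gram "ad": 000000
```

In the search loop a single table lookup then replaces two lookups, a shift and an AND; filling the 65536-entry table is the extra preprocessing cost.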
4-byte read?
Many border crossings happen => slowdown
Tables with 2^32 entries are too big in practice
Experimental results/KJV Bible
In the recent comparison by S. Faro and T. Lecroq, "The Exact String Matching Problem: a Comprehensive Experimental Evaluation" (2010), the algorithms EBOM and Hash3 were the fastest on the Bible text for m = 4, ..., 20.
m      4     8     16
Hash3  14.6  5.42  2.79
EBOM   6.53  3.87  2.91
KJV: EBOM & Hash3 (on ThinkPad X61s)
[Chart: search speed in GB/s (axis 0 to 4) against pattern length m = 4, 8, 12, 16, 20; series: EBOM, Hash3]
KJV: EBOMb & Hash3b (with 2-byte read) added
[Chart: search speed in GB/s (axis 0 to 4) against pattern length m = 4, 8, 12, 16, 20; series: EBOM, EBOMb, Hash3, Hash3b]
KJV: SBNDM2b = FSB(2,0)b added
[Chart: search speed in GB/s (axis 0 to 4) against pattern length m = 4, 8, 12, 16, 20; series: EBOM, EBOMb, Hash3, Hash3b, FSB(2,0)b]
KJV: GSB2b added
[Chart: search speed in GB/s (axis 0 to 4) against pattern length m = 4, 8, 12, 16, 20; series: EBOM, EBOMb, Hash3, Hash3b, FSB(2,0)b, GSB2b]
KJV: FSB(4,i)b added, i = 0,1,2
[Chart: search speed in GB/s (axis 0 to 4) against pattern length m = 4, 8, 12, 16, 20; series: EBOM, EBOMb, Hash3, Hash3b, FSB(2,0)b, GSB2b, FSB(4,0)b, FSB(4,1)b, FSB(4,2)b]
KJV: Speed-up factors of 2-byte read
GSB2      1.32
FSB(2,0)  1.34
FSB(2,1)  1.24
FSB(4,0)  1.72
FSB(4,1)  2.15
FSB(4,2)  2.03
Hash3     1.05
EBOM      1.17
Other experiments
DNA and binary data were also tested.
• The gain from lookahead characters or the greedy loop was smaller than with the Bible data.
Gain of 2-byte read was smaller with 64-bit code than with 32-bit code.
Conclusions
Two new algorithms were presented:
• FSB(q,f)
• GSB2
The new algorithms are faster than earlier algorithms on English data:
• GSB2 for m = 4, …, 8
• FSB(q,f) for m = 8, …, 20
2-byte read makes most string algorithms faster.
Web site for practical speed comparison
cse.aalto.fi/stringmatching