Information Retrieval (Recuperació de la informació)
![Page 1: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/1.jpg)
Information Retrieval
• Modern Information Retrieval (1999), Ricardo Baeza-Yates and Berthier Ribeiro-Neto
• Flexible Pattern Matching in Strings (2002), Gonzalo Navarro and Mathieu Raffinot
  http://static.ppurl.com/chmview-V1JRYFF-BnMAZgFqD1NVOlZ0VzMMZgdqUDABMwI9BWc=/0001.html
• Algorithms on Strings (2001), M. Crochemore, C. Hancart and T. Lecroq
  http://www-igm.univ-mlv.fr/~lecroq/string/index.html
![Page 2: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/2.jpg)
String Matching
String matching: definition of the problem (text, pattern)
• Exact matching: depends on what we have, text or patterns
  • The patterns ---> Data structures for the patterns
    • 1 pattern ---> The algorithm depends on |p| and |Σ|
    • k patterns ---> The algorithm depends on k, |p| and |Σ|
    • Extensions
    • Regular expressions
  • The text ---> Data structure for the text (suffix tree, ...)
• Approximate matching:
  • Dynamic programming
  • Sequence alignment (pairwise and multiple)
  • Sequence assembly: hash algorithm
• Probabilistic search:
  • Hidden Markov Models
![Page 3: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/3.jpg)
Extended string matching
• Classes of characters: some DNA files or patterns contain extra characters such as N or R, where N = {A,C,G,T} and R = {G,A}.
• Bounded length gaps: we search for a pattern such as ATx(2,3)TA, where x(2,3) means any 2 or 3 characters.
• Optional characters: we search for a pattern such as AC?ACT?T?A, where C? means that C may or may not appear in the text.
• Wild cards: we search for a pattern such as AT*TA, where * means an arbitrarily long string.
• Repeatable characters: we search for a pattern such as AT[TA]*AT, where [TA]* means that TA can appear zero or more times.
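As a rough illustration (not part of the original slides), each of these extensions has a close counterpart in regular-expression syntax. The sketch below uses Python's standard `re` module; the helper name `to_regex` and the sample texts are assumptions made for the example. Note that in regex syntax `[TA]` is a character class (T or A).

```python
import re

# IUPAC classes used in the slide, written as regex character classes.
IUPAC = {"N": "[ACGT]", "R": "[GA]"}

def to_regex(pattern):
    """Replace class symbols by regex character classes (hypothetical helper)."""
    return "".join(IUPAC.get(c, c) for c in pattern)

# (regex, sample text) pairs; the sample texts are made up for the example.
examples = {
    "classes":    (to_regex("ATNTR"), "GGATCTGC"),   # N = any base, R = G or A
    "gaps":       (r"AT.{2,3}TA",     "CATGGTAC"),   # x(2,3): any 2 or 3 characters
    "optional":   (r"AC?ACT?T?A",     "GAACTAG"),    # C?, T?: may or may not appear
    "wildcard":   (r"AT.*TA",         "ATCCCCCTA"),  # *: arbitrarily long string
    "repeatable": (r"AT[TA]*AT",      "GATTATAT"),   # [TA]*: zero or more repetitions
}

for name, (regex, text) in examples.items():
    match = re.search(regex, text)
    print(name, regex, match.group(0) if match else None)
```

A general regex engine is only a convenient reference here; the algorithms in the following slides handle these extensions with specialised shift tables and bit masks.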
![Page 4: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/4.jpg)
Classes of characters
Some classes of characters are represented by one symbol. For instance, the IUPAC code for the DNA alphabet is:
R = {G,A}   Y = {T,C}   K = {G,T}   M = {A,C}   S = {G,C}   W = {A,T}
B = {G,T,C}   D = {G,A,T}   H = {A,C,T}   V = {G,C,A}   N = {A,G,C,T} (any)
1. Classes of characters in the text: there are characters in the text that represent sets of symbols.
2. Classes of characters in the pattern: there are characters in the pattern that represent sets of symbols.
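The IUPAC table above can be written down directly as a dictionary. This is a minimal sketch, assuming that two symbols are considered compatible when their sets share at least one base; the function name `compatible` is made up for the example.

```python
# IUPAC nucleotide classes as listed above; plain bases map to themselves.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}

def compatible(x, y):
    """True if symbols x and y can denote the same base (their sets intersect)."""
    return bool(set(IUPAC[x]) & set(IUPAC[y]))

print(compatible("A", "R"))  # A belongs to R = {G, A}
print(compatible("T", "R"))  # T does not belong to R
print(compatible("Y", "K"))  # Y and K share T
```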
![Page 5: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/5.jpg)
Extended alphabets
First part
Classes in the text
![Page 6: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/6.jpg)
Classes in the text: Brute force algorithm
Text: over the 2^|Σ| subsets of Σ
Pattern: over Σ
• How is the comparison made? From left to right: prefix
• What is the next position of the window? The window is shifted only one cell
We need the operation: does a symbol belong to a set?
![Page 7: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/7.jpg)
Classes in the text: Brute force algorithm
Every subset of Σ is represented by a string of bits of length |Σ|.
When |Σ| < computer word size
For instance, given the DNA alphabet Σ = {A,C,G,T}: I(A)=(1,0,0,0), I(C)=(0,1,0,0), ..., I(R)=I(G,A)=(1,0,1,0), ..., I(N)=(1,1,1,1)
Then the operation “A belongs to set X” is made with I(A) & I(X) > 0
text : G T A R T R N A G G A ...
       A T G T A
         A T G T A
           A T G T A
I(A) & I(T) > 0
I(A) & I(R) > 0   I(T) & I(T) > 0   I(G) & I(R) > 0   I(T) & I(A) > 0
I(A) & I(N) > 0   I(T) & I(R) > 0
...
What is the cost?
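The membership test with bit masks can be sketched as follows. This is an illustrative sketch rather than the slides' own code: masks are stored as Python integers (bit i stands for the i-th letter of Σ, so the bit order differs from the tuples above), and only the classes R and N are defined.

```python
SIGMA = "ACGT"
CLASSES = {"R": "GA", "N": "ACGT"}  # only the classes used in the example

def I(symbol):
    """Bit mask of a symbol: bit i is set if SIGMA[i] belongs to its class."""
    members = CLASSES.get(symbol, symbol)
    return sum(1 << SIGMA.index(c) for c in members)

def brute_force(pattern, text):
    """Try every window; a position matches when the two masks intersect."""
    m, n = len(pattern), len(text)
    hits = []
    for pos in range(n - m + 1):
        if all(I(pattern[j]) & I(text[pos + j]) for j in range(m)):
            hits.append(pos)
    return hits

print(brute_force("ATGTA", "GTARTRNAGGA"))
```

Each membership test is a single AND, so the classes add no extra cost per comparison; the window still moves one cell at a time, which gives up to |p| comparisons for each text position in the worst case.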
![Page 13: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/13.jpg)
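Classes in the text
Experimental efficiency (Navarro & Raffinot)
[Figure: regions where Horspool, BNDM and BOM are the most efficient, plotted against the alphabet size |Σ| (2 to 64) and the pattern length (2 to 256), with the computer word size w marked]
BNDM : Backward Nondeterministic Dawg Matching
BOM : Backward Oracle Matching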
![Page 14: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/14.jpg)
Classes in the text: Horspool algorithm
• How is the comparison made? Suffix search
• What is the next position of the window? If “a” is the last character of the window, shift until the next occurrence of “a” (or “t”, “r”, …) in the pattern
We need a shift table with the extended alphabet.
![Page 15: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/15.jpg)
Classes in the text: Horspool example
Given the pattern ATGTA
• The shift table is:
A 4   C 5   G 2   T 1   R 2   …   N 1
The shift of a class is the minimum shift of its members: R = min(A, G) = 2 and N = 1.
text : G T A R T R N A A G G A ...
       A T G T A
         A T G T A
             A T G T A
…
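The shift table with the extended alphabet can be computed as sketched below. This is a minimal sketch under stated assumptions: only the classes R and N are defined, the function names are made up, and a class symbol in the text takes the smallest shift among its members so that no occurrence is skipped.

```python
CLASSES = {"R": "GA", "N": "ACGT"}  # subset of the IUPAC code used in the example

def members(symbol):
    """Set of plain letters denoted by a text symbol."""
    return CLASSES.get(symbol, symbol)

def shift_table(pattern, sigma="ACGT"):
    m = len(pattern)
    shift = {c: m for c in sigma}
    for i, c in enumerate(pattern[:-1]):   # the last position is excluded
        shift[c] = m - 1 - i
    # A class takes the smallest shift of its members to stay safe.
    for cls in CLASSES:
        shift[cls] = min(shift[c] for c in members(cls))
    return shift

def horspool(pattern, text):
    m, shift, hits = len(pattern), shift_table(pattern), []
    pos = 0
    while pos <= len(text) - m:
        j = m - 1
        # Suffix search: compare the window from right to left.
        while j >= 0 and pattern[j] in members(text[pos + j]):
            j -= 1
        if j < 0:
            hits.append(pos)
        pos += shift[text[pos + m - 1]]
    return hits

print(shift_table("ATGTA"))
print(horspool("ATGTA", "GTARTRNAAGGA"))
```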
![Page 20: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/20.jpg)
Classes in the text: BNDM algorithm
Search for suffixes of the text window that are factors of the pattern
• How is the comparison made? The factors matched so far are denoted by a vector of bits, for instance D2 = ( 1 0 0 0 1 0 0 )
Once the next character x is read, D3 = (D2 << 1) & B(x)
B(x): mask of x in the pattern P. For instance, if B(x) = ( 0 0 1 1 0 0 0 )
D3 = ( 0 0 0 1 0 0 0 ) & ( 0 0 1 1 0 0 0 ) = ( 0 0 0 1 0 0 0 )
• What is the next position of the window? It depends on the value of the leftmost bit of D
![Page 21: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/21.jpg)
Classes in the text: BNDM example
Given the pattern ATGTA
• The masks of bits are
B(A) = ( 1 0 0 0 1 )   B(R) = ( 1 0 1 0 1 )
B(C) = ( 0 0 0 0 0 )   …
B(G) = ( 0 0 1 0 0 )   B(N) = ( 1 1 1 1 1 )
B(T) = ( 0 1 0 1 0 )
• text : G T A R T R N A G G A C G ...
Window G T A R T:
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 1 0 1 0 1 ) = ( 1 0 1 0 0 )
D3 = ( 0 1 0 0 0 ) & ( 1 0 0 0 1 ) = ( 0 0 0 0 0 )
Window R T R N A:
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 1 1 1 1 1 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 1 0 1 0 1 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 )
D5 = ( 1 0 0 0 0 ) & ( 1 0 1 0 1 ) = ( 1 0 0 0 0 )
…
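The trace above can be reproduced with a short BNDM sketch. It is an illustration under stated assumptions, not the authors' code: masks are Python integers whose highest bit stands for the leftmost pattern position, only the classes R and N are defined, and the mask of a class is the OR of the masks of its members.

```python
CLASSES = {"R": "GA", "N": "ACGT"}  # subset of the IUPAC code used in the example

def masks(pattern, sigma="ACGT"):
    m = len(pattern)
    B = {c: 0 for c in sigma}
    for i, c in enumerate(pattern):
        B[c] |= 1 << (m - 1 - i)       # leftmost pattern position = highest bit
    for cls, chars in CLASSES.items():
        B[cls] = 0
        for c in chars:                # a text class matches wherever a member does
            B[cls] |= B[c]
    return B

def bndm(pattern, text):
    m, B = len(pattern), masks(pattern)
    full, top = (1 << m) - 1, 1 << (m - 1)
    hits, pos = [], 0
    while pos <= len(text) - m:
        j, last, D = m, m, full
        while D:
            D &= B[text[pos + j - 1]]  # read the window backwards
            j -= 1
            if D & top:                # the suffix read is a prefix of the pattern
                if j > 0:
                    last = j           # remember it for the next shift
                else:
                    hits.append(pos)   # the whole window matched
            D = (D << 1) & full
        pos += last
    return hits

B = masks("ATGTA")
print({c: format(B[c], "05b") for c in "ACGTRN"})
print(bndm("ATGTA", "GTARTRNAGGACG"))
```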
![Page 27: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/27.jpg)
BOM algorithm (Backward Oracle Matching)
Automaton: Factor Oracle of the pattern
• How is the comparison made? Check if the suffix is a factor
• What is the next position of the window? The position determined by the last character of the text with a transition in the automaton
![Page 28: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/28.jpg)
Classes in the text: BOM example
Then we build the AFO (factor oracle automaton) of the reverse of the pattern ATGTATG …
[Figure: factor oracle automaton of the reversed pattern]
… and we try to find it in the text: G T A R T R N A A T G …
No improvement is possible!
![Page 29: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/29.jpg)
Multiple string matching
[Figure: four panels, for 5, 10, 100 and 1000 strings, plotting the alphabet size |Σ| (2 to 8) against lmin (5 to 45); the regions are labelled Wu-Manber, SBOM and Ad AC]
![Page 30: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/30.jpg)
Classes in the text: Set Horspool algorithm
• How is the comparison made? By suffixes, using the trie of all reversed patterns
• What is the next position of the window?
?
![Page 31: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/31.jpg)
Set Horspool algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG
1. Construct the trie of the reversed patterns GTATGTA, GTAT, TAATA and GTGTA
[Figure: trie of the reversed patterns]
2. Determine lmin = 4
3. Determine the shift table: A 1   C 4 (lmin)   G 2   T 1
4. Find the patterns
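The four steps can be sketched as follows for plain patterns and plain text. This is an assumed implementation for illustration only: the trie is a nested dictionary, the key `None` marks the end of a reversed pattern, and the sample text is made up.

```python
def set_horspool(patterns, text):
    lmin = min(len(p) for p in patterns)

    # 1. Trie of the reversed patterns; the key None marks the end of a pattern.
    trie = {}
    for p in patterns:
        node = trie
        for c in reversed(p):
            node = node.setdefault(c, {})
        node[None] = p

    # 3. Shift table: distance from the last occurrence of each character
    #    (final position excluded) to the end of a pattern, at most lmin.
    shift = {}
    for p in patterns:
        for i, c in enumerate(p[:-1]):
            shift[c] = min(shift.get(c, lmin), len(p) - 1 - i)

    # 4. Search: read the text backwards from the end of the window.
    hits = []
    pos = lmin - 1                     # index of the last character of the window
    while pos < len(text):
        node, j = trie, pos
        while j >= 0 and text[j] in node:
            node = node[text[j]]
            j -= 1
            if None in node:
                hits.append((j + 1, node[None]))
        pos += shift.get(text[pos], lmin)
    return shift, hits

shift, hits = set_horspool(["ATGTATG", "TATG", "ATAAT", "ATGTG"], "CTATGTGACA")
print(shift)   # characters missing from the table (here C) shift by lmin
print(hits)
```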
![Page 32: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/32.jpg)
Classes in the text: Set Horspool
Search for the patterns ATGTATG, TATG, ATAAT, ATGTG
[Figure: trie of the reversed patterns]
text: ARTGNCTATGTGACA…
No improvement is possible!
![Page 34: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/34.jpg)
Classes in the text: SBOM algorithm
Automaton: Factor Oracle (reversed patterns of length lmin)
• How is the comparison made? Check if the suffix is a factor of any pattern
• What is the next position of the window? The position determined by the last character of the text with a transition in the automaton
![Page 35: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/35.jpg)
Classes in the text: SBOM example
Search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG
[Figure: factor oracle automaton of the reversed patterns of length lmin]
text: ACATN C TAGC TA TA ATAATGTATG
No improvement is possible!
![Page 36: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/36.jpg)
Extended alphabets
Classes in the:   text   pattern
Horspool           ✓
BNDM               ✓
BOM                ✗
Set-Horspool       ✗
SBOM               ✗
![Page 37: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/37.jpg)
Extended search
Second part
Classes in the pattern
![Page 38: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/38.jpg)
Classes in the pattern: Brute force algorithm
Text: over Σ
Pattern: over the 2^|Σ| subsets of Σ
• How is the comparison made? From left to right: prefix
• What is the next position of the window? The window is shifted only one cell
We need the operation: does a symbol belong to a set?
![Page 39: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/39.jpg)
Classes in the pattern: Brute force algorithm
Every subset of Σ is represented by a string of bits of length |Σ|.
When |Σ| < computer word size
For instance, given the DNA alphabet Σ = {A,C,G,T}: I(A)=(1,0,0,0), I(C)=(0,1,0,0), ..., I(R)=(1,0,1,0), ..., I(N)=(1,1,1,1)
Then the operation “A belongs to set X” is made with I(A) & I(X) > 0
text : G T A C T A G A G G A C G T A T G T A C T G ...
       A T N T R
         A T N T R
I(T) & I(R) > 0
I(A) & I(R) > 0   I(T) & I(T) > 0   I(C) & I(N) > 0   I(A) & I(T) > 0   …
![Page 41: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/41.jpg)
Classes in the pattern: Horspool algorithm
• How is the comparison made? Suffix search
• What is the next position of the window? If “a” is the last character of the window, shift until the next occurrence of “a” in the pattern
We need a preprocessing phase to construct the shift table.
![Page 42: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/42.jpg)
Classes in the pattern: Horspool example
Given the pattern ATNTR
• The shift table is:
A 2   C 2   G 2   T 1
text : G T A C T A G A T A T G A G ...
       A T N T R
         A T N T R
             A T N T R
                 A T N T R
A T G T A
Shorter shifts!
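When the classes are in the pattern, every member of a class at position i contributes a possible shift for that position. The sketch below is illustrative: the function names are made up and only the classes R and N are defined.

```python
CLASSES = {"R": "GA", "N": "ACGT"}  # subset of the IUPAC code used in the example

def members(symbol):
    """Set of plain letters denoted by a pattern symbol."""
    return CLASSES.get(symbol, symbol)

def shift_table(pattern, sigma="ACGT"):
    m = len(pattern)
    shift = {c: m for c in sigma}
    for i, p in enumerate(pattern[:-1]):   # the last position is excluded
        for c in members(p):
            shift[c] = m - 1 - i           # later positions give smaller shifts
    return shift

def horspool(pattern, text):
    m, shift, hits = len(pattern), shift_table(pattern), []
    pos = 0
    while pos <= len(text) - m:
        j = m - 1
        # Suffix search: compare the window from right to left.
        while j >= 0 and text[pos + j] in members(pattern[j]):
            j -= 1
        if j < 0:
            hits.append(pos)
        pos += shift[text[pos + m - 1]]
    return hits

print(shift_table("ATNTR"))
print(horspool("ATNTR", "GTACTAGATATGAG"))
```

A single N in the pattern caps every shift at its distance from the end, which is why the shifts are shorter than for ATGTA.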
![Page 50: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/50.jpg)
Classes in the pattern: BNDM example
Given the pattern ATNTR
• The masks of bits of symbols are
B(A) = ( 1 0 1 0 1 )
B(C) = ( 0 0 1 0 0 )
B(G) = ( 0 0 1 0 1 )
B(T) = ( 0 1 1 1 0 )
• text : G T A C T A G A G G A C G T A T G T A C T G ...
Window G T A C T:
D1 = ( 0 1 1 1 0 )
D2 = ( 1 1 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D3 = ( 0 1 0 0 0 ) & ( 1 0 1 0 1 ) = ( 0 0 0 0 0 )
Window A G A G G:
D1 = ( 0 0 1 0 1 )
D2 = ( 0 1 0 1 0 ) & ( 0 0 1 0 1 ) = ( 0 0 0 0 0 )
Window A C G T A:
D1 = ( 1 0 1 0 1 )
D2 = ( 0 1 0 1 0 ) & ( 0 1 1 1 0 ) = ( 0 1 0 1 0 )
D3 = ( 1 0 1 0 0 ) & ( 0 0 1 0 1 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 0 0 0 )
…
![Page 57: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/57.jpg)
Classes in the pattern: BOM example
• Given the pattern ATGTATG, the AFO is
[Figure: factor oracle automaton of the pattern]
But for the pattern ATNTRTG?
We should apply the SBOM algorithm!
![Page 59: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/59.jpg)
Set Horspool algorithm
• How is the comparison made? By suffixes, using the trie of all reversed patterns
• What is the next position of the window? If “a” is the last character of the window, we shift until “a” is aligned with the first “a” in the trie at a depth not greater than lmin, or by lmin if there is none
![Page 60: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/60.jpg)
Set Horspool algorithm
Search for ATNTARG, RTGR, NTTNAR, ATRTG
1. Construct the trie of the 46 possible reversed patterns
2. Determine lmin = 4
3. Determine the shift table: A 1   C 2   G 1   T 1
4. Find the patterns
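The count of 46 patterns and the shift table can be checked with a few lines. This is a sketch under stated assumptions: only the classes R and N are defined, and the function names `expand` and `shift_table` are made up for the example.

```python
from itertools import product

CLASSES = {"R": "GA", "N": "ACGT"}  # subset of the IUPAC code used in the example

def expand(pattern):
    """All plain strings denoted by a pattern with classes."""
    return ["".join(p) for p in product(*(CLASSES.get(c, c) for c in pattern))]

def shift_table(patterns, sigma="ACGT"):
    """Set Horspool shifts: distance to the end of a pattern, at most lmin."""
    lmin = min(len(p) for p in patterns)
    shift = {c: lmin for c in sigma}
    for p in patterns:
        for i, sym in enumerate(p[:-1]):   # the last position is excluded
            for c in CLASSES.get(sym, sym):
                shift[c] = min(shift[c], len(p) - 1 - i)
    return shift

patterns = ["ATNTARG", "RTGR", "NTTNAR", "ATRTG"]
expanded = [s for p in patterns for s in expand(p)]
print(len(expanded))           # 8 + 4 + 32 + 2 plain patterns
print(shift_table(patterns))
```

The table computed from the class patterns is the same one obtained from the expanded plain patterns, so the expansion is only needed to build the trie.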
![Page 62: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/62.jpg)
SBOM algorithm
Automaton: Factor Oracle (reversed patterns of length lmin)
• How is the comparison made? Check if the suffix is a factor of any pattern
• What is the next position of the window? The position determined by the last character of the text with a transition in the automaton
![Page 63: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/63.jpg)
Classes in the patterns: SBOM example
Given the patterns ATGNARG, TRATR, TAATAAT and ANTNTGR,
the Factor Oracle automaton of all 21 possible patterns is built …
![Page 64: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/64.jpg)
Extended alphabets
Classes in the:   text   pattern
Horspool           ✓       ✓
BNDM               ✓       ✓
BOM                ✗       ≈
Set-Horspool       ✗       ≈
SBOM               ✗       ≈
![Page 66: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/66.jpg)
Bounded length gaps: BNDM example
Given the pattern ATx(2,3)TA
• The masks of bits are
B(A) = ( 1 0 1 1 1 0 1 )
B(C) = ( 0 0 1 1 1 0 0 )
B(G) = ( 0 0 1 1 1 0 0 )
B(T) = ( 0 1 1 1 1 1 0 )
![Page 67: Recuperació de la informació](https://reader030.vdocuments.mx/reader030/viewer/2022032805/56813196550346895d98087e/html5/thumbnails/67.jpg)
Bounded length gaps: BNDM example
Given the pattern ATx(2,3)TA
• text : A T A G T A G A G T ...
• The masks of bits are
B(A) = ( 1 0 1 1 1 0 1 )
B(C) = ( 0 0 1 1 1 0 0 )
B(G) = ( 0 0 1 1 1 0 0 )
B(T) = ( 0 1 1 1 1 1 0 )
D1 = ( 1 0 1 1 1 0 1 )
D2 = ( 0 1 1 1 0 1 0 ) & ( 0 1 1 1 1 1 0 ) = ( 0 1 1 1 0 1 0 )
D3 = ( 1 1 1 0 1 0 0 ) & ( 0 0 1 1 1 0 0 ) = ( 0 0 1 0 1 0 0 )
D4 = ( 0 1 0 1 0 0 0 ) & ( 1 0 1 1 1 0 1 ) = ( 0 0 0 1 0 0 0 )
D5 = ( 0 0 1 0 0 0 0 ) & ( 0 1 1 1 1 1 0 ) = ( 0 0 1 0 0 0 0 )
D6 = ( 0 1 0 0 0 0 0 ) & ( 1 0 1 1 1 0 1 ) = ( 0 0 0 0 0 0 0 ) ?
Let’s see the automaton for AT***TA:
[Figure: automaton of the pattern with an ε-transition]
F = ( 0 1 0 0 0 0 0 )
I = ( 0 0 0 1 0 0 0 )
F - I = ( 0 0 1 1 0 0 0 )
D ← D | ( [ F - (I & D) ] & ¬F )
( 0 0 1 1 0 0 0 ) 1 1 1 1