text indexing and dictionary matching with one error
DESCRIPTION
Text Indexing and Dictionary Matching with One Error. Amir, A., KeseIman, D., Landau, G., M. and etc, Journal of Algorithm, 37, 2000, pp. 309-325. Adviser: R. C. T. Lee Speaker: C. W. Cheng. Problem Definition. The Indexing Problem : - PowerPoint PPT PresentationTRANSCRIPT
1
Text Indexing and Dictionary Matching with One ErrorAmir, A., KeseIman, D., Landau, G., M. and etc, Journal of Algorithm, 37, 2000,
pp. 309-325
Adviser: R. C. T. LeeSpeaker: C. W. Cheng
2
Problem Definition
• The Indexing Problem :– Input : A Text T of length n over alphabet Σ,
a pattern P of length m over alphabet Σ and an integer k.
– Output : All occurrences of P in T with at most k mismatches.
3
Main idea
• In this algorithm, we construct suffix tree and prefix tree with text T. We set an integer j, j=1,2…m. Then we find the prefix P1,j-1 in prefix tree and the suffix Pj+1,m in suffix tree. If both of them exist, an approximation string matching with one error occurs.
4
Processing
• 1.Construct a suffix tree ST of the text string T and suffix tree ST
R of the string TR is the reversed text TR = tn … t1.
5
Ex : T=AGCAGAT
TR=TAGACGA
6
g
a
t$
tagacga$
g
at
$
cagat$
t$
cagat$
at$
cagat$
a g
ga$
acga$
cga$
$cga$
gacga$
Ex : T=AGCAGAT
TR=TAGACGA
7
Processing
• 2. For each of the suffix trees, link all leaves of the suffix tree in a left-to-right order.
8
g
a
t$
tagacga$
g
at
$
cagat$
t$
cagat$
at$
cagat$
a g
ga$
acga$
cga$
$cga$
gacga$
Ex : T=AGCAGAT
TR=TAGACGA
9
Processing
• 3. For each of the suffix trees, set pointers from each tree node v to its left most leaf vl and rightmost leave vr in the linked list.
10
g
a
t$
tagacga$
g
at
$
cagat$
t$
cagat$
at$
cagat$
a g
ga$
acga$
cga$
$cga$
gacga$
Ex : T=AGCAGAT
TR=TAGACGA
11
Processing
• 4. Designate each leaf in ST by the starting location of its suffix. Designate each leaf in ST
R by n – i + 3, where i is the starting position of the leaf’s suffix in TR.
12
4 1 6 3 5 2 7 3 6 8 5 4 7 9
g
a
t$
tagacga$
g
at
$
cagat$
t$
cagat$
at$
cagat$
a g
ga$
acga$
cga$
$cga$
gacga$
Ex : T=AGCAGAT
TR=TAGACGA
13
Query Processing
• For j = 1, …., m do– 1. Find node v, the location of Pj+1 … Pm in ST, if such a
node exists.– 2. Find node w, the location of Pj-1 .. P1 in ST
R, if such a node exist.
– 3. If v and w exist, the values of leaves under v and w are V[vl….vr] and W[wl…wr], to find the intersections I of V[vl….vr] and W[wl…wr]. If the intersections exist, the approximate string matching occurs on Ti-3…Ti-3+m, for all i I.
14
ExampleEx : T=actgacctcagctta P=ctga k=1
1515 5
a
c
g tc
1 10 9 6 7 2 12 4 11 14 8 3 13
tgacctcagctta$
ctcagctta$
$gctta$
ag
ct
ta
$
ctcagctta$
cagctta$
gacctcagctta$
ta$
a$
acctcagctta$
ctta$
cagctta$
gacctcagctta$
ta$
Ex : T=actgacctcagctta TR=attcgactccagtca P=ctaa
Suffix Tree of T
163 12
a g tc
7 17 4 8 9 14 11 13 6 5 10 15 14
ttcgactccagtca$
a$
ctccagtca$
gtca$
a
$gtca$
cagtca$
gactccagtca$
tccagtca$
actccagtca$
tca$
c
a$
cagtca$
gactccagtca$
tcgactccagtca$
Ex : T=actgacctcagctta TR=attcgactccagtca P=ctaa
Suffix Tree of TR
1715 5
a
c
g tc
1 10 9 6 7 2 12 4 11 14 8 3 13
tgacctcagctta$
ctcagctta$
$gctta$
ag
ct
ta
$
ctcagctta$
cagctta$
gacctcagctta$
ta$
a$
acctcagctta$
ctta$
cagctta$
gacctcagctta$
ta$
Ex : T=actgacctcagctta TR=attcgactccagtca P=ctaa
Suffix Tree of T
j=1v=Pj+1…Pm=taaw=Pj-1…P1=ε
V[vl….vr]={ε}
183 12
a g tc
7 17 4 8 9 14 11 13 6 5 10 15 14
ttcgactccagtca$
a$
ctccagtca$
gtca$
a
$gtca$
cagtca$
gactccagtca$
tccagtca$
actccagtca$
tca$
c
a$
cagtca$
gactccagtca$
tcgactccagtca$
Suffix Tree of TREx : T=actgacctcagctta TR=attcgactccagtca P=ctaa
j=1v=Pj+1…Pm=taaw=Pj-1…P1=ε
V[vl….vr]={ε}W[vl….vr]={3,12,…,14}
I={ε}
1915 5
a
c
g tc
1 10 9 6 7 2 12 4 11 14 8 3 13
tgacctcagctta$
ctcagctta$
$gctta$
ag
ct
ta
$
ctcagctta$
cagctta$
gacctcagctta$
ta$
a$
acctcagctta$
ctta$
cagctta$
gacctcagctta$
ta$
Suffix Tree of TEx : T=actgacctcagctta TR=attcgactccagtca P=ctaa
j=2v=Pj+1…Pm=aaw=Pj-1…P1=c
V[vl….vr]={ε}
203 12
a g tc
7 17 4 8 9 14 11 13 6 5 10 15 14
ttcgactccagtca$
a$
ctccagtca$
gtca$
a
$gtca$
cagtca$
gactccagtca$
tccagtca$
actccagtca$
tca$
c
a$
cagtca$
gactccagtca$
tcgactccagtca$
Suffix Tree of TREx : T=actgacctcagctta TR=attcgactccagtca P=ctaa
j=2v=Pj+1…Pm=aaw=Pj-1…P1=c
V[vl….vr]={ε}W[vl….vr]={4,8,9,14,11}
I={ε}
2115 5
a
c
g tc
1 10 9 6 7 2 12 4 11 14 8 3 13
tgacctcagctta$
ctcagctta$
$gctta$
ag
ct
ta
$
ctcagctta$
cagctta$
gacctcagctta$
ta$
a$
acctcagctta$
ctta$
cagctta$
gacctcagctta$
ta$
Suffix Tree of TEx : T=actgacctcagctta TR=attcgactccagtca P=ctaa
j=3v=Pj+1…Pm=aw=Pj-1…P1=tc
V[vl….vr]={15,5,1,10}
223 12
a g tc
7 17 4 8 9 14 11 13 6 5 10 15 14
ttcgactccagtca$
a$
ctccagtca$
gtca$
a
$gtca$
cagtca$
gactccagtca$
tccagtca$
actccagtca$
tca$
c
a$
cagtca$
gactccagtca$
tcgactccagtca$
Suffix Tree of TREx : T=actgacctcagctta TR=attcgactccagtca P=ctaa
j=3v=Pj+1…Pm=aw=Pj-1…P1=tc
V[vl….vr]={15,5,1,10}W[vl….vr]={5,10,15}
I={15,5,10}
23
When j=3, the intersection of V[15,5,1,10] and W[5,10,15] is I={5,10,15}. Therefore approximate string matching occurs on Ti-j…Ti-j+m, for all i I.
T2…T6 ,T7…T11 ,T12…T15 。
T=actgacctcagcttaP=ctaa
2415 5
a
c
g tc
1 10 9 6 7 2 12 4 11 14 8 3 13
tgacctcagctta$
ctcagctta$
$gctta$
ag
ct
ta
$
ctcagctta$
cagctta$
gacctcagctta$
ta$
a$
acctcagctta$
ctta$
cagctta$
gacctcagctta$
ta$
Suffix Tree of TEx : T=actgacctcagctta TR=attcgactccagtca P=ctaa
j=4v=Pj+1…Pm=εw=Pj-1…P1=atc
V[vl….vr]={15,5,…,13}
253 12
a g tc
7 17 4 8 9 14 11 13 6 5 10 15 14
ttcgactccagtca$
a$
ctccagtca$
gtca$
a
$gtca$
cagtca$
gactccagtca$
tccagtca$
actccagtca$
tca$
c
a$
cagtca$
gactccagtca$
tcgactccagtca$
Suffix Tree of TREx : T=actgacctcagctta TR=attcgactccagtca P=ctaa
j=3v=Pj+1…Pm=εw=Pj-1…P1=atc
V[vl….vr]={15,5,…,13}W[vl….vr]={ε}
I={ε}
26
Range Query Problem
• In step 3, given nodes v and w, we want to find the leaves that appear both in interval [vl … vr] and in the interval [wl … wr], where the four end points of the two intervals are defined in step P.3 of the preprocessing. Thus, we are seeking a solution to the range query problem.
27
Problem Definition of Range Query
• Input : Let V=[v1,v2 … vn] and W=[w1,w2 … wn] be two permutation arrays, where n is the number of elements. Four constants i,j,k and l, where both i+k < n and j+l < n.
• Output : Find the intersection of elements of V[i … i+k] and W[j … j+l].
28
Example :V=[8,5,1,4,3,7,6,2]W=[3,6,4,7,2,1,5,8]i=3,k=4j=2,l=5
Output: the intersection of V[v3,v4,v5,v6] and W[w2,w3,w4,w5,w6]
29
Preprocessing
V=
W=
1 2 3 4 5 6 7 8
2 3 4 5 6 7 8
8 5 1 4 3 7 6 2
3 6 4 7 2 1 5 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
30
Preprocessing
V=
W=
1 2 3 4 5 6 7 8
2 3 4 5 6 7 8
8 5 1 4 3 7 6 2
3 6 4 7 2 1 5 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
8
5
1
4
3
6
7
2
31
Preprocessing
V=
W=
1 2 3 4 5 6 7 8
2 3 4 5 6 7 8
8 5 1 4 3 7 6 2
3 6 4 7 2 1 5 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
8
5
1
4
3
7
6
2
The intersection of V[v3,v4,v5,v6] and W[w2,w3,w4,w5,w6] is {1,4,7}.
32
Time Complexity of Range Query Problem
• By using Overmars’ algorithm, the range query problem can be solved with preprocessing time and , where k is the number of points in the range.
[O88] Overmars, M. H., Efficient data structures for range searching on a grid, J. Algorithms 9, 1988,pp. 254-275.
)log( nkO )log( nnO
33
Time Complexity
• For the indexing problem, the preprocessing time is and the query can be implemented in , where tocc is the number of occurrences of the pattern in the text with one error.
)log( nnO)log( nmtoccO
34
The Dictionary Matching Problem
35
Problem Definition
• The Dictionary Matching Problem– Input :
• 1. A dictionary P = {p1,…., ps}, where pi, i = 1,…., s, are patterns over alphabet Σ, and is the sum of the lengths of all the dictionary patterns.
• 2. A Text T of length n over alphabet Σ.• 3. An integer k.
– Output :• All occurrences of any dictionary patterns in T with a
t most k mismatches.
isi pd 1
36
Main idea
• In this algorithm, we construct suffix tree and prefix tree with D which is concatenation of all patterns in dictionary. We set an integer j, j=1,2…n. Then we find the prefix T1,j-1 in prefix tree and the suffix Tj+1,m in suffix tree. If both of them exist, an approximation string matching with one error occurs.
37
Processing
• 1. Construct a suffix tree SD of string D and suffix tree SD
R of the string DR, where D is the concatenation of all dictionary patterns, with a separator at the end of each pattern, and where DR is the reversal of string D.
38
Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$
39
a$ g tc
gc
tga$gca$
a$
gctga$gca$
a$
tga$gca$
a$gca$
a$
c
tga$gca$
ca$gctga$gca$
ga$gca$
Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$
Suffix Tree of D (SD)
40
Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$
a g tc
g$a
gt
cg
$a
ct
$
cg
$$
t$
gtcg$act$
agtcg$act$
act$
act$
agtcg$act$
tcg$act$
cg$act$
$t$
Suffix Tree of DR (SDR)
41
Processing
• 2. Modify suffix tree SD, and SDR respectivel
y, as follows. For each separator which is treefirst but not edgefirst, i.e., it appears on an edge (u,v) labeled σ$σ”, where σ≠ε, break (u,v) into (u,w) and (w,v). Label (u,v) with σ and (w,v) with $σ’.
42
a$ g tc
a$
tga$
a$
a$
c
tga$
ca$
ga$
Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$
Suffix Tree of D (SD)
43
Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$
a g tc
g$
cg
$$
t$
gtcg$
tcg$
cg$$t
$
Suffix Tree of DR (SDR)
44
Preprocessing
• 3. Scan suffix tree SD, respectively SDR, and modi
fy as follows. For each vertex v consider the associated string L(v), i.e., the string from the root to v. Label v with all the locations of the pattern suffixes, resp. prefixes, that are equal to L(v). To implement this note that all the relevant suffixes share a prefix of L(v)$. So, go to edge (v,w) with label beginning with $, assuming such exists, and scan the subtree rooted at w to find all relevant suffixes.
4511
a$ g tc
8 3 10 2 5 7 9 4 1 6
a$
tga$
a$
a$
c
tga$
ca$
ga$
Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$
Suffix Tree of D (SD)
46
Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$
13
a g tc
5 10 7 12 4 6 11 9 3 8
g$
cg
$$
t$
gtcg$
tcg$
cg$$
t$
Suffix Tree of DR (SDR)
47
Query Processing
• For j = 1,…., n do– 1. Find node v, the location of the longest pref
ix of tj+1 … tn in SD.– 2. Find node w, the location of the longest pre
fix of tj-1 … t1 in SDR.
– 3. Find intersection of markings of nodes on the path from the root to v in SD and on the path from the root to w in SD
R.
48
Example
T=acagccgaD={tca,gctga,gca}K=1
4911
a$ g tc
8 3 10 2 5 7 9 4 1 6
a$
tga$
a$
a$
c
tga$
ca$
ga$
Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
50
13
a g tc
5 10 7 12 4 6 11 9 3 8
g$
cg
$$
t$
gtcg$
tcg$
cg$$
t$
Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
5111
a$ g tc
8 3 10 2 5 7 9 4 1 6
a$
tga$
a$
a$
c
tga$
ca$
ga$
Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=1v=Tj+1…Tm=cagccgaw=Tj-1…T1=ε
V={10,2}
52
13
a g tc
5 10 7 12 4 6 11 9 3 8
g$
cg
$$
t$
gtcg$
tcg$
cg$$t
$
Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=1v=Tj+1…Tm=cagccgaw=Tj-1…T1=ε
V={10,2}W={13,5,…, 8}I={10,2}
53
When j=1, the intersection of V[10,2] and W[13,5,…,8] is I={10,2}. Therefore approximate string matching occurs on T with P…P.
T=acagccgaP={tca,gctga,gca}
5411
a$ g tc
8 3 10 2 5 7 9 4 1 6
a$
tga$
a$
a$
c
tga$
ca$
ga$
Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=2v=Tj+1…Tm=agccgaw=Tj-1…T1=a
V={11,8,3}
55
13
a g tc
5 10 7 12 4 6 11 9 3 8
g$
cg
$$
t$
gtcg$
tcg$
cg$$
t$
Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=2v=Tj+1…Tm=agccgaw=Tj-1…T1=a
V={11,8,3}W={ε}I={ε}
5611
a$ g tc
8 3 10 2 5 7 9 4 1 6
a$
tga$
a$
a$
c
tga$
ca$
ga$
Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=3v=Tj+1…Tm=gccgaw=Tj-1…T1=ca
V={ε}
57
13
a g tc
5 10 7 12 4 6 11 9 3 8
g$
cg
$$
t$
gtcg$
tcg$
cg$$
t$
Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=3v=Tj+1…Tm=gccgaw=Tj-1…T1=ca
V={ε}W={ε}I={ε}
5811
a$ g tc
8 3 10 2 5 7 9 4 1 6
a$
tga$
a$
a$
c
tga$
ca$
ga$
Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=4v=Tj+1…Tm=ccgaw=Tj-1…T1=aca
V={ε}
59
13
a g tc
5 10 7 12 4 6 11 9 3 8
g$
cg
$$
t$
gtcg$
tcg$
cg$$
t$
Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=4v=Tj+1…Tm=ccgaw=Tj-1…T1=aca
V={ε}W={ε}I={ε}
6011
a$ g tc
8 3 10 2 5 7 9 4 1 6
a$
tga$
a$
a$
c
tga$
ca$
ga$
Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=5v=Tj+1…Tm=cgaw=Tj-1…T1=gaca
V={ε}
61
13
a g tc
5 10 7 12 4 6 11 9 3 8
g$
cg
$$
t$
gtcg$
tcg$
cg$$
t$
Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=5v=Tj+1…Tm=cgaw=Tj-1…T1=gaca
V={ε}W={6,11}I={ε}
6211
a$ g tc
8 3 10 2 5 7 9 4 1 6
a$
tga$
a$
a$
c
tga$
ca$
ga$
Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=6v=Tj+1…Tm=gaw=Tj-1…T1=cgaca
V={7}
63
13
a g tc
5 10 7 12 4 6 11 9 3 8
g$
cg
$$
t$
gtcg$
tcg$
cg$$
t$
Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=6v=Tj+1…Tm=gaw=Tj-1…T1=cgaca
V={7}W={7,12}I={7}
64
When j=6, the intersection of V[7] and W[7,12] is I={7}. Therefore approximate string matching occurs on T with P.
T=acagccgaP={tca,gctga,gca}
6511
a$ g tc
8 3 10 2 5 7 9 4 1 6
a$
tga$
a$
a$
c
tga$
ca$
ga$
Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=7v=Tj+1…Tm=εw=Tj-1…T1=ccgaca
V={11,8,…,6}
66
13
a g tc
5 10 7 12 4 6 11 9 3 8
g$
cg
$$
t$
gtcg$
tcg$
cg$$
t$
Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=1v=Tj+1…Tm=εw=Tj-1…T1=ccgaca
V={11,8,…,6}W={ε}I={ε}
67
Time Complexity
• For the indexing problem, the preprocessing time is and the query can be implemented in , where tocc is the number of occurrences of the dictionary in the text with one error.
)log( ddO)log( dntoccO
68
References[AC75] Efficient string matching, A. V. Aho and M. J. Corasick, Comm. Assoc. Comput. Mach. 18, No.
6, 1975, pp.333-340.[AF91] Adaptive dictionary matching, A. Amir and M. Farach, Proc. 32nd IEEE FOCS, 1991, pp.760-7
66.[AFI95] A. Amir, M. Farach, R. M. Idury and etc, Improved dynamic dictionary matching , Inform. And
Comput. 119, 1995, pp.258-282.[AKL99] A. Amir, D. Keselman, G. M. Landau and etc, Indexing and dictionary matching with one erro
r, Proc. 1999 Workshop on Algorithms and Data Structures (WADS), 1999, pp.181-192.[AG97] Pattern Matching Algorithms, Oxford University Press, 1997.[BM77] A fast string searching algorithm, Comm. Assoc. Comput. Mach. 20, 1977, pp.762-772.[BG96] G. S. Brodal and L. Gasieniec, Approximate Dictionary queries, Proc. 7th Annual Symposium
on Combinatorial Pattern Matching (CPM96), 1996, pp.65-74.[O88] M. H. Overmars, Efficient data structures for range searching on a grid, J. Algorithms 9, 1988,p
p. 254-275.
69
Thanks for your listening