Dictionary Matching and Indexing with Edits and Don’t Cares
Richard Cole, NYU
Lee-Ad Gottlieb, NYU
Moshe Lewenstein, Bar-Ilan
Pattern Matching
Various problems of the following flavor:
Preprocess a text t, or a collection of strings d1,…,dx,
so that given a query string p, all matches with the text can be found quickly.
Indexing
Dictionary queries
Dictionary matching
All-to-all matching
Pattern Matching
Dictionary queries.
Dictionary: Bate Beat Boat Boot
Query: Beta
Pattern Matching
Dictionary matching.
Dictionary: Bate Beat Boat Boot
Text: The fish beat my boot.
Pattern Matching
Text indexing.
Text: abracadabra
Query: ra (two occurrences)
Pattern Matching
All-to-all matching.
Bate Beat Boat Boot
bat boots be
Previous Work
Dictionary Queries
[Figure: trie over the dictionary Bate, Beat, Boat, Boot, with the query Beta]
Suffix Tree
Text Indexing
[Figure: suffix tree for a text over the letters g and o; repeated animation frames omitted]
Approximate Matches
Wildcards (don’t cares): Boat, Bo*t
Substitutions: Boat, Boot
Edits (insertions and deletions): Boat, B_at
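The three kinds of approximate match can be made concrete with a minimal Python sketch. The function names are illustrative and not from the talk, and the edit distance here also counts substitutions, which is an assumption about the intended definition.

```python
def matches_with_wildcards(pattern, word, wildcard="*"):
    """True if word matches pattern; each wildcard stands for exactly one character."""
    return len(pattern) == len(word) and all(
        p == wildcard or p == w for p, w in zip(pattern, word)
    )


def hamming_distance(a, b):
    """Number of substitutions turning a into b, or None if the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # keep or substitute
        prev = curr
    return prev[-1]
```

With the slide's examples, `matches_with_wildcards("Bo*t", "Boat")` holds, `hamming_distance("Boat", "Boot")` is one substitution, and `edit_distance("Boat", "Bat")` is one deletion.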
Previous Work – Best Results
Indexing and Dictionary Matching (edits): Buchsbaum, Goodrich, Westbrook.
k=1: p log log n + occ query time, n log n space
Dictionary Queries (substitutions): Brodal, Gasieniec.
k=1: p + occ query time, n space
Previous Work – Basic Intuition
Build a suffix tree for abracadabra (prefixes shown):
abracadab abracada abracad abraca abrac abra abr ab a
And for its reverse, arbadacarba (prefixes shown):
a ar arb arba arbad arbada arbadac arbadaca arbadacar
Query: abrac*dabra
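One way to read this intuition: split the query at the uncertain position, match the part before it and the part after it independently, and keep the positions where both agree. The sketch below assumes that reading. It uses a naive scan where the earlier work would use suffix tree lookups, and the function names are illustrative.

```python
def occurrences(text, piece):
    """Start positions of piece in text (naive scan standing in for a suffix tree)."""
    return {i for i in range(len(text) - len(piece) + 1)
            if text.startswith(piece, i)}


def match_one_wildcard(text, pattern, wildcard="*"):
    """Start positions where pattern, containing exactly one wildcard, matches text."""
    i = pattern.index(wildcard)
    left, right = pattern[:i], pattern[i + 1:]
    left_starts = occurrences(text, left)
    right_starts = occurrences(text, right)
    # A match at s needs `left` at s and `right` one character past the wildcard.
    return sorted(s for s in left_starts
                  if s + i + 1 in right_starts
                  and s + len(pattern) <= len(text))
```

For the slide's query, `match_one_wildcard("abracadabra", "abrac*dabra")` finds the single match at position 0.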
New Results
Indexing, Dictionary Queries, Dictionary Matches
Substitutions (k < log n):
query time: p + [(c1 log n)^k log log n] / k! + occ
space: n (c2 log n)^k / k!
Edits (k < log n):
query time: p + [(c3 log n)^k log log n] / k! + 3^k occ
space: n (c4 log n)^k / k!
Wildcards in pattern (k < log n):
query time: p + 2^k log log n / k! + occ
space: n + (k + log n)^k / k!
Dictionary Wildcard Queries
Three data structures for dictionary wildcard queries:
Data structure        Space            Query time
Naïve                 O(n)             k p
Less-naïve            O(n^(1+k))       p
New data structure    O(n log^k n)     2^k p
Naïve Approach
[Figure: trie over the dictionary, searched with the query string *it; repeated animation frames omitted]
Query string: *it
Query time: k p
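A minimal sketch of the naïve search, assuming a plain trie in which a wildcard makes the search branch into every child. The word list is hypothetical (the dictionary in the figure is not recoverable), and `TrieNode`, `build_trie` and `naive_query` are illustrative names.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.word = None    # set when a dictionary word ends at this node


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root


def naive_query(node, pattern, wildcard="*"):
    """Return all dictionary words matching pattern, branching at each wildcard."""
    if not pattern:
        return [node.word] if node.word is not None else []
    head, rest = pattern[0], pattern[1:]
    if head == wildcard:
        matches = []
        for child in node.children.values():  # try every child
            matches.extend(naive_query(child, rest, wildcard))
        return matches
    child = node.children.get(head)
    return naive_query(child, rest, wildcard) if child is not None else []
```

The structure uses no extra space beyond the trie; the cost is paid at query time, since every wildcard can multiply the number of paths followed.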
Less-Naïve Approach
[Figure: the trie with wildcard (*) subtrees added at its nodes, searched with the query string *it; repeated animation frames omitted]
Query string: *it
Query time: p
Space: O(n^(1+k))
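A minimal sketch of the less-naïve idea, assuming each node gets a `*` subtree that merges the subtrees of all of its children, nested k levels deep. A query then follows a single path, at the price of many copies of each word. The word list and the names `build_less_naive`, `query_less_naive` and `size` are illustrative.

```python
def build_less_naive(entries, k):
    """entries: list of (remaining_suffix, word); k: wildcards still supported."""
    node = {"words": [w for s, w in entries if s == ""],
            "children": {}, "star": None}
    groups = {}
    for s, w in entries:
        if s:
            groups.setdefault(s[0], []).append((s[1:], w))
    for ch, group in groups.items():
        node["children"][ch] = build_less_naive(group, k)
    if k > 0 and groups:
        # The wildcard subtree merges the subtrees of ALL children.
        merged = [e for group in groups.values() for e in group]
        node["star"] = build_less_naive(merged, k - 1)
    return node


def query_less_naive(node, pattern, wildcard="*"):
    """Follow exactly one path; correct for patterns with at most k wildcards."""
    for ch in pattern:
        node = node["star"] if ch == wildcard else node["children"].get(ch)
        if node is None:
            return []
    return node["words"]


def size(node):
    """Number of nodes, including those inside wildcard subtrees."""
    total = 1 + sum(size(c) for c in node["children"].values())
    if node["star"] is not None:
        total += size(node["star"])
    return total
```

Comparing `size` for k = 0 and k = 1 on the same word list shows the space cost of buying the single-path query.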
New Approach
[Figure: the trie with a wildcard (*) subtree that leaves out the heaviest child, searched with the query string *it; repeated animation frames omitted]
Query string: *it
Query time: 2^k p
Space Analysis
Create a wildcard subtree at each node in the original trie. The heaviest child is not in the wildcard tree.
Look at any leaf of the trie. How many of its ancestors were not the heaviest child?
At most log2 n. So it appears in at most log n wildcard trees.
Space: n log n for one wildcard, n log^k n for k wildcards.
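A minimal sketch of this construction, assuming a child's weight is the number of dictionary words below it. Each wildcard subtree merges only the non-heaviest children, so a wildcard in the query follows two paths (the heaviest child and the wildcard subtree), which is where the 2^k factor comes from. The word list and the names `build_heavy_light` and `query_heavy_light` are illustrative.

```python
def build_heavy_light(entries, k):
    """entries: list of (remaining_suffix, word); k: wildcards still supported."""
    node = {"words": [w for s, w in entries if s == ""],
            "children": {}, "heavy": None, "star": None}
    groups = {}
    for s, w in entries:
        if s:
            groups.setdefault(s[0], []).append((s[1:], w))
    for ch, group in groups.items():
        node["children"][ch] = build_heavy_light(group, k)
    if k > 0 and groups:
        heavy = max(groups, key=lambda ch: len(groups[ch]))
        node["heavy"] = heavy
        # The wildcard subtree merges every child EXCEPT the heaviest one.
        light = [e for ch, group in groups.items() if ch != heavy for e in group]
        if light:
            node["star"] = build_heavy_light(light, k - 1)
    return node


def query_heavy_light(node, pattern, wildcard="*"):
    """Correct for patterns with at most k wildcards (k as used when building)."""
    if node is None:
        return []
    if not pattern:
        return list(node["words"])
    head, rest = pattern[0], pattern[1:]
    if head != wildcard:
        return query_heavy_light(node["children"].get(head), rest, wildcard)
    matches = []
    if node["heavy"] is not None:  # branch 1: the heaviest child, in the trie itself
        matches += query_heavy_light(node["children"][node["heavy"]], rest, wildcard)
    matches += query_heavy_light(node["star"], rest, wildcard)  # branch 2
    return matches
```

Because a word is copied into a wildcard subtree only at ancestors where it is not under the heaviest child, each word is copied at most about log n times per wildcard level.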
Edit Distance
Wildcards are (algorithmically) the simplest type of approximate search.
What issues come up when dealing with substitutions, insertions and deletions?
Substitution Search
[Figure: trie over strings of a's and b's, searched with the query string aab; repeated animation frames omitted]
Query string: aab
Substitution Tree
[Figure: the trie over strings of a's and b's with a substitution tree attached; repeated animation frames omitted]
Query string: aab
Deletion Tree
[Figure: trie over the letters a, b, c with a deletion tree attached; repeated animation frames omitted]
Insertion Tree
[Figure: trie over the letters a, b, c with an insertion tree attached; repeated animation frames omitted]
Grouping
[Figure: the trie over strings of a's and b's with its substitution trees being grouped; repeated animation frames omitted]
Query string: aab
Analysis
Can’t merge along all possible paths of original trie – too expensive.
Merge along centroid paths. Centroid paths always follow the heaviest child.
Any path from root to leaf traverses at most log n centroid paths.
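The log n bound can be checked on a small example. The sketch below assumes a node's weight is the number of dictionary words below it. It counts, for every word, the edges on its root path that do not lead to a heaviest child. Each such edge starts a new centroid path and at least halves the weight, so the count cannot exceed log2 n. The function name and the word list are illustrative.

```python
import math


def light_edge_counts(words):
    """Map each word to the number of non-heaviest-child edges on its root path."""
    counts = {}

    def walk(entries, light):
        groups = {}
        for suffix, word in entries:
            if suffix:
                groups.setdefault(suffix[0], []).append((suffix[1:], word))
            else:
                counts[word] = light  # the word ends at this node
        if not groups:
            return
        # The centroid path continues into the heaviest child.
        heavy = max(groups, key=lambda ch: len(groups[ch]))
        for ch, group in groups.items():
            walk(group, light + (ch != heavy))

    walk([(w, w) for w in sorted(set(words))], 0)
    return counts


words = ["bate", "beat", "boat", "boot", "beta", "bat", "boots", "be"]
counts = light_edge_counts(words)
assert max(counts.values()) <= math.log2(len(counts))
```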
Grouping
Suppose a search reached up to the 7th edge with no substitutions.
…then we search only three substitution trees.
Space increase: log n factor
Analysis
[Figure: root-to-leaf path with weights w1, w2, w3, w4 marked along it]
log n searches on each centroid path traversed
Total number of searches: log n * log n = log^2 n
Analysis
For k=1:
For each centroid path traversed, log n substitution subtree searches.
A path to a leaf traverses at most log n centroid paths.
log^2 n searches, reduced to log n searches using balanced grouping.
More generally: log^k n searches.
Using a Y-fast trie, each search takes log log n time.
Total: log^k n log log n
More Rigorous Analysis
Balanced Search Tree
More Rigorous Analysis
Weight Balanced Search Tree
O(log(W/w)) levels
More Rigorous Analysis
For a segment of a centroid path whose top has weight W and bottom has weight w, we do about log(W/w) searches.
Analysis
[Figure: root-to-leaf path with weights w1, w2, w3, w4 marked along it]
log(w1/w2) searches
log(w2/w3) searches
log(w3/w4) searches
Total number of searches: log(w1/w2) + log(w2/w3) + log(w3/w4) = log(w1/w4)
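The total is a telescoping sum. Written out, and assuming w1 is at most n and w4 is at least 1 (bounds the slide does not state explicitly), it stays within log n:

```latex
\[
\sum_{i=1}^{3} \log\frac{w_i}{w_{i+1}}
  = \log\!\left(\frac{w_1}{w_2}\cdot\frac{w_2}{w_3}\cdot\frac{w_3}{w_4}\right)
  = \log\frac{w_1}{w_4}
  \le \log n .
\]
```

This is why the weight-balanced grouping gives log n searches for k=1 instead of the log^2 n of the simpler count.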
More Rigorous Analysis
Time for one match: log^k n log log n / k!
Space: n (c log n)^k / k! for some constant c
Open Problem
Dynamic search structure. Requires a less strict notion of “centroid path”?