dictionary matching and indexing with edits and don’t cares

Dictionary Matching and Indexing with Edits and Don’t Cares

Richard Cole, NYU

Lee-Ad Gottlieb, NYU

Moshe Lewenstein, Bar-Ilan

Pattern Matching

Various problems of the following flavor:

Preprocess a text t, or a collection of strings d1, …, dx, so that given a query string p, all matches with the text can be found quickly.

Indexing

Dictionary queries

Dictionary matching

All-to-all matching

Pattern Matching

Dictionary queries.

Dictionary: Bate, Beat, Boat, Boot

Query: Beta

Pattern Matching

Dictionary matching.

Dictionary: Bate, Beat, Boat, Boot

Text: The fish beat my boot.

Pattern Matching

Text indexing.

Text: abracadabra

Query: ra

Pattern Matching

All-to-all matching.

Bate Beat Boat Boot

bat boots be

Previous Work

Dictionary Queries: [Figure: trie over the dictionary Bate, Beat, Boat, Boot; query Beta]

Text Indexing: [Figure: suffix tree over the text Oogogoogogogoggogogg; query oogo]

Approximate Matches

Wildcards (don’t cares): Boat, Bo*t

Substitutions: Boat, Boot

Edits – insertions and deletions: Boat, B_at
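The three notions of approximate match can be made concrete with a small sketch. The function names are illustrative and not from the talk, and the edit distance used here is the standard unit-cost Levenshtein distance, which is an assumption about what "edits" means on the slide.

```python
def wildcard_match(pattern, word, wildcard="*"):
    """True if word matches pattern; each wildcard stands for exactly one character."""
    return len(pattern) == len(word) and all(
        p == wildcard or p == w for p, w in zip(pattern, word)
    )


def substitution_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("substitution distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Unit-cost Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or keep
        prev = cur
    return prev[-1]


print(wildcard_match("Bo*t", "Boat"))         # True
print(substitution_distance("Boat", "Boot"))  # 1
print(edit_distance("Boat", "Bat"))           # 1
```

These are per-pair checks only; the data structures in the rest of the talk answer the same questions against a whole preprocessed dictionary or text.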

Previous Work – Best Results

Indexing and Dictionary Matching (edits): Buchsbaum, Goodrich, Westbrook.

k = 1: p log log n + occ query time, n log n space

Dictionary Queries (substitutions): Brodal, Gasieniec.

k = 1: p + occ query time, n space

Previous Work – Basic Intuition

Build a suffix tree for abracadabra:

abracadab abracada abracad abraca abrac abra abr ab a

And for abracadabra:

a ar arb arba arbad arbada arbadac arbadaca arbadacar

Query: abrac*dabra
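One common reading of this picture, assumed here rather than stated on the slide, is that a pattern with a single don't care is split at that position: the part before it is matched using one tree, the part after it using the other, and the two answers are intersected. The sketch below replaces both suffix trees with a plain scan (`occurrences` is a hypothetical stand-in), so it shows only the split-and-intersect idea, not the running time.

```python
def occurrences(text, piece):
    """Start positions of piece in text (stand-in for a suffix tree lookup)."""
    return {i for i in range(len(text) - len(piece) + 1)
            if text.startswith(piece, i)}


def match_one_wildcard(text, pattern, wildcard="*"):
    """Start positions where a pattern with exactly one wildcard matches text."""
    i = pattern.index(wildcard)
    left, right = pattern[:i], pattern[i + 1:]
    left_starts = occurrences(text, left)    # would come from one tree
    right_starts = occurrences(text, right)  # would come from the other
    # A match starts at s when left starts at s and right starts
    # just past the wildcard, at s + i + 1.
    return sorted(s for s in left_starts if s + i + 1 in right_starts)


print(match_one_wildcard("abracadabra", "abrac*dabra"))  # [0]
print(match_one_wildcard("abracadabra", "a*ra"))         # [0, 7]
```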

New Results

Indexing, Dictionary Queries, Dictionary Matches

Substitutions, k < log n:
query time p + [(c1 log n)^k log log n] / k! + occ
space n (c2 log n)^k / k!

Edits, k < log n:
query time p + [(c3 log n)^k log log n] / k! + 3^k occ
space n (c4 log n)^k / k!

Wildcards in pattern, k < log n:
query time p + 2^k log log n / k! + occ
space n + (k + log n)^k / k!

Dictionary Wildcard Queries

Three data structures for dictionary wildcard queries

Naïve: O(n) space, kp query time

Less-naïve: O(n^(1+k)) space, p query time

New data structure: O(n log^k n) space, 2^k p query time

Naïve Approach

[Figure: trie over the dictionary words]

Query string: *it

Query time: k p
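A minimal sketch of the naïve approach, assuming a plain dict-of-dicts trie. The word list is hypothetical (the dictionary in the figure is not recoverable from the transcript), and `build_trie` and `naive_query` are illustrative names. The trie itself takes linear space; the cost is paid at query time, because every wildcard forces the search into all children of the current node.

```python
END = "$"  # marks the end of a dictionary word; assumed absent from the words


def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = w
    return root


def naive_query(node, pattern, i=0):
    """Yield dictionary words matching pattern; '*' matches any one character."""
    if i == len(pattern):
        if END in node:
            yield node[END]
        return
    ch = pattern[i]
    if ch == "*":
        # No extra structure: try every child in turn.
        for c, child in node.items():
            if c != END:
                yield from naive_query(child, pattern, i + 1)
    elif ch in node:
        yield from naive_query(node[ch], pattern, i + 1)


WORDS = ["fat", "far", "fit", "pit", "pin", "pay", "sit"]  # hypothetical
trie = build_trie(WORDS)
print(sorted(naive_query(trie, "*it")))  # ['fit', 'pit', 'sit']
```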

Less-Naïve Approach

[Figure: the dictionary trie with a wildcard (*) subtree added at each node; the query follows the * edge]

Query string: *it

Query time: p

Space: O(n^(1+k))
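The less-naïve idea can be sketched by giving every trie node a `*` child that holds the merge of all its children's subtrees, repeated for up to k wildcards. The word list and all function names are hypothetical. A query then walks a single path, treating `*` as an ordinary character, at the price of a much larger structure.

```python
END = "$"  # key holding the set of words that end at a node


def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node.setdefault(END, set()).add(w)
    return root


def merge(tries):
    """Merge several tries into one new trie."""
    words, groups = set(), {}
    for t in tries:
        for ch, sub in t.items():
            if ch == END:
                words |= sub
            else:
                groups.setdefault(ch, []).append(sub)
    out = {ch: merge(subs) for ch, subs in groups.items()}
    if words:
        out[END] = words
    return out


def add_wildcards(node, k):
    """Copy of node in which every node gets a '*' child, for up to k wildcards."""
    out = {ch: (set(sub) if ch == END else add_wildcards(sub, k))
           for ch, sub in node.items()}
    children = [sub for ch, sub in node.items() if ch != END]
    if k > 0 and children:
        out["*"] = add_wildcards(merge(children), k - 1)
    return out


def query(node, pattern):
    """Follow one root-to-node path: p steps, no branching."""
    for ch in pattern:
        node = node.get(ch)
        if node is None:
            return set()
    return node.get(END, set())


def size(node):
    """Number of trie nodes, wildcard subtrees included."""
    return 1 + sum(size(sub) for ch, sub in node.items() if ch != END)


WORDS = ["fat", "far", "fit", "pit", "pin", "pay", "sit"]  # hypothetical
plain = build_trie(WORDS)
one = add_wildcards(plain, 1)
two = add_wildcards(plain, 2)
print(sorted(query(one, "*it")))           # ['fit', 'pit', 'sit']
print(size(plain), size(one), size(two))   # grows with k
```

A pattern with more wildcards than the k used at build time finds no `*` edge to follow and returns nothing, so k has to be fixed in advance.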

New Approach

[Figure: the dictionary trie with a wildcard (*) subtree that omits the heaviest child]

Query string: *it

Query time: 2^k p
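A sketch of the new approach under stated assumptions: the wildcard subtree at a node merges every child except the heaviest one, and a query that meets a wildcard continues in two places, the wildcard subtree and the heaviest child. That gives at most 2^k search paths for k wildcards. The `Node` class, the word list, and the function names are hypothetical, and weights are recomputed on demand for brevity rather than cached.

```python
class Node:
    def __init__(self):
        self.kids = {}      # char -> Node
        self.words = set()  # dictionary words ending here
        self.heavy = None   # char leading to the heaviest child
        self.star = None    # wildcard subtree over the other children


def build(words):
    root = Node()
    for w in words:
        node = root
        for ch in w:
            node = node.kids.setdefault(ch, Node())
        node.words.add(w)
    return root


def weight(node):
    """Number of dictionary words in the subtree."""
    return len(node.words) + sum(weight(c) for c in node.kids.values())


def merge(nodes):
    """Merge several subtrees into one new subtree."""
    out, groups = Node(), {}
    for n in nodes:
        out.words |= n.words
        for ch, c in n.kids.items():
            groups.setdefault(ch, []).append(c)
    for ch, group in groups.items():
        out.kids[ch] = merge(group)
    return out


def decorate(node, k):
    """Add heavy pointers and wildcard subtrees for up to k wildcards."""
    if node.kids:
        node.heavy = max(node.kids, key=lambda ch: weight(node.kids[ch]))
        light = [c for ch, c in node.kids.items() if ch != node.heavy]
        if k > 0 and light:
            node.star = decorate(merge(light), k - 1)
    for c in node.kids.values():
        decorate(c, k)
    return node


def query(node, pattern, i=0):
    if node is None:
        return set()
    if i == len(pattern):
        return set(node.words)
    ch = pattern[i]
    if ch == "*":
        # Two continuations per wildcard: wildcard subtree and heaviest child.
        heavy = node.kids.get(node.heavy)
        return query(node.star, pattern, i + 1) | query(heavy, pattern, i + 1)
    return query(node.kids.get(ch), pattern, i + 1)


WORDS = ["fat", "far", "fit", "pit", "pin", "pay", "sit"]  # hypothetical
index = decorate(build(WORDS), k=1)
print(sorted(query(index, "*it")))  # ['fit', 'pit', 'sit']
```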

Space Analysis

Create a wildcard subtree at each node in the original trie. The heaviest child is not in the wildcard tree.

Look at any leaf of the trie. How many of its ancestors were not the heaviest child? At most log2 n, so it appears in at most log n wildcard trees.

Space: n log n for k = 1, n log^k n in general.
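The counting argument can be checked on a small example. The sketch below, with a hypothetical word list and illustrative function names, counts for each dictionary word how many nodes on its root-to-leaf path are not the heaviest child of their parent. Each such step at least halves the number of words in the current subtree, so the count cannot exceed log2 of the number of words.

```python
import math

END = "$"  # marks the end of a dictionary word


def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = w
    return root


def weight(node):
    """Number of dictionary words in the subtree."""
    return sum(1 if ch == END else weight(sub) for ch, sub in node.items())


def light_counts(node, count=0, out=None):
    """Map each word to the number of non-heaviest children on its path."""
    if out is None:
        out = {}
    if END in node:
        out[node[END]] = count
    kids = {ch: sub for ch, sub in node.items() if ch != END}
    if kids:
        heavy = max(kids, key=lambda ch: weight(kids[ch]))
        for ch, sub in kids.items():
            light_counts(sub, count + (ch != heavy), out)
    return out


WORDS = ["fat", "far", "fit", "pit", "pin", "pay", "sit"]  # hypothetical
counts = light_counts(build_trie(WORDS))
print(max(counts.values()), "<=", math.log2(len(WORDS)))
```

Each word is copied into one first-level wildcard tree per counted step, which is where the log n factor in the space bound comes from.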

Edit Distance

Wildcard matching is (algorithmically) the simplest type of approximate search.

What issues come up when dealing with substitutions, insertions and deletions?

Substitution Search

[Figure: trie over strings of a's and b's]

Query string: aab

Substitution Tree

[Figure: the same trie with a substitution tree attached]

Query string: aab
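For comparison, a brute-force baseline for substitution search on a trie, with a hypothetical dictionary and illustrative names. Unlike a wildcard, a substitution can occur at any position of the query, so this search may branch at every node while it still has substitutions left. That cost is what the substitution trees are meant to avoid; this sketch is not the talk's data structure.

```python
END = "$"  # marks the end of a dictionary word


def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = w
    return root


def within_k_substitutions(node, pattern, k, i=0):
    """Yield words of the same length as pattern that differ in at most k positions."""
    if i == len(pattern):
        if END in node:
            yield node[END]
        return
    for ch, child in node.items():
        if ch == END:
            continue
        cost = int(ch != pattern[i])  # 1 if this edge needs a substitution
        if cost <= k:
            yield from within_k_substitutions(child, pattern, k - cost, i + 1)


WORDS = ["aab", "abb", "aba", "bbb", "aaba"]  # hypothetical
trie = build_trie(WORDS)
print(sorted(within_k_substitutions(trie, "aab", 1)))  # ['aab', 'abb']
```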

Deletion Tree

[Figure: trie over strings of a's, b's and c's with a deletion tree attached]

Insertion Tree

[Figure: the same trie with an insertion tree attached]

Grouping

[Figure: the trie over strings of a's and b's, with its substitution trees grouped step by step]

Query string: aab

Analysis

Can’t merge along all possible paths of original trie – too expensive.

Merge along centroid paths. Centroid paths always follow the heaviest child.

Any path from root to leaf traverses at most log n centroid paths.

Grouping

Suppose a search reached up to the 7th edge with no substitutions.

…then we search only three substitution trees.

Space increase: log n factor

Analysis

[Figure: a root-to-leaf path crossing centroid paths with weights w1, w2, w3, w4; log n searches on each centroid path]

Total number of searches: log n * log n = log^2 n

Analysis

For k = 1:

For each centroid path traversed, log n substitution subtree searches.

A path to a leaf traverses at most log n centroid paths.

log^2 n searches, or log n searches using balanced grouping.

More generally: log^k n searches. Using a Y-fast trie, each search takes log log n time, for log^k n log log n in total.

More Rigorous Analysis

Balanced Search Tree

Weight Balanced Search Tree

O(log(W/w)) levels

For a segment of a centroid path whose top has weight W and bottom has weight w, we do about log(W/w) searches.
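The log(W/w) bound can be illustrated with a simple weight-balanced construction. This is an assumed construction, not necessarily the one used in the talk: the root is the first item at which the running weight reaches half the total, so each side holds at most half the weight. An item of weight w in a tree of total weight W then sits at depth at most log2(W/w). The name `balanced_depths` and the example weights are hypothetical.

```python
def balanced_depths(weights, lo=0, hi=None, depth=0, out=None):
    """Depth of each item in a weight-balanced search tree over weights[lo:hi].

    Weights must be positive. The root of each range is the first index
    at which the running sum reaches half of the range's total weight.
    """
    if out is None:
        out, hi = {}, len(weights)
    if lo >= hi:
        return out
    total = sum(weights[lo:hi])
    acc = 0
    for root in range(lo, hi):
        acc += weights[root]
        if 2 * acc >= total:  # integer test for acc >= total / 2
            break
    out[root] = depth
    balanced_depths(weights, lo, root, depth + 1, out)        # weight < total / 2
    balanced_depths(weights, root + 1, hi, depth + 1, out)    # weight <= total / 2
    return out


weights = [64, 1, 1, 2, 4, 8, 16, 32]  # hypothetical subtree weights
W = sum(weights)
depths = balanced_depths(weights)
for i, w in enumerate(weights):
    # depth <= log2(W / w), written without floating point
    assert w * 2 ** depths[i] <= W
print(depths)
```

Heavy items end up near the root and light ones deeper, which is why searches along one centroid-path segment cost about log(W/w) rather than log n.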

Analysis

[Figure: a root-to-leaf path crossing centroid-path segments with weights w1, w2, w3, w4; log(w1/w2), log(w2/w3) and log(w3/w4) searches on successive segments]

Total number of searches: log(w1/w2) + log(w2/w3) + log(w3/w4) = log(w1/w4)
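The same telescoping works for any number of segments. Written out in general, with m segment weights w_1 ≥ … ≥ w_m (the index m, and the bounds w_1 ≤ n and w_m ≥ 1 used in the last step, are assumptions consistent with the weights being subtree sizes):

```latex
\sum_{i=1}^{m-1} \log\frac{w_i}{w_{i+1}}
  = \log\frac{w_1}{w_m}
  \le \log n
```

So the searches along a whole root-to-leaf path add up to about log n, instead of log n per centroid path.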


Time for one match: log^k n log log n / k!

Space: n (c log n)^k / k! for some constant c

Open Problem

Dynamic search structure. Requires a less strict notion of “centroid path”?
