suffix arrays a new method for online string searches u.manber and g.myers

Suffix Arrays

A New Method for Online String Searches

U.Manber and G.Myers

Introduction - String matching

Let A = a0a1...aN-1 be a large text of length N

Let W = w0w1...wp-1 be a word of length P

Is W a substring of A?

Introduction - Suffix Trees

Build time O(N)

Search time O(P)

Structure space O(N) Big constant

Dependent of |Σ|

Suffix Arrays

An array of all the suffixes of A Sorted by lexicographical order

A = aababa

aaababaabaabababababa

A = aababa

Ai = aiai+1...aN-1 The suffix of A that starts at position i.

Position array (Pos) Pos[k] is the start position of kth smallest suffix APos[k] is the suffix pointed from Pos[k] APos[k] is the kth smallest suffix

Pos 012345

Suffix Arrays

503142012345

Searching

“Is W a substring of A?” W is a substring of A

Some suffix Ai starts with W i is W’s location All the instances of W must match

consecutive suffixes in the array Find the array interval that contains

those suffixes

Searching - Definitions

For a string u up = u0u1...up-1

For strings u,v u ≤p v up ≤ vp

Same for ≠, =, >… For any p, Pos is ordered according

to ≤p

Searching - Definitions

W = w0w1…wP-1

LW = min (k : W ≤p APos[k] or k = N) First suffix ≥p from W

RW = max (k : APos[k] ≤p W or k = 1) Last suffix ≤p from W

LW RW

W >p APos[k] W < p APos[k]

W =p APos[k]

Search Algorithm

k [LW, RW] W =p APos[k] To find W’s instances - find [LW, RW] Number of W’s occurrences is

(RW-LW+1)

Matches are APos[LW],…, APos[RW]

Suffix array is sorted - use binary search

Binary Search

Search interval [L,R] Midpoint M Compare W to APos[M]

Decide where to search next W ≤p APos[M] - search in left half (R = M)

W >p APos[M] - search in right half (L = M) O(PlogN)

aababcbcdcbbW = abc

L M R

Search Algorithm

Observation: We can use information from one

comparison to speedup the next comparisons

Use additional information lcp = longest common prefix

Search Algorithm - lcp

lcp(v,w) = the length of the longest common prefix of v and w

Obtained by comparing v and w and stopping at the first unequal symbol

Use precomputed lcp information to reduce the number of comparisons to O(P + logN)

Search Algorithm

Consider all possible midpoints M = 1…N-2

Every midpoint corresponds to a triplet [LM,M,RM]

Suppose we precomputed two arrays: Llcp[M] = lcp (APos[LM], APos[M])

Rlcp[M] = lcp (APos[M], APos[RM])

Search Algorithm

Maintain two more variables l = lcp(APos[L], W)

r = lcp(W, APos[R])

W = abcd

abaaabbabcabcdacacaacbacdad

l = 2 r = 1

LM RMM

Llcp[M] = 1 Rlcp[M] = 1

Search Algorithm

Assume l≥r Compare l with Llcp

If l < Llcp[M] W >l+1 APos[LM]

APos[LM] =l+1 APos[M]

W >l+1 APos[M]

abaabaaababababaabacabcdacacdad

l = 2 r = 1

LM RMM


W = abcd

Go Right!l remains unchanged

Search Algorithm

If l > Llcp[M] APos[LM] <l APos[M]

W =l APos[LM]

W <l APos[M]

abaabcdabdacacaadadaadbadc

l = 2 r = 1

LM RMM


W = abcd

Go Left!r = Llcp[M]

Search Algorithm

If l = Llcp[M] W can be in either half Start comparing A and APos[M] from the (l+1)

symbol First unequal symbol determines whether to go

right or left r/l will be updated to l+j j+1 comparisons

ababaabaaabcabccabcdadaadbadc

l = 2 r = 1

LM RMM


W = abcd

Search Algorithm - Complexity

In each Iteration: Let h=max(l,r) We start comparing from the hth symbol

to the h+j+1 j+1 symbol comparisons Next time we will start from the h+j

symbol j symbols out of the j+1 will not be

compared again

Search Algorithm - Complexity

Every symbol in W will be successfully matched at most once O(P) successful comparisons

At most one symbol will be unsuccessfully matched in each iteration O(logN) unsuccessful comaprsions

Total: O(P + logN) comparisons

Build Suffix Array

So far… A O(P + logN) search algorithm Given a sorted suffix array Given lcp information (Llcp, Rlcp)

Next… Sort the suffix array in O(NlogN) Compute the lcp’s while sorting the

array

Sort Algorithm

First stage Sort the suffixes into buckets, according to first

symbol

Inductive stage Assume array is bucket sorted according to

first H symbols Every H-bucket holds suffixes with the same H

first symbols Buckets are ordered according to the ≤H

relation Sort according to 2H first symbols

Sort Algorithm – Intuition

Let Ai, Aj be two suffixes in the same H-bucket

Ai =H Aj

Next H symbols of Ai and Aj are the first H symbols of Ai+H and Aj+H

In order to determine the ≤2H order of Ai and Aj, look at the ≤H order of Ai+H and Aj+H

aaaaababaaabaaababaababaabaa

A = aababaa

H = 2

Ai AjAj+H

Ai+H

Sort Algorithm – Main Idea

Let Ai be a suffix in the first H-bucket

Ai starts with the smallest H-symbol string

Ai-H should be the first in its 2H-bucket

aababaabaaabababababa

A = aababa H = 1

Sort Algorithm

In stage H Go over all the suffixes in the ≤H order

For each Ai move Ai-H to the next available place in its H-bucket

The suffixes are now sorted according to the ≤2H order

Go on to stage 2H to produce ≤4H order

in

Sort Algorithm - Example

01234567

n

A = assassin

sin

A3A0A6A7A1A5A4A2

ssassinssinsassinassassin assin

sassinssinsinssassinninassassinassinH = 1

H = 2

Sort Algorithm - Example

A = assassin5 6210 743

A0A3A6A7A2A5A4A1

ssassinssinsinsassinninassinassassinH = 2

H = 4 ssinssassinsinsassinninassinassassin

A0A3A6A7A2A5A1A4

Sort Algorithm - Complexity

First Stage Bucket sort according to first symbol O(NlogN)

Inductive Stages O(logN) stages O(N) per stage

Total O(NlogN)

Space Can be implemented using two N-sized integer

arrays

Finding Longest Common Prefixes

The search algorithm uses lcp information: Llcp[M] = lcp (APos[LM], APos[M])

Rlcp[M] = lcp (APos[M], APos[RM])

We want to compute this information while we are sorting the array

Finding Longest Common Prefixes

Show how to compute lcp’s for suffixes in adjacent H-buckets during the sort algorithm

Use that to compute the lcp’s of all the suffixes that are consecutive in the sorted suffix array

Show how to compute lcps for all the necessary suffixes

Finding LCP for adjacent buckets

After the first sort stage, lcp’s of suffixes in adjacent buckets is 0

Assume after stage H we know the lcps between suffixes in adjacent H-buckets

Suppose Ap and Aq are in the same H-bucket but not in the same 2H bucket H ≤ lcp(Ap, Aq) < 2H lcp(Ap, Aq) = H + lcp(Ap+H, Aq+H) lcp(Ap+H, Aq+H) < H

Let i,j be Ap+H, Aq+H’s positions in the suffix array

Assume i<j Array is ordered according to the <H order

lcp(APos[i], APos[j]) = min(lcp(APos[k-1], APos[k]))

Finding LCP for adjacent buckets

k [i+1,j]

aababaabaaabababababaH = 1

i j2 1 0

LCP Data Structures – Hgt][

We need a data structure that will allow us: get the lcp’s of consecutive suffixes get their minimum

Hgt[] – an N-1 sized array Hgt[i] = lcp(APos[i-1], APos[i])

Hgt will be computed inductively throughout the sort Initialized to N+1 Hgt[i] is updated in stage 2H

APos[i] started a new 2H-bucket To update Hgt[i]:

Let a,b be the array positions of APos[i-1]+H

and APos[i] +H

Assume a≤b Hgt[i] = H + min(Hgt[k])

LCP Data Structures – Hgt][

k [a+1,b]

lcp (sin, ssin) = 1+ lcp(in, sin) = 1 + min(lcp(in,n), lcp(n,sassin), lcp(sassin,sin) = 1 + 0 = 1

lcp(sassin,sin) = 1 + lcp(assin, in) = 1

Finding LCP - Example

assinassassininnssassinsinssinsassin

assassinassininnsassinsinssinssassin

0 0 0 9 999

1 10 0 0 99

0 0 0 1 1 23

H = 2

H = 1

assassinassininnsassinsinssassinssinH = 4

We need the following operations for Hgt[]: Set(i, h) – sets Hgt[i] to h Min_height(i,j) – determines min(Hgt[k])

We need to find a way to find the lcp’s for all the necessary suffixes – not just the ones in consecutive positions

k [i,j]

LCP Data Structures - Interval Tree


A full and balanced binary tree N-1 leaves, correspond to Hgt[] O(logN) height, N-2 interior vertices Keep a Hgt value for each interior

vertex as well: Hgt[v] = min(Hgt[left(v)], Hgt[right(v)])


Operations implementation: Set(i,h)

Set Hgt[i] to h and update the Hgt values on the path from i to the root

Min-height(i,j) Finds the minimal Hgt value by scanning

O(logN) vertices in the tree

Operations complexity – O(logN)

Finding LCP – Interval Tree

(2,3) (3,4) (4,5) (5,6) (6,7)(1,2)(0,1)

0

9 0 0 0

0 0

9

0

9 9

9

9

1

1

1

Finding LCP - Complexity

In stage 2H we update Hgt[i] for all the leaves that started new buckets Each update is one set operation and one

Min_height - O(logN) Throughout the algorithm every leaf is updated

exactly once - O(N) updates Updates complexity: O(NlogN)

In each stage we scan the array to see which suffixes opened new buckets Scans complexity: O(NlogN)

Total LCP complexity O(NlogN)

Finding LCP - Llcp][ and Rlcp][

We want Llcp[] and Rlcp[] to be available directly from the interval tree at the end of the sort

Use an interval tree that represents a binary search Each interior node corresponds to (LM, RM) for some

M For each interior node (LM, RM)

Left(LM, RM) = (LM,M) Right(LM, RM) = (M, RM)

N-2 interior nodes Leaves correspond to (i-1,i) Leaf(i-1,i) = Hgt[i]

Finding LCP - Llcp][ and Rlcp][

According to interval tree structure: Hgt[(L,R)] = min(Hgt[k])

Hgt[(L,R)] = lcp (APos[L], APos[R])

Llcp[M] = Hgt[(LM,M)]

Rlcp[M] = Hgt[(M,RM)]

k [L+1,R]

Worst Case Complexity

Suffix Array Build time

O(NlogN) Search time

O(P+logN) Structure space

O(N) 2N - 3N integers

Independent of |Σ|

Suffix Tree Build time

O(N) Search time

O(P) Structure space

O(N) Big constant

Dependent of |Σ|

Expected Time Improvements

Improve the expected case time of Search Algorithm Sort Algorithm LCP computation

Use the following assumptions All N-symbol strings are equally likely Under this assumption:

Expected length of longest repeated substring of A is O(log|Σ|N)

Expected Case Improvements - Main Idea

Let T = Let IntT(u) = integer encoding in base |Σ| of the T-

symbol prefix of u Example:

T = 3 Σ = a,b u = abaa IntT(u) = 010 = 2

There are |Σ|T ≤ N possible T-symbol prefixes IntT(u) is a number in [0,N-1]

Map each suffix Ap to IntT(Ap) Can be done in O(N) time

Nlog

Expected Case Improvements - Search Algorithm

Use an additional array Buck[] Think of the sorted array as buckets,

based on the IntT encoding

Buck[k] = min{ i | IntT (APos[i]) = k} The first position that contains a suffix

that’s mapped to k

Compute Buck[] at the end of the sort algorithm O(N) additional time


Given a word W We need to find Lw and Rw

Let k = IntT(W)

Lw and Rw must be in k’s bucket (Buck[k], Buck[k+1])

We only need to search one bucket


Number of buckets = |Σ|T ≤ N Average number of elements in a

bucket = O(1) In the binary search for W

Expected size of bucket to search = O(1)

Expected number of search steps: O(1) Expected case time: O(P)

Expected Case Improvements - Sort Algorithm

First stage of sort Sort according to first symbol

Replace first stage with sort according to IntT

Equivalent to sort according to first T symbols

Can be done in O(N) time We changed the base case of the sort

from H=1 to H=T


Observation: Let C be the length of the longest

repeated substring of A Sort is in fact complete once we have

reached (C+1)-buckets Suppose some (C+1)-bucket contains more

than one suffix Then we have two suffixes with lcp > C This prefix is a repeated substring longer than

C - contradiction


Expected case: C = O(log|Σ|N) = O(T) Number of stages: O(1)

Expected case time: O(N)

Expected Case Improvements - LCP Computation

Replace interval tree with sort history Binary tree Models the refinement of buckets

during the sort A vertex for each H-bucket Each vertex holds the stage number

at which its bucket was split


Leaves correspond to suffixes and are arranged in an N element array

Each vertex has at least two children

O(N) nodes Can be built with O(N) additional

time during the sort


Given the sort history we can compute lcp(Ap, Aq) Find the nca (nearest common

ancestor) of Ap and Aq

Let H be the nca’s stage number lcp(Ap, Aq) = H + lcp(Ap+H, Aq+H)

Recursively compute lcp(Ap+H, Aq+H) Stop when the nca is the root


Each step is O(1) At each step the stage number of the nca

is at least halved Suppose we stop the recursion when

H < T’ =

Expected length of longest repeated substring is O(T) Expected case lcp is O(T) = O(log|Σ|N)

Nlog2

1


O(1) recursive steps in the expected case

Expected case time for one lcp: O(1) Expected case time for computing

Llcp[], Rlcp[]: O(N)


We need a way to find lcp’s that are known to be less than T’

Build a |Σ|T’ x |Σ|T’ array: Lookup[IntT’(x), IntT’(y)] = lcp(x,y) for all

T’-symbol strings x,y Max N entries (|Σ|T’ = √N) Compute incrementally in O(N) Final recursion steps are replaced by

O(1) lookup

Expected Time Complexity

Search time O(P)

Sort + LCP computation time O(N)

suffix arrays a new method for online string searches u.manber and g.myers

Documents