Post on 19-Dec-2015
Suffix Arrays
A New Method for Online String Searches
U.Manber and G.Myers
Introduction - String matching
Let A = a_0 a_1 ... a_{N-1} be a large text of length N
Let W = w_0 w_1 ... w_{P-1} be a word of length P
Is W a substring of A?
Introduction - Suffix Trees
Build time: O(N)
Search time: O(P)
Structure space: O(N), with a big constant
Dependent on |Σ|
Suffix Arrays
An array of all the suffixes of A, sorted in lexicographical order
A = aababa
Sorted suffixes: a, aababa, aba, ababa, ba, baba
Suffix Arrays
A = aababa
A_i = a_i a_{i+1} ... a_{N-1}: the suffix of A that starts at position i
Position array (Pos): Pos[k] is the start position of the kth smallest suffix, so A_{Pos[k]} is the kth smallest suffix
k:      0 1 2 3 4 5
Pos[k]: 5 0 3 1 4 2
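The Pos array for the example can be reproduced with a direct sort of the suffixes. This is only a minimal sketch for checking small examples: the function name `build_pos_naive` is an assumption, and sorting whole suffixes costs far more than the O(N log N) construction described later in the slides.

```python
def build_pos_naive(text):
    """Return Pos: the start positions of the suffixes of text in lexicographic order."""
    # Sorting by the suffix itself is simple but slow; it is only meant for small inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


if __name__ == "__main__":
    text = "aababa"
    for k, start in enumerate(build_pos_naive(text)):
        print(k, start, text[start:])
```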
Searching
“Is W a substring of A?”
W is a substring of A ⇔ some suffix A_i starts with W; i is then a location of W
All the instances of W must match consecutive suffixes in the array
Find the array interval that contains those suffixes
Searching - Definitions
For a string u: u^p = u_0 u_1 ... u_{p-1} (the first p symbols of u)
For strings u, v: u ≤_p v ⇔ u^p ≤ v^p
Same for ≠, =, >, …
For any p, Pos is ordered according to ≤_p
Searching - Definitions
W = w_0 w_1 … w_{P-1}
L_W = min (k : W ≤_p A_{Pos[k]} or k = N): the first suffix that is ≥_p W
R_W = max (k : A_{Pos[k]} ≤_p W or k = -1): the last suffix that is ≤_p W
For k < L_W: W >_p A_{Pos[k]}
For L_W ≤ k ≤ R_W: W =_p A_{Pos[k]}
For k > R_W: W <_p A_{Pos[k]}
Search Algorithm
k ∈ [L_W, R_W] ⇔ W =_p A_{Pos[k]}
To find W's instances, find [L_W, R_W]
Number of W's occurrences is (R_W - L_W + 1)
Matches are A_{Pos[L_W]}, …, A_{Pos[R_W]}
Suffix array is sorted: use binary search
Binary Search
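Search interval [L, R], midpoint M
Compare W to A_{Pos[M]} and decide where to search next
  W ≤_p A_{Pos[M]}: search in the left half (R = M)
  W >_p A_{Pos[M]}: search in the right half (L = M)
O(log N) steps with up to P symbol comparisons each: O(P log N)
Example figure: binary search for W = abc over the sorted suffixes, with pointers L, M, R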
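The plain O(P log N) search can be sketched as two binary searches, one for L_W and one for R_W. The function name `find_interval` is an assumption, and the loops use half-open bounds rather than the slide's R = M / L = M updates; the comparisons are the same ≤_p tests on the first P symbols of each suffix.

```python
def find_interval(text, pos, w):
    """Return (L_W, R_W) for word w, given the sorted suffix positions pos."""
    n, p = len(text), len(w)

    def prefix(k):
        # First p symbols of the k-th smallest suffix.
        return text[pos[k]:pos[k] + p]

    # L_W: first k with W <=_p A_Pos[k], or n if there is none.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if w <= prefix(mid):
            hi = mid
        else:
            lo = mid + 1
    lw = lo

    # R_W: last k with A_Pos[k] <=_p W, or -1 if there is none.
    lo, hi = lw, n
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= w:
            lo = mid + 1
        else:
            hi = mid
    return lw, lo - 1


if __name__ == "__main__":
    text = "aababa"
    pos = sorted(range(len(text)), key=lambda i: text[i:])  # naive Pos, for the demo only
    lw, rw = find_interval(text, pos, "ab")
    print("occurrences:", rw - lw + 1, "at", sorted(pos[lw:rw + 1]))
```

The number of occurrences is R_W - L_W + 1, which is 0 when the word does not occur because R_W then equals L_W - 1.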
Search Algorithm
Observation: we can use information from one comparison to speed up the next comparisons
Use additional information: lcp = longest common prefix
Search Algorithm - lcp
lcp(v,w) = the length of the longest common prefix of v and w
Obtained by comparing v and w and stopping at the first unequal symbol
Use precomputed lcp information to reduce the number of comparisons to O(P + logN)
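The lcp definition translates directly into a short loop. This is a minimal sketch; the function name `lcp` is chosen here for illustration.

```python
def lcp(v, w):
    """Length of the longest common prefix of v and w."""
    length = 0
    # Stop at the first unequal symbol, or when either string runs out.
    while length < len(v) and length < len(w) and v[length] == w[length]:
        length += 1
    return length


if __name__ == "__main__":
    print(lcp("abcd", "abca"))
```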
Search Algorithm
Consider all possible midpoints M = 1…N-2
Every midpoint corresponds to a triplet [L_M, M, R_M]
Suppose we precomputed two arrays:
  Llcp[M] = lcp(A_{Pos[L_M]}, A_{Pos[M]})
  Rlcp[M] = lcp(A_{Pos[M]}, A_{Pos[R_M]})
Search Algorithm
Maintain two more variables:
  l = lcp(A_{Pos[L]}, W)
  r = lcp(W, A_{Pos[R]})
Example figure: W = abcd with pointers L_M, M, R_M over the sorted suffixes; l = 2, r = 1, Llcp[M] = 1, Rlcp[M] = 1
Search Algorithm
Assume l ≥ r and compare l with Llcp[M]
If l < Llcp[M]:
  W >_{l+1} A_{Pos[L_M]}
  A_{Pos[L_M]} =_{l+1} A_{Pos[M]}
  so W >_{l+1} A_{Pos[M]}
  Go right! l remains unchanged
Example figure: W = abcd, l = 2, r = 1, Llcp[M] = 3, Rlcp[M] = 1
Search Algorithm
If l > Llcp[M]:
  A_{Pos[L_M]} <_l A_{Pos[M]}
  W =_l A_{Pos[L_M]}
  so W <_l A_{Pos[M]}
  Go left! r = Llcp[M]
Example figure: W = abcd, l = 2, r = 1, Llcp[M] = 1, Rlcp[M] = 1
Search Algorithm
If l = Llcp[M]:
  W can be in either half
  Start comparing W and A_{Pos[M]} from symbol l+1
  The first unequal symbol determines whether to go right or left
  r or l will be updated to l+j, at a cost of j+1 comparisons
Example figure: W = abcd, l = 2, r = 1, Llcp[M] = 2, Rlcp[M] = 1
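The three cases can be combined into one search loop for L_W. The sketch below rests on two assumptions: the names `find_lw` and `lcp` are illustrative, and the lcp between two suffixes is recomputed directly where the slides would read the precomputed Llcp[M] or Rlcp[M]. It therefore shows the case analysis, not the O(P + log N) bound. The case r > l is handled symmetrically with Rlcp.

```python
def lcp(u, v):
    """Length of the longest common prefix of u and v."""
    k = 0
    while k < len(u) and k < len(v) and u[k] == v[k]:
        k += 1
    return k


def find_lw(text, pos, w):
    """Return L_W, the first k with W <=_p A_Pos[k], or N if there is none."""
    n, p = len(text), len(w)

    def suf(k):
        return text[pos[k]:]

    if w <= suf(0)[:p]:
        return 0
    if w > suf(n - 1)[:p]:
        return n
    lo, hi = 0, n - 1
    l, r = lcp(suf(lo), w), lcp(suf(hi), w)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if l >= r:
            known, base = lcp(suf(lo), suf(mid)), l   # stand-in for Llcp[mid]
        else:
            known, base = lcp(suf(mid), suf(hi)), r   # stand-in for Rlcp[mid]
        if known >= base:
            # W and A_Pos[mid] agree on the first `base` symbols: compare onward.
            m = base + lcp(suf(mid)[base:], w[base:])
        else:
            # A_Pos[mid] leaves the shared prefix early, so lcp(W, A_Pos[mid]) = known.
            m = known
        if m == p or (pos[mid] + m < n and w[m] <= text[pos[mid] + m]):
            hi, r = mid, m      # W <=_p A_Pos[mid]: go left
        else:
            lo, l = mid, m      # W >_p A_Pos[mid]: go right
    return hi


if __name__ == "__main__":
    text = "assassin"
    pos = sorted(range(len(text)), key=lambda i: text[i:])  # naive Pos, for the demo only
    print(find_lw(text, pos, "ss"))
```

Replacing the two `lcp(suf(...), suf(...))` calls with table lookups is what removes the repeated symbol comparisons in the version the slides describe.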
Search Algorithm - Complexity
In each iteration:
  Let h = max(l, r)
  We start comparing at symbol h and stop at the first mismatch, after j matches: j+1 symbol comparisons
  Next time we will start from symbol h+j
  j symbols out of the j+1 will not be compared again
Search Algorithm - Complexity
Every symbol in W will be successfully matched at most once: O(P) successful comparisons
At most one symbol will be unsuccessfully matched in each iteration: O(log N) unsuccessful comparisons
Total: O(P + log N) comparisons
Build Suffix Array
So far…
  An O(P + log N) search algorithm
  Given a sorted suffix array
  Given lcp information (Llcp, Rlcp)
Next…
  Sort the suffix array in O(N log N)
  Compute the lcp's while sorting the array
Sort Algorithm
First stage
  Sort the suffixes into buckets according to their first symbol
Inductive stage
  Assume the array is bucket sorted according to the first H symbols
  Every H-bucket holds suffixes with the same first H symbols
  Buckets are ordered according to the ≤_H relation
  Sort according to the first 2H symbols
Sort Algorithm – Intuition
Let A_i, A_j be two suffixes in the same H-bucket
A_i =_H A_j
The next H symbols of A_i and A_j are the first H symbols of A_{i+H} and A_{j+H}
To determine the ≤_{2H} order of A_i and A_j, look at the ≤_H order of A_{i+H} and A_{j+H}
Example figure: A = aababaa, H = 2; suffixes in ≤_2 order: a, aa, aababaa, abaa, ababaa, babaa, baa; the figure marks A_i, A_j and A_{i+H}, A_{j+H}
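The intuition above is enough to build the array with an ordinary comparison sort: give every suffix its H-bucket number as a rank, then sort by the pair (rank of A_i, rank of A_{i+H}). The sketch below swaps in Python's built-in sort for the bucket technique of the slides, so it runs in O(N log^2 N) rather than O(N log N). The function name `suffix_array_doubling` is an assumption.

```python
def suffix_array_doubling(text):
    """Build Pos by prefix doubling with a comparison sort on rank pairs."""
    n = len(text)
    if n == 0:
        return []
    rank = [ord(c) for c in text]   # rank by first symbol (H = 1)
    pos = list(range(n))
    h = 1
    while True:
        def key(i):
            # (rank of A_i by H symbols, rank of A_{i+H} by H symbols);
            # -1 stands for "A_{i+H} is empty" and sorts before every real rank.
            return (rank[i], rank[i + h] if i + h < n else -1)

        pos.sort(key=key)
        new_rank = [0] * n
        for k in range(1, n):
            same_bucket = key(pos[k]) == key(pos[k - 1])
            new_rank[pos[k]] = new_rank[pos[k - 1]] + (0 if same_bucket else 1)
        rank = new_rank
        if rank[pos[-1]] == n - 1:  # every suffix is alone in its bucket
            return pos
        h *= 2


if __name__ == "__main__":
    print(suffix_array_doubling("assassin"))
```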
Sort Algorithm – Main Idea
Let A_i be a suffix in the first H-bucket
A_i starts with the smallest H-symbol string
A_{i-H} should be the first in its 2H-bucket
Example figure: A = aababa, H = 1; suffixes in ≤_1 order: aababa, aba, a, ababa, baba, ba
Sort Algorithm
In stage H
  Go over all the suffixes in the ≤_H order
  For each A_i, move A_{i-H} to the next available place in its H-bucket
The suffixes are now sorted according to the ≤_{2H} order
Go on to stage 2H to produce the ≤_{4H} order
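One stage of this scan can be sketched as follows. The names `refine_stage` and `suffix_array_by_stages` are assumptions, and the sketch simplifies in two ways: H-buckets are found by slicing out H-symbol prefixes into a dictionary, which is not how an O(N)-per-stage implementation tracks bucket boundaries, and suffixes with at most H symbols are placed first because A_{i+H} is empty for them.

```python
def refine_stage(text, pos, h):
    """Turn a <=_H ordering of the suffixes into a <=_2H ordering."""
    n = len(text)
    next_free = {}                       # H-prefix -> next free slot in its H-bucket
    for k, i in enumerate(pos):
        next_free.setdefault(text[i:i + h], k)
    new_pos = [-1] * n

    def place(i):
        bucket = text[i:i + h]
        new_pos[next_free[bucket]] = i
        next_free[bucket] += 1

    # Suffixes with at most H symbols come first within their H-bucket.
    for i in range(max(n - h, 0), n):
        place(i)
    # Scan in <=_H order; each A_i pulls A_{i-H} to the next free place of its bucket.
    for i in pos:
        if i - h >= 0:
            place(i - h)
    return new_pos


def suffix_array_by_stages(text):
    n = len(text)
    pos = sorted(range(n), key=lambda i: text[i])   # first stage: first symbol only
    h = 1
    while h < n:
        pos = refine_stage(text, pos, h)
        h *= 2
    return pos


if __name__ == "__main__":
    print(suffix_array_by_stages("assassin"))
```

For A = assassin the first stage gives the order A_0 A_3 A_6 A_7 A_1 A_2 A_4 A_5 (ties inside a bucket keep text order), and the stage with H = 1 already yields A_0 A_3 A_6 A_7 A_2 A_5 A_1 A_4.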
Sort Algorithm - Example
A = assassin (positions 0 1 2 3 4 5 6 7)
H = 1: A_3 A_0 A_6 A_7 A_1 A_5 A_4 A_2 = assin, assassin, in, n, ssassin, sin, ssin, sassin
H = 2: A_0 A_3 A_6 A_7 A_2 A_5 A_4 A_1 = assassin, assin, in, n, sassin, sin, ssin, ssassin
Sort Algorithm - Example
A = assassin
H = 2: A_0 A_3 A_6 A_7 A_2 A_5 A_4 A_1 = assassin, assin, in, n, sassin, sin, ssin, ssassin
H = 4: A_0 A_3 A_6 A_7 A_2 A_5 A_1 A_4 = assassin, assin, in, n, sassin, sin, ssassin, ssin
Sort Algorithm - Complexity
First stage
  Bucket sort according to first symbol: O(N log N)
Inductive stages
  O(log N) stages, O(N) per stage
Total: O(N log N)
Space
  Can be implemented using two N-sized integer arrays
Finding Longest Common Prefixes
The search algorithm uses lcp information:
  Llcp[M] = lcp(A_{Pos[L_M]}, A_{Pos[M]})
  Rlcp[M] = lcp(A_{Pos[M]}, A_{Pos[R_M]})
We want to compute this information while we are sorting the array
Finding Longest Common Prefixes
Show how to compute lcp’s for suffixes in adjacent H-buckets during the sort algorithm
Use that to compute the lcp’s of all the suffixes that are consecutive in the sorted suffix array
Show how to compute lcps for all the necessary suffixes
Finding LCP for adjacent buckets
After the first sort stage, the lcp of suffixes in adjacent buckets is 0
Assume that after stage H we know the lcp's between suffixes in adjacent H-buckets
Suppose A_p and A_q are in the same H-bucket but not in the same 2H-bucket
  H ≤ lcp(A_p, A_q) < 2H
  lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H})
  lcp(A_{p+H}, A_{q+H}) < H
Let i, j be the positions of A_{p+H}, A_{q+H} in the suffix array
Assume i < j; the array is ordered according to the <_H order
lcp(A_{Pos[i]}, A_{Pos[j]}) = min(lcp(A_{Pos[k-1]}, A_{Pos[k]})) over k ∈ [i+1, j]
Finding LCP for adjacent buckets
Example figure: A = aababa, H = 1; suffixes in ≤_1 order: aababa, aba, a, ababa, baba, ba, with array positions i and j marked
LCP Data Structures – Hgt[]
We need a data structure that will allow us to:
  get the lcp's of consecutive suffixes
  get their minimum
Hgt[]: an N-1 sized array, Hgt[i] = lcp(A_{Pos[i-1]}, A_{Pos[i]})
Hgt will be computed inductively throughout the sort
  Initialized to N+1
  Hgt[i] is updated in stage 2H if A_{Pos[i]} started a new 2H-bucket
To update Hgt[i]:
  Let a, b be the array positions of A_{Pos[i-1]+H} and A_{Pos[i]+H}
  Assume a ≤ b
  Hgt[i] = H + min(Hgt[k]) over k ∈ [a+1, b]
LCP Data Structures – Hgt[]
lcp(sin, ssin) = 1 + lcp(in, sin) = 1 + min(lcp(in, n), lcp(n, sassin), lcp(sassin, sin)) = 1 + 0 = 1
lcp(sassin, sin) = 1 + lcp(assin, in) = 1 + 0 = 1
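The range-minimum property behind Hgt[] can be checked on a finished suffix array. In this sketch the names `build_hgt` and `lcp_from_hgt` are assumptions, and Hgt is filled by direct comparison of neighbouring suffixes rather than inductively during the sort. Entry 0 is left unused so that the indices match Hgt[i] = lcp(A_{Pos[i-1]}, A_{Pos[i]}).

```python
def lcp(u, v):
    """Length of the longest common prefix of u and v."""
    k = 0
    while k < len(u) and k < len(v) and u[k] == v[k]:
        k += 1
    return k


def build_hgt(text, pos):
    """Hgt[i] = lcp of the (i-1)-th and i-th smallest suffixes; Hgt[0] is unused."""
    hgt = [None] * len(pos)
    for i in range(1, len(pos)):
        hgt[i] = lcp(text[pos[i - 1]:], text[pos[i]:])
    return hgt


def lcp_from_hgt(hgt, i, j):
    """lcp of the suffixes at array positions i < j, as min(Hgt[k]) over k in [i+1, j]."""
    return min(hgt[i + 1:j + 1])


if __name__ == "__main__":
    text = "assassin"
    pos = sorted(range(len(text)), key=lambda i: text[i:])  # naive Pos, for the demo only
    hgt = build_hgt(text, pos)
    print(hgt[1:])
    print(lcp_from_hgt(hgt, 4, 7))  # sassin vs ssin
```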
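Finding LCP - Example
A = assassin; 9 = N+1 marks an Hgt entry that has not been set yet
H = 1: assin assassin in n ssassin sin ssin sassin
       Hgt = 9 0 0 0 9 9 9
H = 2: assassin assin in n sassin sin ssin ssassin
       Hgt = 9 0 0 0 1 1 9
H = 4: assassin assin in n sassin sin ssassin ssin
       Hgt = 3 0 0 0 1 1 2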
LCP Data Structures - Interval Tree
We need the following operations for Hgt[]:
  Set(i, h): sets Hgt[i] to h
  Min_height(i, j): determines min(Hgt[k]) over k ∈ [i, j]
We need a way to find the lcp's for all the necessary suffixes, not just the ones in consecutive positions
LCP Data Structures - Interval Tree
A full and balanced binary tree
  N-1 leaves, corresponding to Hgt[]
  O(log N) height, N-2 interior vertices
Keep a Hgt value for each interior vertex as well:
  Hgt[v] = min(Hgt[left(v)], Hgt[right(v)])
LCP Data Structures - Interval Tree
Operations implementation:
  Set(i, h): set Hgt[i] to h and update the Hgt values on the path from i to the root
  Min_height(i, j): finds the minimal Hgt value by scanning O(log N) vertices in the tree
Operations complexity: O(log N)
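The two operations can be sketched with an array-backed minimum tree. The class name `IntervalTree` and the method names are assumptions, leaves are numbered from 0 (leaf i - 1 holds Hgt[i]), and the layout is the usual bottom-up segment tree rather than the binary-search-shaped tree the slides use for Llcp and Rlcp.

```python
class IntervalTree:
    """Minimum tree over a fixed number of leaves; both operations take O(log N)."""

    def __init__(self, size, initial):
        self.size = size
        # Nodes 1 .. size-1 are interior vertices, nodes size .. 2*size-1 are leaves.
        self.tree = [initial] * (2 * size)

    def set(self, i, h):
        """Set leaf i to h and refresh the minima on the path to the root."""
        v = i + self.size
        self.tree[v] = h
        v //= 2
        while v >= 1:
            self.tree[v] = min(self.tree[2 * v], self.tree[2 * v + 1])
            v //= 2

    def min_height(self, i, j):
        """Minimum over leaves i..j inclusive."""
        lo, hi = i + self.size, j + self.size + 1
        best = float("inf")
        while lo < hi:
            if lo & 1:
                best = min(best, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                best = min(best, self.tree[hi])
            lo //= 2
            hi //= 2
        return best


if __name__ == "__main__":
    tree = IntervalTree(7, 9)                 # N = 8, entries start at N + 1 = 9
    for leaf, value in enumerate([3, 0, 0, 0, 1, 1, 2]):
        tree.set(leaf, value)
    print(tree.min_height(4, 6))
```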
Finding LCP – Interval Tree
Example figure: interval tree with leaves (0,1), (1,2), (2,3), (3,4), (4,5), (5,6), (6,7) and a Hgt value at each vertex
Finding LCP - Complexity
In stage 2H we update Hgt[i] for all the leaves that started new buckets
  Each update is one Set operation and one Min_height: O(log N)
  Throughout the algorithm every leaf is updated exactly once: O(N) updates
  Updates complexity: O(N log N)
In each stage we scan the array to see which suffixes opened new buckets
  Scans complexity: O(N log N)
Total LCP complexity: O(N log N)
Finding LCP - Llcp[] and Rlcp[]
We want Llcp[] and Rlcp[] to be available directly from the interval tree at the end of the sort
Use an interval tree that represents a binary search
  Each interior node corresponds to (L_M, R_M) for some M
  For each interior node (L_M, R_M): Left(L_M, R_M) = (L_M, M), Right(L_M, R_M) = (M, R_M)
  N-2 interior nodes
  Leaves correspond to (i-1, i): Leaf(i-1, i) = Hgt[i]
Finding LCP - Llcp[] and Rlcp[]
According to the interval tree structure:
  Hgt[(L, R)] = min(Hgt[k]) over k ∈ [L+1, R]
  Hgt[(L, R)] = lcp(A_{Pos[L]}, A_{Pos[R]})
  Llcp[M] = Hgt[(L_M, M)]
  Rlcp[M] = Hgt[(M, R_M)]
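Given a finished Hgt[] array, Llcp[] and Rlcp[] can be filled by walking the same intervals a binary search visits. This is a sketch under assumptions: the function name `build_llcp_rlcp` is illustrative, midpoints are taken as (L + R) // 2, and the arrays are filled after the sort by recursion instead of being read off a tree maintained during the sort.

```python
def build_llcp_rlcp(hgt):
    """hgt[k] = lcp(A_Pos[k-1], A_Pos[k]) for k >= 1 (hgt[0] unused). Returns (Llcp, Rlcp)."""
    n = len(hgt)
    llcp = [0] * n
    rlcp = [0] * n

    def span(lo, hi):
        # Returns min(hgt[lo+1 .. hi]), which equals lcp(A_Pos[lo], A_Pos[hi]).
        if hi - lo == 1:
            return hgt[hi]              # leaf (hi-1, hi)
        mid = (lo + hi) // 2            # interior node (lo, hi) with midpoint mid
        llcp[mid] = span(lo, mid)
        rlcp[mid] = span(mid, hi)
        return min(llcp[mid], rlcp[mid])

    if n >= 2:
        span(0, n - 1)
    return llcp, rlcp


if __name__ == "__main__":
    # Hgt for A = assassin in sorted order; index 0 is a placeholder.
    llcp, rlcp = build_llcp_rlcp([0, 3, 0, 0, 0, 1, 1, 2])
    print(llcp)
    print(rlcp)
```

Every midpoint M in 1…N-2 is visited exactly once, so the whole fill takes O(N) time once Hgt[] is known; entries 0 and N-1 are never midpoints and stay at 0.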
Worst Case Complexity
                 Suffix Array               Suffix Tree
Build time       O(N log N)                 O(N)
Search time      O(P + log N)               O(P)
Structure space  O(N), 2N - 3N integers     O(N), big constant
Alphabet         Independent of |Σ|         Dependent on |Σ|
Expected Time Improvements
Improve the expected case time of:
  Search Algorithm
  Sort Algorithm
  LCP computation
Use the following assumption:
  All N-symbol strings are equally likely
Under this assumption:
  Expected length of the longest repeated substring of A is O(log_{|Σ|} N)
Expected Case Improvements - Main Idea
Let T = ⌊log_{|Σ|} N⌋
Let Int_T(u) = integer encoding in base |Σ| of the T-symbol prefix of u
Example: T = 3, Σ = {a, b}, u = abaa: Int_T(u) = 010 (base 2) = 2
There are |Σ|^T ≤ N possible T-symbol prefixes, so Int_T(u) is a number in [0, N-1]
Map each suffix A_p to Int_T(A_p)
  Can be done in O(N) time
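The O(N) mapping can be sketched as a right-to-left pass in which each code is derived from the code of the next suffix. The function name `int_t_all` is an assumption, and so is the treatment of suffixes shorter than T: they are padded with the smallest symbol (digit 0), which keeps the codes non-decreasing along the sorted array.

```python
def int_t_all(text, alphabet, t):
    """Return codes where codes[p] = Int_T(A_p), in base len(alphabet)."""
    digit = {c: d for d, c in enumerate(sorted(alphabet))}
    base = len(digit)
    top = base ** (t - 1)               # weight of the leading digit
    codes = [0] * len(text)
    code = 0                            # code of the empty suffix: all padding digits
    for p in range(len(text) - 1, -1, -1):
        # Prepend the digit of text[p] and drop the last digit of the previous code.
        code = digit[text[p]] * top + code // base
        codes[p] = code
    return codes


if __name__ == "__main__":
    print(int_t_all("abaa", "ab", 3))
```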
Expected Case Improvements - Search Algorithm
Use an additional array Buck[]
Think of the sorted array as buckets, based on the Int_T encoding
Buck[k] = min{ i | Int_T(A_{Pos[i]}) = k }
  The first position that contains a suffix that is mapped to k
Compute Buck[] at the end of the sort algorithm
  O(N) additional time
Expected Case Improvements - Search Algorithm
Given a word W, we need to find L_W and R_W
Let k = Int_T(W)
L_W and R_W must be in k's bucket (Buck[k], Buck[k+1])
We only need to search one bucket
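A sketch of Buck[] and of narrowing the search to one bucket follows. The names `int_t`, `build_buck` and `bucket_range` are assumptions, as are three conventions: short strings are padded with the smallest symbol, an empty bucket points at the start of the next one, and one extra entry Buck[|Σ|^T] = N closes the last bucket. The range is returned half-open, [Buck[k], Buck[k+1]).

```python
def int_t(u, digit, t):
    """Base-|alphabet| encoding of the first t symbols of u, padded with digit 0."""
    code = 0
    for d in range(t):
        code = code * len(digit) + (digit[u[d]] if d < len(u) else 0)
    return code


def build_buck(text, pos, alphabet, t):
    """Buck[k] = first array position whose suffix has code k; Buck[-1] = N."""
    digit = {c: d for d, c in enumerate(sorted(alphabet))}
    buckets = len(digit) ** t
    buck = [None] * (buckets + 1)
    buck[buckets] = len(pos)
    for i in range(len(pos) - 1, -1, -1):   # right to left, so the first position wins
        buck[int_t(text[pos[i]:], digit, t)] = i
    for k in range(buckets - 1, -1, -1):    # empty bucket: point at the next bucket
        if buck[k] is None:
            buck[k] = buck[k + 1]
    return buck, digit


def bucket_range(buck, digit, t, w):
    """Half-open range of array positions that can contain w (for len(w) >= t)."""
    k = int_t(w, digit, t)
    return buck[k], buck[k + 1]


if __name__ == "__main__":
    text = "aababa"
    pos = sorted(range(len(text)), key=lambda i: text[i:])  # naive Pos, for the demo only
    buck, digit = build_buck(text, pos, "ab", 2)
    print(buck, bucket_range(buck, digit, 2, "ab"))
```

The binary search for L_W and R_W then only has to look inside the returned range.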
Expected Case Improvements - Search Algorithm
Number of buckets = |Σ|^T ≤ N
Average number of elements in a bucket = O(1)
In the binary search for W:
  Expected size of the bucket to search = O(1)
  Expected number of search steps: O(1)
Expected case time: O(P)
Expected Case Improvements - Sort Algorithm
First stage of sort: sort according to first symbol
Replace the first stage with a sort according to Int_T
  Equivalent to a sort according to the first T symbols
  Can be done in O(N) time
We changed the base case of the sort from H = 1 to H = T
Expected Case Improvements - Sort Algorithm
Observation:
  Let C be the length of the longest repeated substring of A
  Sort is in fact complete once we have reached (C+1)-buckets
  Suppose some (C+1)-bucket contains more than one suffix
  Then we have two suffixes with lcp > C
  Their common prefix is a repeated substring longer than C: contradiction
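The observation can be checked numerically: C equals the largest lcp between neighbouring suffixes in the sorted array, and the number of doubling stages needed is the number of doublings until H ≥ C + 1. The function names below are assumptions, and the suffix array is built naively for the demonstration.

```python
def longest_repeat_length(text):
    """C: length of the longest substring that occurs at least twice in text."""
    pos = sorted(range(len(text)), key=lambda i: text[i:])
    best = 0
    for k in range(1, len(pos)):
        u, v = text[pos[k - 1]:], text[pos[k]:]
        m = 0
        while m < len(u) and m < len(v) and u[m] == v[m]:
            m += 1
        best = max(best, m)             # the maximum is attained by adjacent suffixes
    return best


def doubling_stages_needed(text):
    """Doubling stages after the first stage (H = 1) until H >= C + 1."""
    c = longest_repeat_length(text)
    h, stages = 1, 0
    while h < c + 1:
        h *= 2
        stages += 1
    return stages


if __name__ == "__main__":
    print(longest_repeat_length("assassin"), doubling_stages_needed("assassin"))
```

For A = assassin the longest repeated substring is ass, so C = 3 and the sort is complete at H = 4, which matches the example slides.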
Expected Case Improvements - Sort Algorithm
Expected case: C = O(log_{|Σ|} N) = O(T)
Number of stages: O(1)
Expected case time: O(N)
Expected Case Improvements - LCP Computation
Replace the interval tree with a sort history
  Binary tree that models the refinement of buckets during the sort
  A vertex for each H-bucket
  Each vertex holds the stage number at which its bucket was split
Expected Case Improvements - LCP Computation
Leaves correspond to suffixes and are arranged in an N-element array
Each vertex has at least two children
  O(N) nodes
Can be built with O(N) additional time during the sort
Expected Case Improvements - LCP Computation
Given the sort history we can compute lcp(A_p, A_q)
  Find the nca (nearest common ancestor) of A_p and A_q
  Let H be the nca's stage number
  lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H})
  Recursively compute lcp(A_{p+H}, A_{q+H})
  Stop when the nca is the root
Expected Case Improvements - LCP Computation
Each step is O(1)
At each step the stage number of the nca is at least halved
Suppose we stop the recursion when H < T' = (1/2) log_{|Σ|} N
Expected length of the longest repeated substring is O(T)
Expected case lcp is O(T) = O(log_{|Σ|} N)
Expected Case Improvements - LCP Computation
O(1) recursive steps in the expected case
Expected case time for one lcp: O(1)
Expected case time for computing Llcp[], Rlcp[]: O(N)
Expected Case Improvements - LCP Computation
We need a way to find lcp's that are known to be less than T'
Build a |Σ|^{T'} x |Σ|^{T'} array:
  Lookup[Int_{T'}(x), Int_{T'}(y)] = lcp(x, y) for all T'-symbol strings x, y
  At most N entries (|Σ|^{T'} = √N)
  Compute incrementally in O(N)
Final recursion steps are replaced by an O(1) lookup
Expected Time Complexity
Search time O(P)
Sort + LCP computation time O(N)