Post on 18-Dec-2015
Presented by Dr. Shazzad Hosain
Asst. Prof. EECS, NSU
Linear Time Construction of Suffix Tree
Suffix tree
S = xabxac
Suffixes: xabxac, abxac, bxac, xac, ac, c
[Figure: suffix tree of xabxac with leaves numbered 1 to 6]
Suffix tree
S = xabxa
Suffixes: xabxa, abxa, bxa, xa, a
[Figure: tree for xabxa, labels 1 to 5]
Suffix tree (Example)
Let s = abab. A suffix tree of s contains all the suffixes of abab$, that is, s followed by the terminal symbol $:
{ $ b$ ab$ bab$ abab$ }
[Figure: suffix tree of abab$]
Trivial algorithm to build a Suffix tree
s = abab$
Put the longest suffix, abab$, in.
Put the suffix bab$ in.
[Figure: tree with leaf edges abab$ and bab$]
Put the suffix ab$ in.
Inserted so far: { abab$ bab$ }
[Figure: tree before and after inserting ab$; edge abab$ is split into ab, followed by ab$ and $]
Put the suffix b$ in.
Inserted so far: { abab$ bab$ ab$ }
[Figure: tree before and after inserting b$; edge bab$ is split into b, followed by ab$ and $]
Put the suffix $ in.
Inserted so far: { abab$ bab$ ab$ b$ }
[Figure: tree before and after inserting $; a new leaf edge $ hangs off the root]
We will also label each leaf with the starting position of the corresponding suffix.
{ abab$ bab$ ab$ b$ $ }
[Figure: suffix tree of abab$ with leaves numbered 1 to 5]
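The insertion steps above can be sketched in code. This is a minimal illustration, not the slides' own implementation: the names `Node`, `build_naive_suffix_tree` and `leaves` are hypothetical, edge labels are stored as plain strings for readability, and the input is assumed to end in a terminal symbol that occurs nowhere else.

```python
class Node:
    def __init__(self):
        self.children = {}  # first character of edge label -> (label, child)
        self.leaf = None    # 1-based start of the suffix (leaves only)


def build_naive_suffix_tree(s):
    """Insert every suffix of s, longest first (quadratic time)."""
    assert s.count(s[-1]) == 1, "s must end in a unique terminal symbol"
    root = Node()
    for start in range(len(s)):
        node, rest = root, s[start:]
        while True:
            first = rest[0]
            if first not in node.children:
                # No edge starts with this character: hang a new leaf here.
                leaf = Node()
                leaf.leaf = start + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            # Length of the common prefix of the edge label and the rest.
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # Whole edge matched: continue from the child node.
                node, rest = child, rest[k:]
                continue
            # Mismatch inside the edge: split it and add the new leaf.
            mid = Node()
            node.children[first] = (label[:k], mid)
            mid.children[label[k]] = (label[k:], child)
            leaf = Node()
            leaf.leaf = start + 1
            mid.children[rest[k]] = (rest[k:], leaf)
            break
    return root


def leaves(node, path=""):
    """Map each leaf number to the string spelled from the root."""
    if node.leaf is not None:
        return {node.leaf: path}
    found = {}
    for label, child in node.children.values():
        found.update(leaves(child, path + label))
    return found


tree = build_naive_suffix_tree("abab$")
print(leaves(tree))
```

For abab$ the root ends up with three edges (ab, b and $), and each leaf number maps back to the suffix starting at that position. Each insertion may scan the whole suffix, which is where the quadratic bound comes from.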
Naive Construction – More Example
S = abbcbab#
[Figure: suffix tree of abbcbab#; edge labels ab, b, #, bcbab#, cbab#, ab#; leaves numbered 1 to 7]
Analysis
The naive algorithm takes O(n^2) time to build the tree.
We will see how to do it in O(n) time.
Ukkonen’s linear-time Suffix Tree Algorithm
• Implicit Suffix Tree
1. Remove the terminal symbol $ from the edge labels of the tree
2. Then remove any edge that has no label
3. Then remove any node that does not have at least two children
Implicit Suffix Tree – More Example
{ abab$ bab$ ab$ b$ $ }
[Figure: suffix tree of abab$ with leaves numbered 1 to 5]
1. Even though an implicit suffix tree may not have a leaf for each suffix, it does encode all the suffixes of S
2. Let I_i denote the implicit suffix tree of the string S[1…i]
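To see which suffixes lose their leaf in an implicit suffix tree, the following sketch works on the string alone rather than on a tree. It assumes the standard observation that a suffix keeps its leaf exactly when it is not a prefix of another suffix; the function name `leaves_in_implicit_tree` is made up for this illustration.

```python
def leaves_in_implicit_tree(s):
    """Return the 1-based start positions of the suffixes of s that keep
    a leaf in the implicit suffix tree (s is given WITHOUT a terminal)."""
    suffixes = [s[j:] for j in range(len(s))]
    kept = []
    for j, suffix in enumerate(suffixes):
        # A suffix loses its leaf when it is a prefix of another suffix.
        hidden = any(k != j and other.startswith(suffix)
                     for k, other in enumerate(suffixes))
        if not hidden:
            kept.append(j + 1)
    return kept


print(leaves_in_implicit_tree("xabxac"))  # every suffix keeps its leaf
print(leaves_in_implicit_tree("xabxa"))   # xa and a are hidden
print(leaves_in_implicit_tree("abab"))    # ab and b are hidden
```

The hidden suffixes are still spelled out along paths of the tree, which is the sense in which the implicit tree encodes all suffixes of S.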
Ukkonen’s Algorithm at a High Level
• Construct an implicit suffix tree I_i for each prefix S[1..i] of S, starting from I_1 and incrementing i by one until I_m is built, where m is the length of the string S.
• The true suffix tree for S is constructed from I_m, and the time for the entire algorithm is O(m).
High-level Description of Ukkonen’s Algorithm
• Ukkonen’s algorithm is divided into m phases. In phase i+1, tree I_{i+1} is constructed from I_i.
• Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1…i+1].
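The phase and extension structure can be listed directly from the string. The sketch below only enumerates which suffix each extension handles and builds no tree; `phases_and_extensions` is a hypothetical helper name.

```python
def phases_and_extensions(s):
    """Yield (phase number, suffixes handled by that phase's extensions)."""
    for i in range(1, len(s) + 1):
        prefix = s[:i]                       # S[1..i]
        yield i, [prefix[j:] for j in range(i)]


for phase, suffixes in phases_and_extensions("aba"):
    print(phase, suffixes)
```

Phase i has i extensions, so a string of length m needs m(m+1)/2 extensions in total.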
Naïve Algorithm of Suffix Tree
{ abab$ bab$ ab$ b$ $ }
[Figure: suffix tree of abab$ built by the naive algorithm, leaves numbered 1 to 5]
I_1 : S[1…1] {a}
I_2 : S[1…2] {ab, b}
I_3 : S[1…3] {aba, ba, a}
Each row is a phase; the suffixes within a row are its extensions.
[Figure: implicit suffix trees I_1, I_2 and I_3 for S = aba]
Implemented directly, this takes O(m^3) time.
Suffix Extension Rules
I_4 : S[1…4] {abab, bab, ab, b}
Let I_i already be built; we want to extend it to I_{i+1}. Let β = S[j … i] be a suffix of S[1 … i].
Rule 1: If path β ends at a leaf, character S(i+1) is added to the end of the label of that leaf edge.
Rule 2: Some path from the end of string β starts with character S(i+1). In this case the string β S(i+1) is already in the tree, so do nothing.
Rule 3: No path from the end of string β starts with character S(i+1), but at least one labeled path continues from the end of β. Add a new leaf edge labeled S(i+1), creating a new internal node if β ends inside an edge.
Example: let I_5 be drawn for axabx (S = axabxb, positions 1 to 6). Now extend to I_6:
axabxb, xabxb, abxb, bxb : Rule 1
xb : Rule 3
b : Rule 2
With these rules alone, the algorithm takes O(m^3) time.
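The choice of rule can be checked by scanning the string, without building any tree. The sketch below uses the rule numbering of these slides (Rule 2 does nothing, Rule 3 adds a leaf). It is not how Ukkonen's algorithm finds the rule, and `extension_rule` is a hypothetical name.

```python
def extension_rule(prefix, j, c):
    """Rule used in extension j (1-based) when prefix = S[1..i] is
    extended by the character c = S(i+1)."""
    beta = prefix[j - 1:]  # beta = S[j..i]; empty when j = i + 1
    # beta ends at a leaf exactly when no occurrence of beta in prefix
    # is followed by another character.
    continues = any(
        prefix.startswith(beta, k) and k + len(beta) < len(prefix)
        for k in range(len(prefix))
    )
    if not continues:
        return 1           # extend the leaf edge
    if beta + c in prefix:
        return 2           # beta followed by c is already in the tree
    return 3               # add a new leaf edge labeled c


print([extension_rule("axabx", j, "b") for j in range(1, 7)])
print([extension_rule("aba", j, "b") for j in range(1, 5)])
```

For axabx extended by b, the first four extensions use Rule 1, the extension for xb uses Rule 3, and the extension for b uses Rule 2, matching the example above.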
Implementation and Speedup, Suffix Links
Definition: Let xα denote an arbitrary string, where x is a single character and α is a substring (possibly empty). For an internal node v with path-label xα, if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link.
Does the root have a suffix link? No, because it is not an internal node.
Every internal node has a suffix link.
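The definition can be tried out on path-labels alone. The sketch below assumes that, for a string ending in a unique terminal, the internal nodes of the suffix tree are exactly the substrings followed by at least two different characters. It is a brute-force illustration with hypothetical names, not the way the algorithm maintains suffix links.

```python
def internal_path_labels(s):
    """Path-labels of internal nodes: substrings of s that are followed
    by at least two distinct characters (s ends in a unique terminal)."""
    followers = {}
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            followers.setdefault(s[i:j], set()).add(s[j])
    return {w for w, chars in followers.items() if len(chars) >= 2}


def suffix_links(s):
    """Map each internal path-label x+alpha to alpha ('root' if empty)."""
    labels = internal_path_labels(s)
    links = {}
    for label in labels:
        target = label[1:]
        # Every internal node has a suffix link: alpha must be a node too.
        assert target == "" or target in labels
        links[label] = target or "root"
    return links


print(suffix_links("abab$"))
print(suffix_links("abbcbab#"))
```

Both example strings have internal nodes ab and b, with a suffix link from ab to b and one from b to the root.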
Suffix Links – More Example
S = abbcbab#
[Figure: suffix tree of abbcbab# (edge labels ab, b, #, bcbab#, cbab#, ab#; leaves numbered 1 to 7) with a suffix link drawn from node v to node s(v)]
Corollary 6.1.1: In Ukkonen’s algorithm, any newly created internal node will have a suffix link from it by the end of the next extension.
Corollary 6.1.2: In any implicit suffix tree I_i, if internal node v has path-label xα, then there is a node s(v) of I_i with path-label α.
Example: MISSISSIPI (positions 1 to 10)
I_1 : M
I_2 : MI
I_3 : MIS
I_4 : MISS
I_5 : MISSI
I_6 : MISSIS
I_7 : MISSISS
I_8 : MISSISSI
I_9 : MISSISSIP
I_10 : MISSISSIPI
[Figure: implicit suffix trees built phase by phase for MISSISSIPI]
How do suffix links help?
What is achieved so far?
Not so much. The worst-case running time is still O(m^2) for a phase.
Trick 1: Skip/Count Trick
There must be a path labeled γ from s(v).
Walking down along γ takes time proportional to |γ|
Skip/count trick reduces the traversal time to something proportional to the number of nodes on the path.
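A minimal sketch of the skip/count walk follows. It assumes each node stores, per first character, the length of the outgoing edge and the child it leads to. The class and function names are hypothetical, and splitting γ = zabcdefghy into edges of lengths 2, 2, 3 and 3 in that order is an assumption made for this example.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.children = {}  # first character -> (edge length, child node)


def skip_count(node, gamma):
    """Walk down the path labeled gamma, which is assumed to exist.
    Returns (last node reached, characters used on the next edge, hops)."""
    pos, hops = 0, 0
    while pos < len(gamma):
        length, child = node.children[gamma[pos]]  # one character lookup
        remaining = len(gamma) - pos
        if remaining < length:
            return node, remaining, hops           # gamma ends inside this edge
        pos += length                              # skip the whole edge
        node = child
        hops += 1
    return node, 0, hops


# Chain of edges za, bc, def, ghy (lengths 2, 2, 3, 3).
nodes = [Node(f"n{i}") for i in range(5)]
for i, label in enumerate(["za", "bc", "def", "ghy"]):
    nodes[i].children[label[0]] = (len(label), nodes[i + 1])

end, offset, hops = skip_count(nodes[0], "zabcdefghy")
print(end.name, offset, hops)
```

Here the walk inspects one character per edge, four lookups in total, instead of comparing all ten characters of γ.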
[Figure: γ = zabcdefghy spread over edges of lengths 2, 2, 3 and 3; the walk jumps from node to node]
But what does it buy in terms of worst-case bounds?
Lemma 6.1.2: Let (v, s(v)) be any suffix link traversed during Ukkonen’s algorithm. At that moment, the node-depth of v is at most one greater than the node-depth of s(v).
[Figure: example pairs v = 2, s(v) = 1; v = 3, s(v) = 3; v = 4, s(v) = 5]
Theorem 6.1.1: Using the skip/count trick, any phase of Ukkonen’s algorithm takes O(m) time.
In a single extension, the algorithm:
– Walks up at most one edge
– Finds a suffix link and traverses it
– Walks down some number of nodes
– Applies the suffix extension rules
– And may add a suffix link
All operations except the down-walk take constant time, so only the down-walk time needs to be analyzed.
– The up-walk decreases the current node-depth by at most one
– The suffix link traversal decreases it by at most another one (Lemma 6.1.2)
– Each edge of the down-walk moves to a node of greater node-depth
– Over the entire phase, the current node-depth is decremented at most 2m times
– Since no node can have depth greater than m, the total possible increment to the current node-depth is bounded by 3m over the entire phase
– So the total number of edge traversals is bounded by 3m
– Since each edge traversal takes constant time, all the down-walking in a phase is O(m).
Complexity
• There are m phases
• Each phase takes O(m)
• So the running time is O(m^2)
Two more tricks and we are done
Reference
• Dan Gusfield, Algorithms on Strings, Trees and Sequences, Chapter 6