Linear Time Construction of Suffix Tree

Presented by Dr. Shazzad Hosain, Asst. Prof., EECS, NSU

Post on 18-Dec-2015



Suffix tree

S = xabxac

Suffixes: xabxac (1), abxac (2), bxac (3), xac (4), ac (5), c (6)

[Figure: suffix tree for xabxac with leaves numbered 1 to 6]

Suffix tree

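S = xabxa

Suffixes: xabxa (1), abxa (2), bxa (3), xa (4), a (5)

[Figure: suffix tree for xabxa, with edge labels xa, bxa and a]

Here the suffixes xa and a are prefixes of other suffixes, so they do not end at leaves. Appending a terminal symbol such as $ avoids this.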

Suffix tree (Example)

Let s = abab. A suffix tree of s contains all the suffixes of s$ = abab$:

{ $, b$, ab$, bab$, abab$ }

[Figure: suffix tree for abab$]

Trivial algorithm to build a Suffix tree

s = abab$

Put the longest suffix, abab$, in.

Put the suffix bab$ in.

[Figure: root with two leaf edges, abab$ and bab$]

Put the suffix ab$ in. Already in the tree: { abab$, bab$ }

[Figure: the edge abab$ is split after ab; the new internal node has leaf edges ab$ and $]

Put the suffix b$ in. Already in the tree: { abab$, bab$, ab$ }

[Figure: the edge bab$ is split after b; the new internal node has leaf edges ab$ and $]

Put the suffix $ in. Already in the tree: { abab$, bab$, ab$, b$ }

[Figure: a new leaf edge $ is added at the root]

We will also label each leaf with the starting point of the corresponding suffix.

[Figure: the finished suffix tree for abab$ with leaves labeled 1 (abab$), 2 (bab$), 3 (ab$), 4 (b$) and 5 ($)]
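The insertion steps above can be sketched in Python. This is only an illustration of the trivial algorithm, not code from the slides: the names `Node`, `build_naive` and `leaves` are made up here, edge labels are stored as plain strings for readability, and the input is assumed to end in a terminal symbol that occurs nowhere else.

```python
class Node:
    def __init__(self):
        # first character of an edge label -> (edge label, child node)
        self.children = {}
        # 1-based start position of the suffix if this node is a leaf
        self.leaf = None


def build_naive(s):
    """Insert the suffixes of s one by one, longest first.

    Assumes s ends in a unique terminal symbol such as '$'.
    """
    root = Node()
    for start in range(len(s)):
        node, rest = root, s[start:]
        while True:
            first = rest[0]
            if first not in node.children:
                # No edge starts with this character: hang a new leaf here.
                leaf = Node()
                leaf.leaf = start + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            # Length of the common prefix of the edge label and the rest.
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # Whole edge matched: continue below the child.
                node, rest = child, rest[k:]
                continue
            # Mismatch inside the edge: split it at offset k.
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            node.children[first] = (label[:k], mid)
            node, rest = mid, rest[k:]
    return root


def leaves(node):
    """Collect the leaf labels below node in sorted order."""
    if node.leaf is not None:
        return [node.leaf]
    found = []
    for _, child in node.children.values():
        found.extend(leaves(child))
    return sorted(found)
```

For `build_naive("abab$")` the root ends up with the three edges ab, b and $, and the leaves carry the labels 1 to 5, as in the figure. Each insertion may compare up to n characters, which is where the quadratic running time comes from.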

Naive Construction – More Example

[Figure: suffix tree for abbcbab# built by the naive algorithm, with each leaf numbered by the start position of its suffix]

Analysis

Takes O(n^2) time to build.

We will see how to do it in O(n) time.

Ukkonen’s linear-time Suffix Tree Algorithm

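• Implicit Suffix Tree: obtained from the suffix tree for S$ as follows

1. Remove every copy of the terminal symbol $ from the edge labels of the tree

2. Then remove any edge that has no label

3. Then remove any node that does not have at least two children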

Implicit Suffix Tree – More Example

[Figure: suffix tree for abab$ with leaves 1 to 5, used to derive the implicit suffix tree; suffixes { abab$, bab$, ab$, b$, $ }]

1. Even though an implicit suffix tree may not have a leaf for each suffix, it does encode all the suffixes of S

2. Let I_i denote the implicit suffix tree of the string S[1…i]
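Which suffixes keep a leaf of their own in an implicit suffix tree can be checked without building the tree. The sketch below relies on one observation: a suffix ends at a leaf exactly when it does not occur anywhere else in the string. The function name `suffixes_with_leaf` is hypothetical.

```python
def suffixes_with_leaf(s):
    """Return the 1-based start positions of the suffixes of s that end
    at a leaf in the implicit suffix tree of s (no terminal symbol)."""
    positions = []
    for j in range(len(s)):
        suffix = s[j:]
        # The suffix itself is the last possible occurrence, so any other
        # occurrence would start earlier and be found first.
        if s.find(suffix) == j:
            positions.append(j + 1)
    return positions
```

For xabxa this gives positions 1, 2 and 3: the suffixes xa and a are still spelled out on paths from the root, but they end inside the tree rather than at a leaf. For xabxac every suffix has its own leaf.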

Ukkonen’s Algorithm at a High Level

• Construct an implicit suffix tree I_i for each prefix S[1..i] of S, starting from I_1 and incrementing i by one until I_m is built, where m is the length of the string S.

• The true suffix tree for S is constructed from I_m, and the time for the entire algorithm is O(m)

High-level Description of Ukkonen’s Algorithm

• Ukkonen’s algorithm is divided into m phases. In phase i+1, tree I_{i+1} is constructed from I_i

• Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1…i+1].

Naïve Algorithm of Suffix Tree

[Figure: suffix tree for abab$ built by the naive algorithm from { abab$, bab$, ab$, b$, $ }, with leaves labeled 1 to 5]

[Figure: implicit suffix trees built phase by phase, one extension per suffix]

I_1 : S[1…1] {a}

I_2 : S[1…2] {ab, b}

I_3 : S[1…3] {aba, ba, a}

With m phases, up to m extensions per phase, and up to O(m) time to find the end of each suffix by walking from the root, this direct implementation takes O(m^3) time.
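The phase and extension structure can be enumerated directly. The sketch below is illustrative only, and the names `phases` and `naive_cost` are made up: it lists the suffixes handled in each phase and adds up the characters a walk from the root would touch if every suffix were located naively.

```python
def phases(s):
    """Yield (i, extensions) for each phase i, where extensions are the
    suffixes of the prefix s[:i], longest first."""
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        yield i, [prefix[j:] for j in range(i)]


def naive_cost(s):
    """Total number of characters walked when every extension is found
    by matching its suffix from the root."""
    return sum(len(ext) for _, exts in phases(s) for ext in exts)
```

For S = aba the generator yields {a}, {ab, b} and {aba, ba, a}, matching the listing above. The cost works out to m(m+1)(m+2)/6, which grows as m^3.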

Suffix Extension Rules

Suppose I_i is already there and we want to extend it to I_{i+1}. Let β = S[j…i] be a suffix of S[1…i].

Rule 1: If the path β ends at a leaf, character S(i+1) is added to the end of the label of that leaf edge.

Rule 2: Some path from the end of string β starts with character S(i+1). In this case the string βS(i+1) is already in the tree, so do nothing.

Rule 3: No path from the end of string β starts with character S(i+1), but at least one labeled path continues from the end of β. Add a new leaf edge labeled S(i+1) at the end of β; if β ends inside an edge, a new internal node is created there first.

I_4 : S[1…4] {abab, bab, ab, b}

Example: let I_5 be drawn for the first five characters of axabxb (positions 123456). Now extend to I_6:

axabxb, xabxb, abxb, bxb : Rule 1

xb : Rule 3

b : Rule 2

If the end of each β is found by walking from the root, the whole algorithm takes O(m^3) time.
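The three rules can be tried out on plain strings, without a tree. The sketch below uses the rule numbering of these slides, and the function name `extension_rules` is hypothetical. It relies on two facts: β ends at a leaf exactly when β occurs only once in S[1…i], and βS(i+1) is already in the tree exactly when it is a substring of S[1…i].

```python
def extension_rules(prefix, c):
    """Classify every extension of phase i+1.

    prefix is S[1..i] and c is S(i+1). Returns a list of
    (extended suffix, rule number) pairs, longest suffix first.
    """
    result = []
    for j in range(len(prefix) + 1):
        beta = prefix[j:]  # empty for the last extension
        if beta and prefix.find(beta) == j:
            rule = 1  # beta occurs only as a suffix, so it ends at a leaf
        elif beta + c in prefix:
            rule = 2  # beta + c is already in the tree: do nothing
        else:
            rule = 3  # a new leaf edge labeled c is needed
        result.append((beta + c, rule))
    return result
```

Calling `extension_rules("axabx", "b")` reproduces the example above: Rule 1 for axabxb, xabxb, abxb and bxb, Rule 3 for xb, and Rule 2 for b. This check is far too slow for real use; it only shows which rule applies where.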

Implementation and Speedup, Suffix Links

Definition: Let xα denote an arbitrary string, where x is a single character and α a substring (possibly empty). For an internal node v with path-label xα, if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link.

Does the root have a suffix link? No, because it is not considered an internal node. Every internal node has a suffix link.

Suffix Links – More Example

[Figure: suffix tree for abbcbab# with a suffix link drawn from an internal node v to s(v)]

Corollary 6.1.1: In Ukkonen’s algorithm, any newly created internal node will have a suffix link from it by the end of the next extension.

Corollary 6.1.2: In any implicit suffix tree I_i, if internal node v has path-label xα, then there is a node s(v) of I_i with path-label α.
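The definition of a suffix link can be checked by brute force on a small string. The sketch below does not build a tree. It assumes the input ends in a unique terminal symbol, in which case the internal nodes are exactly the substrings that are followed by at least two different characters. The names `internal_labels` and `suffix_links` are made up, and the root appears as the empty path-label.

```python
def internal_labels(s):
    """Path-labels of the root and all internal nodes of the suffix tree
    of s, where s ends in a unique terminal symbol."""
    followers = {}
    n = len(s)
    for i in range(n):
        for j in range(i, n):
            # substring s[i:j] is followed by the character s[j]
            followers.setdefault(s[i:j], set()).add(s[j])
    return {w for w, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(s):
    """Map the path-label x+alpha of each internal node to alpha."""
    labels = internal_labels(s)
    links = {}
    for w in labels:
        if w:  # the root (empty label) has no suffix link
            target = w[1:]
            assert target in labels  # s(v) exists, as the corollary states
            links[w] = target
    return links
```

For abbcbab# the internal nodes have path-labels ab and b. The link from ab goes to b, and the link from b goes to the root, since α is empty there.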

Example: S = MISSISSIPI, positions 1234567890

1 : M

2 : MI

3 : MIS

4 : MISS

5 : MISSI

6 : MISSIS

7 : MISSISS

8 : MISSISSI

9 : MISSISSIP

10 : MISSISSIPI

[Figure: implicit suffix trees for the prefixes of MISSISSIPI, built phase by phase]


How do suffix links help?

What is achieved so far?

Not so much. The worst-case running time is still O(m^2) for a phase.

Trick 1: Skip/Count Trick

There must be a path labeled γ from s(v), where γ is the string between v and the end of the previous suffix.

Walking down along γ character by character takes time proportional to |γ|.

The skip/count trick reduces the traversal time to something proportional to the number of nodes on the path.
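A minimal sketch of the skip/count walk follows. It is an illustration under assumptions, not the slides' implementation: `Node`, `skip_count` and `chain` are made-up names, edge labels are kept as strings, and the string gamma is assumed to be present below the start node. Because gamma is known to be there, only the first character of each edge is looked at and the rest of the edge is skipped by its length.

```python
class Node:
    def __init__(self, name):
        self.name = name
        # first character of an edge label -> (edge label, child node)
        self.edges = {}


def skip_count(node, gamma):
    """Walk down the path gamma below node.

    Returns (last node reached, edge the walk ends inside or None,
    offset into that edge, number of edges examined).
    """
    pos = 0
    examined = 0
    while pos < len(gamma):
        label, child = node.edges[gamma[pos]]  # pick the edge by one character
        examined += 1
        remaining = len(gamma) - pos
        if remaining < len(label):
            return node, label, remaining, examined  # gamma ends inside this edge
        pos += len(label)  # skip the whole edge without comparing it
        node = child
    return node, None, 0, examined


def chain(labels):
    """Build a single path of edges with the given labels (test helper)."""
    root = Node("s(v)")
    node = root
    for i, label in enumerate(labels, start=1):
        child = Node("n%d" % i)
        node.edges[label[0]] = (label, child)
        node = child
    return root
```

With the path zabcdefghy split over edges za, bc, def and ghy (an assumed split with lengths 2, 2, 3 and 3), `skip_count(chain(["za", "bc", "def", "ghy"]), "zabcdefghy")` examines 4 edges instead of 10 characters. A real implementation would store each label as a pair of indices into S, so that the edge length is available in constant time.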

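[Figure: the path γ = zabcdefghy runs over four edges of lengths 2, 2, 3 and 3; skip/count passes 4 nodes instead of comparing 10 characters]

But what does it buy in terms of worst-case bounds?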

Lemma 6.1.2: Let (v, s(v)) be any suffix link traversed during Ukkonen’s algorithm. At that moment, the node-depth of v is at most one greater than the node-depth of s(v).

[Figure: node-depths annotated on an example tree: v = 2, s(v) = 1; v = 3, s(v) = 3; v = 4, s(v) = 5]

Theorem 6.1.1: Using the skip/count trick, any phase of Ukkonen’s algorithm takes O(m) time.

In a single extension, the algorithm

– walks up at most one edge

– finds a suffix link and traverses it

– walks down some number of nodes

– applies the suffix extension rules

– and may add a suffix link

All operations except the down-walk take constant time, so only the down-walk time needs to be analyzed.

– The up-walk decreases the current node-depth by at most one

– Traversing the suffix link decreases it by at most another one (Lemma 6.1.2)

– Each step of the down-walk moves to a node of greater node-depth

– Over the entire phase, the current node-depth is decremented at most 2m times

– Since no node can have node-depth greater than m, the total possible increment to the current node-depth is bounded by 3m over the entire phase

– So the total number of edge traversals during down-walks is bounded by 3m

– Since each edge traversal takes constant time with skip/count, all the down-walking in a phase is O(m)
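The counting argument can be written as a short derivation. The symbols are not from the slides and are introduced only for this sketch: d_0 and d_end are the current node-depth at the start and end of the phase, U is the number of down-walk steps (each raises the node-depth by one), and D is the number of decrements caused by up-walks and suffix links.

```latex
\begin{align*}
d_{\mathrm{end}} &= d_0 + U - D \\
U &= d_{\mathrm{end}} - d_0 + D \\
  &\le m - 0 + 2m = 3m
\end{align*}
```

The last step uses d_end ≤ m, d_0 ≥ 0 and D ≤ 2m, the latter because each of the at most m extensions in a phase contributes at most two decrements.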

Complexity

• There are m phases

• Each phase takes O(m)

• So the running time is O(m^2)

Two more tricks and we are done

Reference

• Chapter 6: Algorithms on Strings, Trees and Sequences
