Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Linear Time Construction of Suffix Tree

Upload: abigayle-west

Post on 18-Dec-2015


Page 1

Presented By Dr. Shazzad Hosain

Asst. Prof. EECS, NSU

Linear Time Construction of Suffix Tree

Page 2

Suffix tree

S = xabxac
Suffixes: xabxac, abxac, bxac, xac, ac, c

[Figure: suffix tree for S = xabxac, with leaves numbered 1 to 6]

Page 3

Suffix tree

S = xabxa
Suffixes: xabxa, abxa, bxa, xa, a

[Figure: tree for S = xabxa, with leaf numbers 1 to 5]

Page 4

Suffix tree (Example)

Let s = abab. A suffix tree of s contains all the suffixes of s$ = abab$:

{ $, b$, ab$, bab$, abab$ }

[Figure: suffix tree for abab$ with edge labels ab, ab$, b and $]

Page 5

Trivial algorithm to build a Suffix tree

s = abab$

Put the longest suffix, abab$, in.

Put the suffix bab$ in.

[Figure: the tree after each of the two insertions]

Page 6

Put the suffix ab$ in.

Suffixes already in the tree: { abab$, bab$ }

[Figure: the tree before and after inserting ab$]

Page 7

Put the suffix b$ in.

Suffixes already in the tree: { abab$, bab$, ab$ }

[Figure: the tree before and after inserting b$]

Page 8

Put the suffix $ in.

Suffixes already in the tree: { abab$, bab$, ab$, b$ }

[Figure: the tree before and after inserting $]

Page 9

We will also label each leaf with the starting point of the corresponding suffix.

[Figure: the finished suffix tree for abab$, before and after its leaves are labeled 1 to 5]

Suffixes in the tree:

{ abab$, bab$, ab$, b$, $ }
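The insertion steps on pages 5 to 9 can be sketched in Python as follows. This is only a minimal illustration, not the presenter's code: the `Node` class, `build_naive_suffix_tree` and `leaves` are names invented here, edge labels are stored as plain strings, and the text is assumed to end with a terminal symbol that occurs nowhere else.

```python
class Node:
    """A suffix-tree node; `leaf` holds the 1-based start of a suffix, or None."""

    def __init__(self, leaf=None):
        self.children = {}  # first character of edge label -> (edge label, child)
        self.leaf = leaf


def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def build_naive_suffix_tree(text):
    """Insert the suffixes of `text` longest first, splitting edges as needed.

    Assumption: the last character of `text` is a terminal symbol that
    occurs nowhere else, so no suffix is a prefix of another suffix.
    """
    if not text or text[-1] in text[:-1]:
        raise ValueError("text must end with a unique terminal symbol")
    root = Node()
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            first = rest[0]
            if first not in node.children:
                # No edge starts with this character: hang a new leaf here.
                node.children[first] = (rest, Node(leaf=start + 1))
                break
            label, child = node.children[first]
            k = common_prefix_len(label, rest)
            if k == len(label):
                # The whole edge matches: continue from the child node.
                node, rest = child, rest[k:]
                continue
            # Mismatch inside the edge: split it at offset k.
            middle = Node()
            middle.children[label[k]] = (label[k:], child)
            middle.children[rest[k]] = (rest[k:], Node(leaf=start + 1))
            node.children[first] = (label[:k], middle)
            break
    return root


def leaves(node, path=""):
    """Return {path label: leaf number} for every leaf below `node`."""
    if node.leaf is not None:
        return {path: node.leaf}
    found = {}
    for label, child in node.children.values():
        found.update(leaves(child, path + label))
    return found


tree = build_naive_suffix_tree("abab$")
print(leaves(tree))
```

For abab$ the root ends up with the three edges ab, b and $, and every leaf's path label spells out the suffix whose starting position it carries.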

Page 10

Naive Construction – More Example

S = abbcbab#

[Figure: suffix tree for abbcbab#, with edge labels ab, b, #, bcbab#, cbab# and ab#, and leaves numbered 1 to 7]

Page 11

Analysis

The naive construction takes O(n^2) time, where n is the length of the string.

We will see how to do it in O(n) time.
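To see where the quadratic cost comes from, the sketch below counts how many characters the naive method matches while walking down from the root for each suffix. It works on the string alone and builds no tree. The function names are invented for this illustration, and the string a^n$ is chosen here as a bad case rather than taken from the slides.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def naive_matched_characters(text):
    """Total characters matched while inserting the suffixes longest first.

    Inserting a suffix walks down from the root for as long as it agrees
    with some suffix already in the tree, so the walk length equals its
    longest common prefix with any earlier (longer) suffix.
    """
    total = 0
    for j in range(1, len(text)):
        total += max(common_prefix_len(text[j:], text[i:]) for i in range(j))
    return total


for n in (4, 8, 16):
    print(n, naive_matched_characters("a" * n + "$"))
```

On a^n$ the total is n(n-1)/2, so doubling n roughly quadruples the work. A string with no repeated characters, such as abcd$, needs no matching at all.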

Page 12

Ukkonen’s linear-time Suffix Tree Algorithm

• Implicit Suffix Tree: obtained from the suffix tree for S$ in three steps

1. Remove the terminal symbol $ from the edge labels of the tree
2. Then remove any edge that has no label
3. Then remove any node that does not have at least two children

Page 13

Implicit Suffix Tree – More Example

Suffixes of abab$: { abab$, bab$, ab$, b$, $ }

[Figure: suffix tree for abab$ with leaves labeled 1 to 5, from which the implicit suffix tree for abab is obtained]

1. Even though an implicit suffix tree may not have a leaf for each suffix, it does encode all the suffixes of S

2. Let I_i denote the implicit suffix tree of the string S[1…i]
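Which suffixes keep a leaf in the implicit suffix tree can be checked directly on the string. The sketch below relies on an observation that is not stated on the slide: a leaf survives the removal of $ exactly when its suffix occurs only once in S, because otherwise its leaf edge was labeled $ alone. The function name is invented here.

```python
def suffixes_with_leaves(s):
    """Return the 1-based starts j whose suffix s[j..m] ends at a leaf
    of the implicit suffix tree of s (s itself carries no terminal symbol)."""
    kept = []
    for j in range(len(s)):
        suffix = s[j:]
        # A suffix runs to the end of s, so any other occurrence starts
        # earlier. If the first occurrence is at j, it is the only one.
        if s.find(suffix) == j:
            kept.append(j + 1)
    return kept


for s in ("abab", "xabxa", "xabxac"):
    print(s, suffixes_with_leaves(s))
```

For abab only suffixes 1 and 2 keep a leaf, while ab and b end inside the tree. For xabxac every suffix has its own leaf, so there the implicit tree has one leaf per suffix.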

Page 14

Ukkonen’s Algorithm at a High Level

• Construct an implicit suffix tree I_i for each prefix S[1..i] of S, starting from I_1 and incrementing i by one until I_m is built, where m is the length of the string S.

• The true suffix tree for S is constructed from I_m, and the time for the entire algorithm is O(m).

Page 15

High-level Description of Ukkonen’s Algorithm

• Ukkonen’s algorithm is divided into m phases. In phase i+1, tree I_{i+1} is constructed from I_i.

• Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1…i+1].
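The phase and extension structure can be listed with a few lines of Python. This sketch only enumerates the suffixes that each phase must make sure are in the tree and does not build anything; the function name is invented for the illustration.

```python
def phases_and_extensions(s):
    """Yield (phase number, suffixes handled by that phase's extensions).

    The phase numbered i works on the prefix S[1..i] and has one
    extension per suffix of that prefix, longest suffix first.
    """
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        yield i, [prefix[j:] for j in range(i)]


for phase, extensions in phases_and_extensions("aba"):
    print(f"I_{phase}: S[1..{phase}] {extensions}")
```

A string of length m has m phases and 1 + 2 + … + m = m(m+1)/2 extensions in total, so a linear-time algorithm cannot afford to perform every extension explicitly.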

Page 16

Naïve Algorithm of Suffix Tree

Suffixes of abab$: { abab$, bab$, ab$, b$, $ }

[Figure: suffix tree for abab$ with leaves labeled 1 to 5]

Page 17

High-level of Ukkonen’s Algorithm

• Ukkonen’s algorithm is divided into m phases. In phase i+1, tree I_{i+1} is constructed from I_i.

• Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1…i+1].

Phases are listed one per line, with the extensions of each phase in braces:

I_1 : S[1…1] {a}
I_2 : S[1…2] {ab, b}
I_3 : S[1…3] {aba, ba, a}

[Figure: the implicit suffix trees I_1, I_2 and I_3]

Page 18

I_1 : S[1…1] {a}
I_2 : S[1…2] {ab, b}
I_3 : S[1…3] {aba, ba, a}

[Figure: the implicit suffix trees I_1, I_2 and I_3, with the extensions of each phase]

Carried out this way, the construction takes O(m^3) time.

Page 19

Suffix Extension Rules

Let I_i already be there; we want to extend it to I_{i+1}.

I_1 : S[1…1] {a}
I_2 : S[1…2] {ab, b}
I_3 : S[1…3] {aba, ba, a}
I_4 : S[1…4] {abab, bab, ab, b}

[Figure: the implicit suffix trees I_1 to I_4]

Rule1: Let β = S[j…i] be a suffix of S[1…i]. If path β ends at a leaf, character S(i+1) is added to the end of the label of that leaf edge.

Rule2: Some path from the end of string β starts with character S(i+1). In this case the string βS(i+1) is already in the tree, so do nothing.

Page 20

Suffix Extension Rules

Let I_i already be there; we want to extend it to I_{i+1}.

Rule3: No path from the end of string β starts with character S(i+1), but at least one labeled path continues from the end of β. Add a new leaf edge labeled S(i+1) at the end of β, creating a new internal node there if β ends inside an edge.

Example: let I_5 be drawn for axabx, the first five characters of axabxb (positions 123456). Now extend to I_6:

axabxb, xabxb, abxb, bxb : RULE1
xb : RULE3
b : RULE2

Applying the rules naively in every extension takes O(m^3) time overall.
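The rule chosen in each extension can be reproduced without drawing a tree, by asking the equivalent questions about the string. The sketch below is a shortcut for checking small examples and is not how Ukkonen's algorithm is implemented. It keeps the rule numbering used on these slides, and `classify_extensions` is a name invented here.

```python
def classify_extensions(s, i):
    """Classify extensions j = 1..i+1 of phase i+1 (1-based, as on the slides).

    Rule numbering follows the slides: 1 = extend a leaf edge,
    2 = already in the tree (do nothing), 3 = add a new leaf edge.
    """
    prefix = s[:i]   # S[1..i], the string whose implicit tree I_i exists
    c = s[i]         # S(i+1), the character being appended
    result = []
    for j in range(1, i + 2):
        beta = prefix[j - 1:]   # S[j..i]; the empty string when j = i+1
        if beta and prefix.find(beta) == len(prefix) - len(beta):
            rule = 1   # beta occurs only as a suffix, so its path ends at a leaf
        elif beta + c in prefix:
            rule = 2   # beta followed by c already occurs in S[1..i]
        else:
            rule = 3   # beta is in the tree but is never followed by c
        result.append((beta + c, rule))
    return result


for suffix, rule in classify_extensions("axabxb", 5):
    print(f"{suffix:8} rule {rule}")
```

For axabxb and i = 5 this reports rule 1 for axabxb, xabxb, abxb and bxb, rule 3 for xb and rule 2 for b, which agrees with the example on this slide.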

Page 21

Implementation and Speedup, Suffix Links

Definition: Let xα denote an arbitrary string, where x is a single character and α a substring (possibly empty). For an internal node v with path-label xα, if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link.

Does the root have a suffix link? No, because it is not counted as an internal node.

Every internal node has a suffix link.

Page 22

Suffix Links – More Example

S = abbcbab#

[Figure: suffix tree for abbcbab#, with edge labels ab, b, #, bcbab#, cbab# and ab#, and leaves numbered 1 to 7; a suffix link points from node v to node s(v)]

Corollary 6.1.1: In Ukkonen’s algorithm, any newly created internal node will have a suffix link from it by the end of the next extension.

Corollary 6.1.2: In any implicit suffix tree I_i, if internal node v has path-label xα, then there is a node s(v) of I_i with path-label α.
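The suffix links of a small example can be listed by brute force from the definition. The sketch below assumes a string that ends with a unique terminal symbol. It finds internal nodes through the fact that a non-empty substring labels an internal node exactly when two or more different characters follow it somewhere in the string. The function names are invented here, and the empty string stands for the root.

```python
def internal_node_labels(text):
    """Path-labels of the internal (non-root) nodes of the suffix tree of text.

    A non-empty substring labels an internal node exactly when it is
    followed by at least two different characters somewhere in text.
    """
    followers = {}
    for start in range(len(text)):
        for end in range(start + 1, len(text)):
            followers.setdefault(text[start:end], set()).add(text[end])
    return {w for w, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(text):
    """Map each internal node's path-label x+alpha to alpha ('' is the root)."""
    nodes = internal_node_labels(text)
    links = {}
    for label in nodes:
        target = label[1:]
        # The target of every suffix link must itself be a node (or the root).
        assert target == "" or target in nodes
        links[label] = target
    return links


print(suffix_links("abbcbab#"))
```

For abbcbab# the only internal nodes are ab and b, so the link from ab leads to b and the link from b leads to the root.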

Page 23

S = MISSISSIPI (positions 1234567890)

I_1 : M
I_2 : MI
I_3 : MIS
I_4 : MISS
I_5 : MISSI
I_6 : MISSIS
I_7 : MISSISS
I_8 : MISSISSI
I_9 : MISSISSIP
I_10 : MISSISSIPI

[Figure: the implicit suffix trees built phase by phase for MISSISSIPI]

Corollary 6.1.1: In Ukkonen’s algorithm, any newly created internal node will have a suffix link from it by the end of the next extension.

Page 24

[Figure: the MISSISSIPI example and Corollary 6.1.1, repeated from the previous slide]

How do suffix links help?

Page 25

What is achieved so far?

Not so much. The worst-case running time is still O(m^2) for a phase.

Page 26

Trick1: Skip/Count Trick

There must be a path labeled γ from s(v).

Page 27

Trick1: Skip/Count Trick

There must be a path labeled γ from s(v).

Walking down along γ character by character takes time proportional to |γ|.

The skip/count trick reduces the traversal time to something proportional to the number of nodes on the path.

[Figure: γ = zabcdefghy spelled out along a path through several nodes; the successive edge lengths are 2, 2, 3 and 3]

But what does it buy in terms of worst-case bounds?
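A minimal sketch of the skip/count walk follows, assuming the figure's path has edges za, bc, def and ghy (lengths 2, 2, 3 and 3). The `Node` class, `add_edge` and `skip_count_walk` are names invented for this illustration. Edge labels are stored as strings for readability; an implementation would normally store a pair of indices into S, so that an edge's length is available in constant time.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.children = {}  # first character -> (edge label, child Node)


def add_edge(parent, label, name):
    """Attach a new child below `parent` along an edge with this label."""
    child = Node(name)
    parent.children[label[0]] = (label, child)
    return child


def skip_count_walk(node, gamma):
    """Walk down the path labeled gamma, which is assumed to exist.

    Returns (last node reached, characters left over on the next edge,
    number of edges jumped). Only the first character of each edge is
    inspected; the rest of the edge is skipped using its length.
    """
    pos, hops = 0, 0
    while pos < len(gamma):
        label, child = node.children[gamma[pos]]
        if len(label) > len(gamma) - pos:
            break                  # gamma ends inside this edge
        pos += len(label)          # jump over the whole edge at once
        node = child
        hops += 1
    return node, len(gamma) - pos, hops


root = Node("s(v)")
n1 = add_edge(root, "za", "n1")
n2 = add_edge(n1, "bc", "n2")
n3 = add_edge(n2, "def", "n3")
n4 = add_edge(n3, "ghy", "n4")

end, leftover, hops = skip_count_walk(root, "zabcdefghy")
print(end.name, leftover, hops)
```

The walk over the ten characters of zabcdefghy makes four jumps and inspects four characters. For the shorter string zabcde it stops at n2 with two characters left over on the edge def.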

Page 28

Lemma 6.1.2: Let (v, s(v)) be any suffix link traversed during Ukkonen’s algorithm. At that moment, the node-depth of v is at most one greater than the node-depth of s(v).

[Figure: examples labeled v=2, s(v)=1; v=3, s(v)=3; v=4, s(v)=5]

Page 29

Lemma 6.1.2: Let (v, s(v)) be any suffix link traversed during Ukkonen’s algorithm. At that moment, the node-depth of v is at most one greater than the node-depth of s(v).

Theorem 6.1.1: Using the skip/count trick, any phase of Ukkonen’s algorithm takes O(m) time.

In a single extension, the algorithm
– walks up at most one edge
– finds a suffix link and traverses it
– walks down some number of nodes
– applies the suffix extension rules
– and may add a suffix link

All operations except the down-walk take constant time, so only the down-walk time needs to be analyzed.

Page 30

Down-walk analysis for Theorem 6.1.1 (using Lemma 6.1.2):

– The up-walk decreases the current node-depth by at most one
– The suffix link traversal decreases the node-depth by at most another one
– Each down-walk moves to a greater node-depth

– Over the entire phase, the current node-depth is decremented at most 2m times

– Since no node can have depth greater than m, the total possible increment to the current node-depth is bounded by 3m over the entire phase

– So the total number of edge traversals during down-walks is bounded by 3m
– Since each edge traversal takes constant time with the skip/count trick, all the down-walking in a phase is O(m).

Page 31

Complexity

• There are m phases
• Each phase takes O(m)
• So the running time is O(m^2)

Two more tricks and we are done.

Page 32

Reference

• Dan Gusfield, Algorithms on Strings, Trees and Sequences, Chapter 6