Suffix Trees: Construction and Applications

João Carreira

2008

Outline

1. Why Suffix Trees?
2. Definition
3. Ukkonen's Algorithm (construction)
4. Applications

Why Suffix Trees?

Asymptotically fast.
The basis of state-of-the-art data structures.
You don't need a PhD to use them.
Challenging.
They expose interesting algorithmic ideas.

Definition

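Suffix tree for an m-character string S:

m leaves, numbered 1 to m
edge-label vs node-label: an edge-label is the substring written on an edge; the label of a node (its path-label) is the concatenation of the edge-labels on the path from the root to that node
each internal node other than the root has at least two children
the label of leaf j is S[ j..m ]
no two edges out of the same node can have edge-labels beginning with the same character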

Definition: Example

String: xabxac

Length (m): 6 characters

Number of leaves: 6

Label of leaf 5: ac (the suffix S[ 5..6 ])
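The example can be checked with a small script. What follows is only a sketch of a naive, quadratic-time construction (insert each suffix from the root, splitting an edge where the suffix diverges); it is not Ukkonen's algorithm, and the names `Node`, `naive_suffix_tree` and `leaf_labels` are made up for this illustration.

```python
class Node:
    def __init__(self):
        self.children = {}  # first character of edge-label -> (edge-label, child)
        self.leaf = None    # 1-based suffix number for a leaf, None otherwise


def naive_suffix_tree(s):
    """Insert the suffixes S[j..m] one at a time: O(m^2) time overall."""
    root = Node()
    for j in range(len(s)):
        node, rest = root, s[j:]
        while True:
            first = rest[0]
            if first not in node.children:
                leaf = Node()
                leaf.leaf = j + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(rest):
                # The suffix ends inside the tree, so it cannot get its own leaf.
                raise ValueError("a suffix is a prefix of another suffix")
            if k == len(label):
                node, rest = child, rest[k:]  # whole edge matched: keep descending
                continue
            # Mismatch inside the edge: split it with a new internal node.
            mid = Node()
            node.children[first] = (label[:k], mid)
            mid.children[label[k]] = (label[k:], child)
            node, rest = mid, rest[k:]
    return root


def leaf_labels(node, prefix=""):
    """Map each leaf number to its path-label."""
    if node.leaf is not None:
        return {node.leaf: prefix}
    labels = {}
    for label, child in node.children.values():
        labels.update(leaf_labels(child, prefix + label))
    return labels


labels = leaf_labels(naive_suffix_tree("xabxac"))
print(len(labels), labels[5])  # 6 ac
```

Storing edges in a dictionary keyed by first character relies on the last rule of the definition: no two edges out of a node start with the same character.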

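Implicit vs Explicit

What if we have “axabx”? The suffix x is a prefix of the suffix xabx, so its path does not end at a leaf and the tree is only implicit.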
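A quick way to see when this happens is to list the suffixes that are proper prefixes of a longer suffix. The helper below is a hypothetical brute-force check written for illustration, and the terminator `$` is assumed not to occur in the string.

```python
def suffixes_hidden_as_prefixes(s):
    """Return the suffixes of s that are proper prefixes of a longer suffix."""
    suffixes = [s[j:] for j in range(len(s))]
    return [
        short
        for short in suffixes
        if any(other != short and other.startswith(short) for other in suffixes)
    ]


print(suffixes_hidden_as_prefixes("axabx"))   # ['x']
print(suffixes_hidden_as_prefixes("axabx$"))  # []
```

Appending a character that appears nowhere else makes every suffix end at its own leaf, which is what turns an implicit tree into an explicit one.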

Ukkonen's Algorithm: suffix tree construction

Ukkonen's Algorithm

Text: S[ 1..m ]
m phases
phase i + 1 is divided into i + 1 extensions

In extension j of phase i + 1:
find the end of the path from the root labeled with substring S[ j..i ]
extend the substring by adding the character S(i + 1) to its end

Extension Rules

Let β = S[ j..i ].

Rule 1: Path β ends at a leaf. S(i + 1) is added to the end of the label on that leaf edge.

Rule 2: No path from the end of β starts with S(i + 1), but at least one labeled path continues from the end of β. A new leaf edge labeled S(i + 1) is created at the end of β; if β ends inside an edge, that edge is first split by a new internal node.

Rule 3: Some path from the end of β starts with S(i + 1), so we do nothing.
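To watch the rules fire, the phases and extensions can be run naively on an uncompressed trie (one character per edge). This is a simplification assumed for the sketch: with one character per edge, Rule 2 never has to split an edge, and the root is never treated as a leaf. The function name `naive_phases` is hypothetical.

```python
def naive_phases(s):
    """Run every extension of every phase on a trie of nested dicts.

    Returns the trie and, for each phase, the rule applied in each extension.
    """
    root = {}
    log = []
    for i in range(len(s)):          # phase i + 1 adds the character s[i]
        c = s[i]
        rules = []
        for j in range(i + 1):       # extension j + 1 handles beta = s[j:i]
            node = root
            for ch in s[j:i]:        # find the end of beta from the root
                node = node[ch]
            if c in node:
                rules.append(3)      # beta already continues with c
            elif node is not root and not node:
                node[c] = {}
                rules.append(1)      # beta ends at a leaf: extend it
            else:
                node[c] = {}
                rules.append(2)      # beta continues, but not with c
        log.append(rules)
    return root, log


_, log = naive_phases("xabxa")
for phase, rules in enumerate(log, start=1):
    print(phase, rules)
```

For "xabxa" the last phase logs [1, 1, 1, 3, 3]: rule 1 for the existing leaves, then rule 3 for every remaining extension.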

Ukkonen's Algorithm

Complexity:

m phases
phase i + 1 -> i + 1 extensions
find the end of the path of substring β: O(|β|) = O(m)
each extension, once the end of β is found: O(1)

Total: O(m^3)
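The cubic bound can be made concrete by counting steps. The sketch below assumes a simple cost model (one step per character of β walked from the root, plus one step for applying a rule); `naive_work` is a name invented for this count.

```python
def naive_work(m):
    """Steps taken when every extension finds the end of beta from the root."""
    total = 0
    for i in range(m):            # phase i + 1
        for j in range(i + 1):    # extension j + 1, where |beta| = i - j
            total += (i - j) + 1  # walk beta, then one O(1) rule application
    return total


print(naive_work(6))  # 56
```

Under this model the total is m(m + 1)(m + 2) / 6, which grows as m^3.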

“First make it run, then make it run fast.”

Brian Kernighan

Suffix Links

Definition: Let x be a single character and α a (possibly empty) string. For an internal node v with path-label xα, if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link.

Suffix Links

Lemma: If a new internal node v with path-label xα is added to the current tree in extension j of some phase, then either the path labeled α already ends at an internal node, or an internal node at the end of the string α will be created in the next extension of the same phase.

If Rule 2 applies:

S[ j..i ] continues with some character c ≠ S(i + 1)
S[ j + 1..i ] continues with c as well, so in extension j + 1 either an internal node already exists at the end of α = S[ j + 1..i ], or Rule 2 creates one there.

Single Extension Algorithm

Extension j of phase i + 1:

1. Find the first node v at or above the end of S[ j - 1..i ] that either has a suffix link from it or is the root. Let λ denote the string between v and the end of S[ j - 1..i ].
2. If v is the root, follow the path for S[ j..i ] (as in the naive algorithm). Else traverse the suffix link and walk down from s(v) following the path for string λ.
3. Using the extension rules, ensure that the string S[ j..i ] S(i + 1) is in the tree.
4. If a new internal node w with path-label xα was created in extension j - 1 (by rule 2), then string α must end at node s(w), the end node for the suffix link from w. Create the suffix link (w, s(w)) from w to s(w).

Node Depth

The node-depth of v is at most one greater than the node-depth of s(v).

[Figure: two paths with edge labels built from α, β and λ, one of them prefixed by x; captions: "equal node-depth: 3", "Node depth: 4", "Node depth: 3".]

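Skip/count Trick

|γ| = number of characters on an edge with label γ
“Directly implemented” edge traversal: O(|γ|)

“Jump” from node to node: compare only the first character of each edge and skip the rest of the edge by its length.
k = number of nodes on a path
Time to traverse a path: O(k)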

Ukkonen's Algorithm

Using the skip/count trick, any phase of Ukkonen's algorithm takes O(m) time.

Proof:

There are i + 1 ≤ m extensions in phase i + 1.
In a single extension, the algorithm walks up at most one edge, traverses one suffix link, walks down some number of nodes, applies the extension rules and may add a suffix link.
The up-walk decreases the current node-depth by at most one.
Each suffix link traversal decreases the node-depth by at most another one.
Each down-walk moves to a node of greater depth.
Over the entire phase the node-depth is decremented at most 2m times.
No node can have depth greater than m, so the total increment to current node-depth (down-walks) is bounded by 3m over the entire phase.

Ukkonen's Algorithm

m phases
1 phase: O(m)

Total: O(m^2)

“First make it run fast, then make it run faster.”

João Carreira

Edge-Label Compression

A string with m characters has m suffixes.

If edge labels are represented with characters, O(m^2) space is needed.

To achieve O(m) space, each edge-label is stored as a pair of indices (p, q) denoting the substring S[ p..q ].
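As a small illustration of the (p, q) encoding, the sketch below decodes index pairs back into labels for the xabxac example and counts how many characters explicit labels would need in a worst case. The helper names `edge_label` and `label_characters` are assumptions made for this example.

```python
def edge_label(s, p, q):
    """Decode the index pair (p, q), 1-based and inclusive, into S[p..q]."""
    return s[p - 1:q]


S = "xabxac"
# The eight edges of the suffix tree for "xabxac", written out as strings...
explicit = ["xa", "bxac", "c", "a", "bxac", "c", "bxac", "c"]
# ...and as index pairs into S: two integers per edge, whatever the label length.
compressed = [(1, 2), (3, 6), (6, 6), (2, 2), (3, 6), (6, 6), (3, 6), (6, 6)]
decoded = [edge_label(S, p, q) for p, q in compressed]
print(decoded == explicit)  # True


def label_characters(m):
    """Characters stored explicitly when all m characters of S are distinct:
    the tree is then m leaf edges S[j..m] hanging off the root."""
    return sum(m - j + 1 for j in range(1, m + 1))


print(label_characters(1000), "characters vs", 2 * 1000, "integers")
```

For a string of 1000 distinct characters that is 500500 stored characters against 2000 integers.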

Two more tricks...

Rule 3 is a show stopper

If rule 3 applies in extension j, it will also apply in all further extensions until the end of the phase.

Why?

When rule 3 applies, the path labeled S[ j..i ] must continue with character S(i + 1), and so the path labeled S[ j + 1..i ] does also, and rule 3 again applies in extensions j + 1, ..., i + 1.

End any phase i + 1 the first time rule 3 applies.

The remaining extensions are said to be done implicitly.

Once a leaf, always a leaf

Leaf created => always a leaf in all successive trees.
No mechanism for extending a leaf edge beyond its current leaf.

Once there is a leaf labeled j, extension rule 1 will always apply to extension j in any successive phase.

Leaf edge label: (p, e), where e is a single global index denoting the current end of the text.
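One way to picture the (p, e) label is a single shared end marker that every leaf edge points to. The classes `End` and `LeafEdge` below are hypothetical names for a minimal sketch; Python integers are immutable, so the shared index is wrapped in an object.

```python
class End:
    """The shared, mutable 'current end' index e (1-based, inclusive)."""

    def __init__(self):
        self.value = 0


class LeafEdge:
    def __init__(self, p, end):
        self.p = p      # 1-based start index of the label
        self.end = end  # the shared End object, not a copy of its value

    def label(self, s):
        return s[self.p - 1:self.end.value]


S = "xabxac"
e = End()
leaves = [LeafEdge(1, e), LeafEdge(2, e), LeafEdge(3, e)]

e.value = 3  # after phase 3
print([leaf.label(S) for leaf in leaves])  # ['xab', 'ab', 'b']

e.value = 4  # one assignment performs every rule 1 extension of phase 4
print([leaf.label(S) for leaf in leaves])  # ['xabx', 'abx', 'bx']
```

Incrementing e once per phase replaces all the explicit rule 1 extensions, which is why they cost constant time in total per phase.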

Single Phase Algorithm

In each phase i:

Single Phase Algorithm

During construction:

Implicit to Explicit

One last phase to add character $: O(m)
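Putting the pieces together, here is a compact sketch of a linear-time construction. It is written in the widely used "active point" formulation rather than the step-by-step form on the slides, so treat it as one possible arrangement of the same ideas (suffix links, skip/count, index-pair labels, stopping at rule 3, and a global end for leaves), not as the presenter's code. Indices are 0-based and half-open, a leaf stores `end = None` to mean "the current end", and the caller is assumed to append a terminator such as `$` that occurs nowhere else.

```python
class Node:
    def __init__(self, start, end):
        self.start = start  # edge-label into this node is text[start:end]
        self.end = end      # None for a leaf: the label runs to the current end
        self.children = {}
        self.link = None    # suffix link (None is treated as "root")


def build_suffix_tree(text):
    root = Node(0, 0)
    active_node, active_edge, active_len = root, 0, 0
    remaining = 0                       # suffixes still to be inserted explicitly
    for i, c in enumerate(text):        # phase i + 1
        remaining += 1
        last_new = None                 # internal node waiting for its suffix link
        while remaining > 0:
            if active_len == 0:
                active_edge = i
            first = text[active_edge]
            if first not in active_node.children:
                active_node.children[first] = Node(i, None)  # rule 2: new leaf
                if last_new is not None:
                    last_new.link = active_node
                    last_new = None
            else:
                nxt = active_node.children[first]
                edge_len = (i + 1 if nxt.end is None else nxt.end) - nxt.start
                if active_len >= edge_len:                   # skip/count
                    active_node = nxt
                    active_edge += edge_len
                    active_len -= edge_len
                    continue
                if text[nxt.start + active_len] == c:        # rule 3: end phase
                    active_len += 1
                    if last_new is not None:
                        last_new.link = active_node
                    break
                split = Node(nxt.start, nxt.start + active_len)  # rule 2: split
                active_node.children[first] = split
                split.children[c] = Node(i, None)
                nxt.start += active_len
                split.children[text[nxt.start]] = nxt
                if last_new is not None:
                    last_new.link = split
                last_new = split
            remaining -= 1
            if active_node is root and active_len > 0:
                active_len -= 1
                active_edge = i - remaining + 1
            elif active_node is not root:
                active_node = active_node.link or root
    return root


def leaf_path_labels(root, text):
    """Path-label of every leaf, used here only to check the result."""
    out, stack = [], [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for child in node.children.values():
            end = len(text) if child.end is None else child.end
            label = prefix + text[child.start:end]
            if child.children:
                stack.append((child, label))
            else:
                out.append(label)
    return out


text = "xabxac" + "$"
tree = build_suffix_tree(text)
print(sorted(leaf_path_labels(tree, text)) == sorted(text[j:] for j in range(len(text))))
```

Because `$` is processed as an ordinary final character, the last iteration of the loop plays the role of the extra phase that makes the tree explicit.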

Suffix Trees are a Swiss Knife

Applications

Exact String Matching:

Three occurrences of string aw.

For a text of length m, a pattern of length n and k occurrences:
Preprocessing: O(m)
Search: O(n + k)
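The search itself is short: match the pattern from the root, then report every leaf below the point where the match ends. The sketch below makes two assumptions worth flagging. The text "awyawxawxz" is an example chosen here (the slide's figure is not in the transcript), and the tree is built with a naive quadratic routine to keep the example short, so only the search reflects the stated bounds. The function names are hypothetical.

```python
def build_suffix_tree(text):
    """Naive O(m^2) construction. A node is a dict mapping the first character
    of an edge to (edge-label, child); a leaf is an int (1-based suffix number).
    The text must end with a terminator that occurs nowhere else."""
    root = {}
    for j in range(len(text)):
        node, rest = root, text[j:]
        while True:
            if rest[0] not in node:
                node[rest[0]] = (rest, j + 1)
                break
            label, child = node[rest[0]]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):          # whole edge matched: descend
                node, rest = child, rest[k:]
                continue
            mid = {label[k]: (label[k:], child)}  # split the edge
            node[rest[0]] = (label[:k], mid)
            node, rest = mid, rest[k:]
    return root


def find_all(root, pattern):
    """Match the pattern from the root, then collect the leaves below."""
    node, rest = root, pattern
    while rest:
        if not isinstance(node, dict) or rest[0] not in node:
            return []
        label, child = node[rest[0]]
        k = min(len(label), len(rest))
        if label[:k] != rest[:k]:
            return []
        node, rest = child, rest[k:]
    found, stack = [], [node]
    while stack:                         # one visit per node below the match
        cur = stack.pop()
        if isinstance(cur, dict):
            stack.extend(child for _, child in cur.values())
        else:
            found.append(cur)
    return sorted(found)                 # sorted only for readable output


tree = build_suffix_tree("awyawxawxz" + "$")
print(find_all(tree, "aw"))  # [1, 4, 7]
```

The leaf numbers under the node reached by "aw" are exactly the starting positions of its occurrences.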

Applications

And much more...

Longest common substring: O(n)
Longest repeated substring: O(n)
Longest palindrome: O(n)
Most frequently occurring substrings of a minimum length: O(n)
Shortest substrings occurring only once: O(n)
Lempel-Ziv decomposition: O(n)
...

“Biology easily has 500 years of exciting problems to work on.”

Donald Knuth

web.ist.utl.pt/joao.carreira

Questions?