Suffix Trees: Construction and Applications

João Carreira

Outline

Why Suffix Trees?
Definition
Ukkonen's Algorithm (construction)
Applications

Why Suffix Trees?

Asymptotically fast.
The basis of state-of-the-art data structures.
You don't need a PhD to use them.
Challenging.
They expose interesting algorithmic ideas.

Definition

Suffix Tree for an m-character string S:

m leaves numbered 1 to m
edge-label vs node-label: each edge carries a non-empty substring of S, and the label of a node is the concatenation of the edge-labels on the path from the root to that node
each internal node other than the root has at least two children
the label of the leaf j is S[ j..m ]
no two edges out of the same node can have edge-labels beginning with the same character

Definition: Example

String: xabxac

Length (m): 6 characters

Number of Leaves: 6

Node 5 label: ac

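Implicit vs Explicit

What if we have “axabx”? The suffix x is a prefix of the suffix xabx, so its path does not end at a leaf and the tree is only implicit. Appending a terminal character $ that occurs nowhere else in the string gives every suffix its own leaf.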
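The leaf labels and the implicit/explicit distinction can be checked by brute force, without building any tree. The sketch below is only an illustration; the helper names `leaf_labels` and `hidden_suffixes` are made up for this example and do not come from the slides.

```python
def leaf_labels(s):
    """Map each leaf number j (1-based, as on the slides) to its label S[j..m]."""
    return {j: s[j - 1:] for j in range(1, len(s) + 1)}


def hidden_suffixes(s):
    """Suffixes that are a prefix of a longer suffix, so they cannot end at a leaf."""
    return [s[j:] for j in range(1, len(s))
            if any(s[k:].startswith(s[j:]) for k in range(j))]


print(leaf_labels("xabxac")[5])      # label of leaf 5
print(hidden_suffixes("xabxac"))     # no hidden suffix: 6 suffixes, 6 leaves
print(hidden_suffixes("axabx"))      # the suffix "x" is hidden inside "xabx"
print(hidden_suffixes("axabx$"))     # the terminator makes every suffix end at a leaf
```

An empty result means every suffix gets its own leaf; a non-empty result means the tree for that string is only implicit.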

Ukkonen's Algorithm: Suffix Tree Construction

Text: S[ 1..m ]
m phases
phase i + 1 is divided into i + 1 extensions

In extension j of phase i + 1:
find the end of the path from the root labeled with substring S[ j..i ]
extend the substring by adding the character S(i + 1) to its end

Extension Rules

Let β = S[ j..i ].

Rule 1: Path β ends at a leaf. S(i + 1) is added to the end of the label on that leaf edge.

Rule 2: No path from the end of β starts with S(i + 1), but at least one labeled path continues from the end of β. A new leaf edge labeled S(i + 1) is created at the end of β, together with a new internal node if β ends inside an edge. The new leaf is numbered j.

Rule 3: Some path from the end of β starts with S(i + 1), so we do nothing.
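To see the phases, extensions and rules in action, here is a minimal sketch of the naive algorithm. It is an assumption-laden simplification: it works on an uncompressed trie (one character per edge, stored as nested dicts) rather than on a real suffix tree, so Rules 1 and 2 both just add a child. The rule that fires in each extension is still the same. The name `naive_phases` is hypothetical.

```python
def naive_phases(s):
    """Run every extension of every phase naively and log which rule applies.

    Returns (trie, log) where log holds (phase, extension, rule) triples,
    all 1-based to match the slides.
    """
    root = {}
    log = []
    for i in range(len(s)):              # phase i + 1 adds the character s[i]
        for j in range(i + 1):           # extension j + 1 handles beta = s[j:i]
            node = root
            for ch in s[j:i]:            # naive walk from the root: O(|beta|)
                node = node[ch]
            c = s[i]
            if c in node:
                rule = 3                 # beta + c is already in the trie
            elif node or node is root:
                rule = 2                 # beta continues, but not with c
                node[c] = {}
            else:
                rule = 1                 # beta ends at a leaf
                node[c] = {}
            log.append((i + 1, j + 1, rule))
    return root, log


_, log = naive_phases("xabxac")
for phase in (5, 6):
    print(phase, [rule for p, _, rule in log if p == phase])
```

In the log for "xabxac", phase 5 ends with a run of Rule 3 extensions and every phase starts with a run of Rule 1 extensions.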

Ukkonen's Algorithm

Complexity:

m phases
phase j -> j extensions
find the end of the path of substring β: O(|β|) = O(m)
apply the extension rule once the end is found: O(1)
total for this naive version: O(m^3)

“First make it run, then make it run fast.”

Brian Kernighan

Suffix Links: Definition

Let x denote a single character and α a (possibly empty) string. For an internal node v with path-label xα, if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link.

Suffix Links: Lemma

If a new internal node v with path-label xα is added to the current tree in extension j of some phase, then either the path labeled α already ends at an internal node, or an internal node at the end of the string α will be created in the next extension of the same phase.

Proof sketch:

A new internal node is created only when Rule 2 applies.
Then S[ j..i ] continues with some character c ≠ S(i + 1).
So S[ j + 1..i ] also continues with c.
If S[ j + 1..i ] continues with c only, Rule 2 creates an internal node at its end in extension j + 1; otherwise an internal node is already there.

Single Extension Algorithm

Extension j of phase i + 1:

1. Find the first node v at or above the end of S[ j - 1..i ] that either has a suffix link from it or is the root. Let λ denote the (possibly empty) string between v and the end of S[ j - 1..i ].

2. If v is the root, follow the path for S[ j..i ] (as in the naive algorithm). Else traverse the suffix link and walk down from s(v) following the path for string λ.

3. Using the extension rules, ensure that the string S[ j..i ] S(i + 1) is in the tree.

4. If a new internal node w was created in extension j - 1 (by Rule 2), then string α = S[ j..i ] must end at node s(w), the end node for the suffix link from w. Create the suffix link (w, s(w)) from w to s(w).

Node Depth

The node-depth of v is at most one greater than the node depth of s(v).

[Figure: the paths labeled xλ and λ, annotated with node depths 4 and 3.]

Skip/count Trick

γ: the characters of the edge label being traversed
“Directly implemented” edge traversal: O(|γ|)

“Jump” from node to node: the string being followed is known to be in the tree, so only the first character and the length of each edge are needed.
K = number of nodes in a path
Time to traverse a path: O(K)
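A minimal sketch of the skip/count walk follows. Two things here are assumptions rather than material from the slides. The tree is built by a simple quadratic insertion of every suffix (`build_naive`), only so that the example is self-contained. Edges are stored as `(start, end, child)` triples with 0-based, end-exclusive indices into the text.

```python
class Node:
    def __init__(self):
        self.children = {}   # first char -> (start, end, child); label = text[start:end]


def build_naive(text):
    """Quadratic suffix-tree construction, used here only as a test fixture."""
    root = Node()
    for j in range(len(text)):
        node, i = root, j
        while i < len(text):
            c = text[i]
            if c not in node.children:
                node.children[c] = (i, len(text), Node())    # new leaf edge
                break
            start, end, child = node.children[c]
            k = 0
            while start + k < end and i + k < len(text) and text[start + k] == text[i + k]:
                k += 1
            if start + k == end:                  # consumed the whole edge
                node, i = child, i + k
            else:                                 # mismatch inside the edge: split it
                mid = Node()
                mid.children[text[start + k]] = (start + k, end, child)
                node.children[c] = (start, start + k, mid)
                node, i = mid, i + k
    return root


def skip_count_walk(root, text, lo, hi):
    """Find the end of text[lo:hi], which must already be in the tree.

    Only the first character of each edge is inspected; the rest of the edge
    is skipped by its length. Returns (node, inside, hops), where inside is
    None if the string ends exactly at node, else (edge_start, offset).
    """
    node, hops = root, 0
    while lo < hi:
        start, end, child = node.children[text[lo]]
        length = end - start
        if hi - lo < length:
            return node, (start, hi - lo), hops   # ends inside this edge
        lo += length                              # skip the whole edge
        node = child
        hops += 1
    return node, None, hops


text = "xabxac$"
root = build_naive(text)
node, inside, hops = skip_count_walk(root, text, 0, 4)   # follow "xabx"
print(hops, inside)
```

Following "xabx" takes one jump over the edge labeled xa and then stops two characters into the next edge, without comparing those characters.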

Ukkonen's Algorithm

Using the skip/count trick, any phase of Ukkonen's algorithm takes O(m) time.

Proof:

There are i + 1 ≤ m extensions in phase i + 1.
In a single extension, the algorithm walks up at most one edge, traverses one suffix link, walks down some number of nodes, applies the extension rules and may add a suffix link.
The up-walk decreases the current node-depth by at most one.
Each suffix link traversal decreases the node-depth by at most another one.
Each down-walk moves to a node of greater depth.
Over the entire phase the node-depth is decremented at most 2m times.
No node can have depth greater than m, so the total increment to current node-depth (down walks) is bounded by 3m over the entire phase.

Ukkonen's Algorithm

m phases
1 phase: O(m)
total with suffix links and the skip/count trick: O(m^2)

“First make it run fast, then make it run faster.”

João Carreira

Edge-Label Compression

A string with m characters has m suffixes.

If edge labels are represented with characters, O(m^2) space is needed.

To achieve O(m) space, each edge-label is stored as a pair of indices (p, q) into the text, standing for the substring S[ p..q ].
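A small sketch of the index-pair representation, using the 1-based inclusive convention of the slides. The class name `EdgeLabel` and its `text` method are made up for this illustration.

```python
from typing import NamedTuple


class EdgeLabel(NamedTuple):
    p: int   # 1-based index of the first character of the label
    q: int   # 1-based index of the last character (inclusive)

    def text(self, s: str) -> str:
        """Recover the label S[p..q] from the text on demand."""
        return s[self.p - 1:self.q]


S = "xabxac"
edge = EdgeLabel(3, 6)
print(edge, edge.text(S))   # two integers stand in for the four characters "bxac"
```

Each edge costs two integers regardless of how long its label is, which is why the whole tree fits in O(m) space.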

Two more tricks...

Rule 3 is a show stopper

If Rule 3 applies in extension j, it will also apply in all further extensions until the end of the phase.

When Rule 3 applies, the path labeled S[ j..i ] must continue with character S(i + 1), and so the path labeled S[ j + 1..i ] does also, and Rule 3 again applies in extensions j + 1, ..., i + 1.

End any phase i + 1 the first time Rule 3 applies.

The remaining extensions are said to be done implicitly.

Once a leaf always a leaf

Leaf created => always a leaf in all successive trees.
No mechanism for extending a leaf edge beyond its current leaf.

Once there is a leaf labeled j, extension Rule 1 will always apply to extension j in any successive phase.

Leaf Edge Label: (p, e), where e is a single global index denoting the current end of the text. Incrementing e once per phase performs every Rule 1 extension implicitly.
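The shared end index can be sketched as one mutable object referenced by every leaf edge. The class names `GlobalEnd` and `LeafEdge` are hypothetical. The example replays phases 1 to 5 for the string xabxac, in which leaves 1, 2 and 3 are created and no edge is split.

```python
class GlobalEnd:
    """The single index e shared by every leaf edge."""
    def __init__(self):
        self.value = 0


class LeafEdge:
    def __init__(self, p, end):
        self.p = p          # 1-based start of the label
        self.end = end      # shared GlobalEnd, not a copy of its value

    def text(self, s):
        return s[self.p - 1:self.end.value]


S = "xabxac"
e = GlobalEnd()
leaves = []
for phase in (1, 2, 3):
    e.value = phase                      # one O(1) update extends every leaf edge
    leaves.append(LeafEdge(phase, e))    # leaf `phase` is created in this phase

print([leaf.text(S) for leaf in leaves])   # labels after phase 3
e.value = 5                                # phases 4 and 5: Rule 1 done implicitly
print([leaf.text(S) for leaf in leaves])
```

No leaf edge is touched individually in phases 4 and 5; the single assignment to `e.value` lengthens all three labels.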

Single Phase Algorithm

In each phase i:

Single Phase Algorithm

During construction:

Implicit to Explicit

One last phase to add character $: O(m)

Suffix Trees are a Swiss Army Knife

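Applications: Exact String Matching

Three occurrences of string aw.

Preprocessing: O(m), where m is the length of the text

Search: O(n + k), where n is the length of the pattern and k is the number of occurrences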
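The sketch below puts construction and search together. It is not the slides' own pseudocode: it uses the common "active point" formulation of Ukkonen's algorithm (active node, active edge, active length, remainder), which packages suffix links, the skip/count walk, the Rule 3 stop and the global leaf end into one loop. All names are assumptions, positions are 0-based, and the text must end with a unique terminator.

```python
class Node:
    def __init__(self, start, end):
        self.start = start      # index of the first character of the incoming edge
        self.end = end          # index of the last character; None = global end (leaf)
        self.children = {}      # first character of an edge label -> child node
        self.link = None        # suffix link (internal nodes only)


def build_suffix_tree(s):
    """Ukkonen's construction; s must end with a character used nowhere else."""
    root = Node(-1, -1)
    node, edge, length = root, 0, 0     # active point
    remainder = 0                       # suffixes still waiting for explicit insertion
    for i, c in enumerate(s):           # phase i + 1; every leaf edge now ends at i
        remainder += 1
        last_new = None
        while remainder > 0:
            if length == 0:
                edge = i
            first = s[edge]
            nxt = node.children.get(first)
            if nxt is None:                         # Rule 2 at a node: new leaf
                node.children[first] = Node(i, None)
                if last_new is not None:
                    last_new.link = node
                    last_new = None
            else:
                span = (i if nxt.end is None else nxt.end) - nxt.start + 1
                if length >= span:                  # skip/count: jump over the edge
                    edge += span
                    length -= span
                    node = nxt
                    continue
                if s[nxt.start + length] == c:      # Rule 3: stop the phase
                    if last_new is not None and node is not root:
                        last_new.link = node
                    length += 1
                    break
                split = Node(nxt.start, nxt.start + length - 1)   # Rule 2 inside an edge
                node.children[first] = split
                split.children[c] = Node(i, None)
                nxt.start += length
                split.children[s[nxt.start]] = nxt
                if last_new is not None:
                    last_new.link = split
                last_new = split
            remainder -= 1
            if node is root and length > 0:
                length -= 1
                edge = i - remainder + 1
            elif node is not root:
                node = node.link or root            # follow the suffix link
    return root


def occurrences(root, s, pattern):
    """0-based start positions of pattern in s: match down, then collect leaves."""
    node, depth, i = root, 0, 0
    while i < len(pattern):
        child = node.children.get(pattern[i])
        if child is None:
            return []
        end = len(s) - 1 if child.end is None else child.end
        label = s[child.start:end + 1]
        step = min(len(label), len(pattern) - i)
        if label[:step] != pattern[i:i + step]:
            return []
        i, depth, node = i + step, depth + len(label), child
    found, stack = [], [(node, depth)]
    while stack:
        v, d = stack.pop()
        if not v.children:
            found.append(len(s) - d)    # a leaf at string depth d is suffix len(s) - d
        for w in v.children.values():
            end = len(s) - 1 if w.end is None else w.end
            stack.append((w, d + end - w.start + 1))
    return sorted(found)                # sorted only to make the output readable


text = "xabxac$"
tree = build_suffix_tree(text)
print(occurrences(tree, text, "xa"), occurrences(tree, text, "ca"))
```

Matching the pattern costs O(n). Collecting the k leaves below the match point costs O(k), because every internal node has at least two children. The final sort is a convenience and is not part of the O(n + k) bound.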

Applications: And much more...

Longest common substring: O(n)
Longest repeated substring: O(n)
Longest palindrome: O(n)
Most frequently occurring substrings of a minimum length: O(n)
Shortest substrings occurring only once: O(n)
Lempel-Ziv decomposition: O(n)
...

“Biology easily has 500 years of exciting problems to work on.”

Donald Knuth

web.ist.utl.pt/joao.carreira

Questions?
