Data Structures & Algorithms Radix Search Richard Newman based on slides by S. Sahni and book by R. Sedgewick

Data Structures & Algorithms

Radix Search

Richard Newman, based on slides by S. Sahni and book by R. Sedgewick

Radix-based Keys

• Key has multiple parts

• Each part is an element of some set

• Character

• Numeral

• Key parts can be accessed (e.g., string s[i])

• Size of set is radix
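
As a minimal sketch of what "key parts can be accessed" means, the hypothetical helpers below (names `bit` and `char_digit` are not from the slides) extract the i-th part of a key. The 5-bit width is an assumption chosen to fit letter codes 1 to 26.

```python
def bit(key: int, i: int, width: int = 5) -> int:
    """Return the i-th bit of key, counting from the most significant bit (radix 2)."""
    return (key >> (width - 1 - i)) & 1


def char_digit(key: str, i: int) -> int:
    """Return the i-th character of key as a number (radix 256 for 8-bit characters)."""
    return ord(key[i])


# The letter S encoded as its position in the alphabet (19 = 0b10011).
s_code = ord("S") - ord("A") + 1
print([bit(s_code, i) for i in range(5)])  # → [1, 0, 0, 1, 1]
print(char_digit("now", 0))                # → 110
```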

Advantages of Radix-based Search

• Good worst-case performance

• Simpler than balanced trees, etc.

• Fast access to data

• Easy way to handle variable-length keys

• Save space (part of key in structure)

Disadvantages of Radix-based Search

• May be space-inefficient

• Performance depends on access to bytes of keys

• Must have distinct keys, or other way to handle duplicate keys

Digital Search Trees

• Similar to binary search trees

• Difference is that we use bits of the key to determine subtree to search

• Path in tree = prefix of key

Digital Search Trees

• Insert A-S-E-R-C-H-I-N-G

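Key  Repr
A    00001
S    10011
E    00101
R    10010
C    00011
H    01000
I    01001
N    01110
G    00111

[Figure: digital search tree after the insertions, with A at the root; each link is labeled 0 (left) or 1 (right) for the key bit tested at that level]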

Note that binary tree is not sorted in BST sense

Digital Search Trees

Prop 15.1: A search or insertion in a DST takes about lg N comparisons on average, and about 2 lg N comparisons in the worst case, in a tree built from N random keys. The number of comparisons is never more than the number of bits in the search key.
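
The DST idea can be sketched as follows. This is an illustrative version, not the course's code: it assumes 5-bit integer keys (letter codes 1 to 26), distinct keys, and hypothetical names `dst_insert` and `dst_search`. Each node stores a key; the bit at the current depth picks the subtree.

```python
WIDTH = 5  # assumed key width in bits


def bit(key, i):
    """i-th bit of key, most significant bit first."""
    return (key >> (WIDTH - 1 - i)) & 1


class Node:
    def __init__(self, key):
        self.key = key
        self.left = None   # followed when the tested bit is 0
        self.right = None  # followed when the tested bit is 1


def dst_insert(node, key, depth=0):
    """Insert key below node; the bit at position depth chooses the subtree."""
    if node is None:
        return Node(key)
    if key == node.key:
        return node  # duplicate keys are ignored in this sketch
    if bit(key, depth) == 0:
        node.left = dst_insert(node.left, key, depth + 1)
    else:
        node.right = dst_insert(node.right, key, depth + 1)
    return node


def dst_search(node, key):
    """Compare the full key at each node, then branch on the next bit."""
    depth = 0
    while node is not None:
        if key == node.key:
            return True
        node = node.left if bit(key, depth) == 0 else node.right
        depth += 1
    return False


codes = {c: ord(c) - ord("A") + 1 for c in "ASERCHING"}
root = None
for c in "ASERCHING":
    root = dst_insert(root, codes[c])

print(dst_search(root, codes["H"]))  # → True
print(dst_search(root, 2))           # → False
```

With this insertion order A stays at the root, S becomes its 1-child and E its 0-child, which is why the tree is not ordered in the BST sense.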

Tries

• Use bits of key to guide search like DST

• But keep keys in order like BST

• Allow recursive sort, etc.

• Pronounced “try-ee” or “try”

• Keys kept at leaves of a binary tree

Tries

• Defn. 15.1: A trie is a binary tree that has keys associated with each leaf, defined as follows:

a trie for an empty set is a null link

a trie for a single key is a leaf w/key

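a trie for > 1 key is an internal node with left link referring to the trie for keys whose leading bit is 0 and right link referring to the trie for keys whose leading bit is 1 (the leading bit is dropped when building the subtries)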
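
The definition above can be sketched in code as follows. This is a hedged illustration that assumes distinct 5-bit integer keys; `split`, `trie_insert`, `trie_search`, `leaves` and `shape` are hypothetical helper names. Keys live only in leaves, and internal nodes only route on bits.

```python
WIDTH = 5  # assumed key width in bits


def bit(key, i):
    return (key >> (WIDTH - 1 - i)) & 1


class Node:
    def __init__(self, key=None):
        self.key = key      # set only on leaves
        self.left = None    # keys whose next bit is 0
        self.right = None   # keys whose next bit is 1

    def is_leaf(self):
        return self.left is None and self.right is None


def split(p, q, depth):
    """Build internal nodes until the distinct keys p and q differ in a bit."""
    node = Node()
    if bit(p, depth) == bit(q, depth):
        child = split(p, q, depth + 1)
        if bit(p, depth) == 0:
            node.left = child
        else:
            node.right = child
    elif bit(p, depth) == 0:
        node.left, node.right = Node(p), Node(q)
    else:
        node.left, node.right = Node(q), Node(p)
    return node


def trie_insert(h, key, depth=0):
    if h is None:
        return Node(key)
    if h.is_leaf():
        return h if h.key == key else split(key, h.key, depth)
    if bit(key, depth) == 0:
        h.left = trie_insert(h.left, key, depth + 1)
    else:
        h.right = trie_insert(h.right, key, depth + 1)
    return h


def trie_search(h, key):
    depth = 0
    while h is not None and not h.is_leaf():
        h = h.left if bit(key, depth) == 0 else h.right
        depth += 1
    return h is not None and h.key == key


def leaves(h):
    """Left-to-right leaf keys; these come out in sorted order."""
    if h is None:
        return []
    if h.is_leaf():
        return [h.key]
    return leaves(h.left) + leaves(h.right)


def shape(h):
    """Nested-tuple picture of the trie, used to compare structures."""
    if h is None:
        return None
    return h.key if h.is_leaf() else (shape(h.left), shape(h.right))


keys = [ord(c) - ord("A") + 1 for c in "ASERCHING"]
t1 = None
for k in keys:
    t1 = trie_insert(t1, k)
t2 = None
for k in reversed(keys):
    t2 = trie_insert(t2, k)

print(leaves(t1) == sorted(keys))  # keys are kept in order
print(shape(t1) == shape(t2))      # same trie for either insertion order
```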

Tries

• Insert A-S-E-R-C-H-I-N-G

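Key/Repr: A 00001, S 10011, E 00101, R 10010, C 00011, H 01000, I 01001, N 01110, G 00111

[Figure: trie holding only A (a single leaf), then the trie after inserting S: an internal root with A on its 0 link and S on its 1 link]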

Construct tree to point where prefixes match

Tries

• Insert A-S-E-R-C-H-I-N-G

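Key/Repr: A 00001, S 10011, E 00101, R 10010, C 00011, H 01000, I 01001, N 01110, G 00111

[Figure: tries after inserting E and then R. A (00001) and E (00101) share the prefix 00, and R (10010) and S (10011) share the prefix 1001, so each pair is separated only below a chain of internal nodes.]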

Construct tree to point where prefixes match

Tries

• Insert A-S-E-R-C-H-I-N-G

Key/Repr: A 00001, S 10011, E 00101, R 10010, C 00011, H 01000, I 01001, N 01110, G 00111

[Figure: tries after inserting C and then H]

Tries

• Insert A-S-E-R-C-H-I-N-G

Key/Repr: A 00001, S 10011, E 00101, R 10010, C 00011, H 01000, I 01001, N 01110, G 00111

[Figure: trie after inserting I, which shares the prefix 0100 with H (01000 vs. 01001)]

Tries

• Prop. 15.2: The structure of a trie is independent of key insertion order; there is one unique trie for any given set of distinct keys.

• Prop. 15.3: Insertion or search for a random key in a trie built from N random keys takes about lg N bit comparisons on average; the worst-case number of bit comparisons is bounded by the number of bits in the key.

Tries

• Annoying feature of tries:

• One-way branching when keys have common prefix

• Prop. 15.4: A trie built from N random w-bit keys has about N/ln 2 nodes on the average (about 1.44 N)

Patricia Tries

• Annoying feature of tries:

• One-way branching when keys have common prefix

• Two different types of nodes in trie

• Patricia tries: fix both of these

• Practical Algorithm To Retrieve Information Coded In Alphanumeric

Patricia Tries

• Avoid one-way branching:

• Keep at each node the index of the next bit to test

• Skip over common prefix!

• Avoid two types of nodes:

• Store data in internal nodes

• Replace external links with back links
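
A compact sketch of these two ideas, modeled on the textbook-style Patricia insertion but written iteratively. It is an illustration under stated assumptions: distinct, nonzero 5-bit integer keys, a dummy head node whose key counts as all-zero bits, and hypothetical names (`Patricia`, `_find`). Each node stores a key and the index of the bit to test; a link to a node whose index does not increase is a back link and ends the descent.

```python
WIDTH = 5  # assumed key width in bits


def bit(key, i):
    return (key >> (WIDTH - 1 - i)) & 1


class Node:
    def __init__(self, key, index):
        self.key = key
        self.index = index  # index of the bit tested at this node
        self.left = self    # links start out as back links to the node itself
        self.right = self


class Patricia:
    def __init__(self):
        # Dummy head: index -1, key treated as all-zero bits (an assumption).
        self.head = Node(None, -1)

    def _find(self, key):
        """Descend until a link points upward (bit index fails to increase)."""
        parent, node = self.head, self.head.left
        while node.index > parent.index:
            parent = node
            node = node.left if bit(key, node.index) == 0 else node.right
        return node

    def search(self, key):
        # One full key comparison, at the end of the descent.
        return self._find(key).key == key

    def insert(self, key):
        if not 0 < key < 2 ** WIDTH:
            raise ValueError("sketch supports nonzero WIDTH-bit keys only")
        found = self._find(key)
        if found.key == key:
            return
        other = 0 if found.key is None else found.key
        d = 0  # first bit where the new key differs from the key found
        while bit(key, d) == bit(other, d):
            d += 1
        # Descend again, stopping at the link where bit d must be tested.
        parent, node = self.head, self.head.left
        while parent.index < node.index < d:
            parent = node
            node = node.left if bit(key, node.index) == 0 else node.right
        new = Node(key, d)
        if bit(key, d) == 0:
            new.left, new.right = new, node   # back link to itself on the 0 side
        else:
            new.left, new.right = node, new   # back link to itself on the 1 side
        if parent is self.head or bit(key, parent.index) == 0:
            parent.left = new
        else:
            parent.right = new


t = Patricia()
keys = [ord(c) - ord("A") + 1 for c in "ASERCHING"]
for k in keys:
    t.insert(k)
print(all(t.search(k) for k in keys))  # → True
print(t.search(2))                     # → False
```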

Patricia Tries

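[Figure: Patricia trie containing the keys A, S, E, R, C and H, with S at the top; each node holds a key and the index (0 to 4) of the bit to test, and links are labeled 0 and 1]

Key/Repr: A 00001, S 10011, E 00101, R 10010, C 00011, H 01000, I 01001, N 01110, G 00111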

Patricia Tries

• Prop 15.5: Insertion or search in a patricia trie built from N random bitstrings takes about lg N bit comparisons on average, and about 2 lg N in the worst case, but never more than the length of the key.

Map

• Radix search

• Digital Search Trees

• Tries

• Patricia Tries

• Multiway tries and TSTs

• Text string algorithms

Multiway Tries

• Like radix sort, can get benefit from comparing more than one bit at a time

• Compare r bits, speed up search by a factor of r

• What could possibly be bad?

• Number of links is now R = 2^r

• Can waste a lot of space!

Multiway Tries

• Structure is (almost) the same as binary tries

• Except there are R branches

• Search: start at root, leftmost digit

• Follow ith link if next R-ary digit is i

• If null link, then miss

• If reach leaf, it contains only key with prefix matching path to it - compare

Existence Tries

• Only keys, no records

• Insert/search

• Defn. 15.2: The existence trie for a set of keys is:

• Empty set: null link

• Non-empty set: internal node with links for each possible digit to tries built with the leading digit omitted

Existence Tries

• Convenient to return null on miss, dummy record on hit

• Convenient to have no duplicate keys and no key a prefix of another key

• Keys of fixed length, or

• Use termination character with value NULLdigit, only used as sentinel

Existence Tries

• No need to store any data

• All keys captured in trie structure

• If reach NULLdigit at the same time we run out of key digits, search hit

• Otherwise, search miss

• Insert: search until find null link, then add nodes for each of the remaining digits in the key
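
The search and insert rules above can be sketched as follows. Assumptions made for illustration: keys are lowercase words, the alphabet has R = 27 digits with index 0 playing the role of NULLdigit, each node is a plain list of R links, and the root node is allocated up front (so the empty set is an empty root rather than a null link).

```python
R = 27  # assumed radix: NULLdigit (index 0) plus the letters a..z


def digits(key):
    """Map a lowercase word to digits 1..26 and append the NULLdigit sentinel."""
    return [ord(c) - ord("a") + 1 for c in key] + [0]


def new_node():
    return [None] * R  # one link per possible digit, all null at first


def insert(root, key):
    node = root
    for d in digits(key):
        if node[d] is None:       # null link: add nodes for the remaining digits
            node[d] = new_node()
        node = node[d]


def search(root, key):
    node = root
    for d in digits(key):
        node = node[d]
        if node is None:          # null link: search miss
            return False
    return True                   # followed the NULLdigit link: search hit


root = new_node()
for word in "now is the time for".split():
    insert(root, word)

print(search(root, "time"))  # → True
print(search(root, "th"))    # → False
```

No key is stored anywhere; membership is encoded entirely by which links are non-null, at the cost of R links per node.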

Existence Tries

Keys: now is the time for

[Figure: existence trie for the keys now, is, the, time and for; each path from the root spells one key, and the and time share the node for t]

Multi-way Tries

• R-ary branching

• Keys stored at leaves

• Path to leaf defines prefix of key stored at leaf

• Only build tree downward until prefixes become distinct

Multi-way Tries

• Defn. 15.3: The multiway trie for a set of keys associated with leaves is:

• Set empty: null link

• Singleton set: leaf with key

• Larger set: internal node with links for each possible digit to tries built with the leading digit omitted

Multi-way Tries

• Prop. 15.6: Search or insertion in a standard R-ary trie built from N random keys takes about log_R N character comparisons, bounded by the length of the key; the number of links is about RN/ln R.

• Classic time-space tradeoff!

• Larger R = faster but more space
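
To make the tradeoff concrete, the short sketch below evaluates the two estimates from Prop. 15.6 for a few radices. The function name `trie_costs` and the choice of N = 2^20 keys are illustrative assumptions; the numbers are the proposition's approximations, not measurements.

```python
import math


def trie_costs(n_keys, radix):
    """Return (about log_R N comparisons, about R*N/ln R links) per Prop. 15.6."""
    comparisons = math.log(n_keys, radix)
    links = radix * n_keys / math.log(radix)
    return comparisons, links


for radix in (2, 16, 256):
    comparisons, links = trie_costs(2 ** 20, radix)
    print(f"R={radix:3d}: about {comparisons:.1f} comparisons, about {links:,.0f} links")
```

Going from R = 2 to R = 256 cuts the estimated comparisons from 20 to 2.5, while the estimated link count grows by a factor of 16.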

Ternary Search Trie (TST)

• Each node has a character (digit) and three links

• Left link refers to subtrie with current key digit less than that of the node

• Middle link refers to subtrie with current key digit the same

• Right link refers to subtrie with current key digit greater than node’s

Ternary Search Trie (TST)

• TST is equivalent to a multiway trie in which each node is implemented as a BST, using the characters with non-null links as keys

• Like 3-way radix sorting

• BSTs like QuickSort

• M-ary tries like RadixSort

Ternary Search Trie (TST)

• Search: start at root

• Recursively –

• Compare next character in key with character in node

• If less, take left link

• If greater, take right link

• If equal, take middle and go to next character in key

• Miss if encounter null link or reach end of key before NULLdigit

Ternary Search Trie (TST)

• Insert: start at root

• Search –

• Find location where prefix diverges

• Add new nodes for characters not consumed by search
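
The TST search and insert steps can be sketched as below for an existence TST. Assumptions for illustration: keys are Python strings, the character "\0" stands in for NULLdigit and never occurs inside a key, and the names `tst_insert` and `tst_search` are hypothetical.

```python
END = "\0"  # stands in for NULLdigit; assumed not to occur inside keys


class Node:
    def __init__(self, ch):
        self.ch = ch
        self.left = None   # keys whose current character is smaller
        self.mid = None    # keys whose current character is equal
        self.right = None  # keys whose current character is larger


def tst_insert(h, key, i=0):
    c = key[i] if i < len(key) else END
    if h is None:
        h = Node(c)  # new node for a character not consumed by the search
    if c < h.ch:
        h.left = tst_insert(h.left, key, i)
    elif c > h.ch:
        h.right = tst_insert(h.right, key, i)
    elif c != END:
        h.mid = tst_insert(h.mid, key, i + 1)
    return h


def tst_search(h, key):
    i = 0
    while h is not None:
        c = key[i] if i < len(key) else END
        if c < h.ch:
            h = h.left
        elif c > h.ch:
            h = h.right
        elif c == END:
            return True    # matched the terminator: search hit
        else:
            h = h.mid      # equal: move on to the next character
            i += 1
    return False           # ran into a null link: search miss


root = None
for word in "now is the time for".split():
    root = tst_insert(root, word)

print(tst_search(root, "the"))   # → True
print(tst_search(root, "them"))  # → False
```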

Existence TST

Keys: now is the time for

[Figure: existence TST for the keys now, is, the, time and for, with one character per node]

Ternary Search Trie (TST)

• Prop. 15.7: A search or insertion in a full TST requires time proportional to the key length. The number of links in a TST is at most three times the number of characters in all the keys.

Ternary Search Trie (TST)

• Can make more space efficient by

• putting keys in leaves at point where prefix is unique, and

• eliminating one-way branching as we did in Patricia Tries.

• Can compromise speed and space by having large branch at root (R or R^2) and rest of trie is regular TST.

• Works well if first char(s) well-distributed

Ternary Search Trie (TST)

• Nice for practical use

• Adapt to non-uniformity often seen

• Though character set may be large, often only a few are used, or are used after a particular prefix

• Don’t make many links we don’t need

• Structured format keys

• May have many symbols used

• But only a few at each part of key

Ternary Search Trie (TST)

• Nice for practical use

• Search misses are really fast!

• Can adapt for partial match searches

• “Don’t care” characters in search key

• Can adapt for “almost match” searches

• All but (any) one character match

• Access bytes or larger symbols rather than bits (as Patricia tries do); bytes are often better supported/efficient, or more natural to the keys
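
A partial-match search with "don't care" characters might look like the sketch below. It is one possible adaptation, not the slides' code: it assumes "*" is the wildcard, that neither "*" nor the terminator "\0" occurs in stored keys, and it repeats a minimal TST insert so the block stands alone. At a wildcard position all three links are explored; otherwise only the links consistent with the pattern character are followed.

```python
END = "\0"  # stands in for NULLdigit; assumed not to occur inside keys


class Node:
    def __init__(self, ch):
        self.ch = ch
        self.left = self.mid = self.right = None


def tst_insert(h, key, i=0):
    c = key[i] if i < len(key) else END
    if h is None:
        h = Node(c)
    if c < h.ch:
        h.left = tst_insert(h.left, key, i)
    elif c > h.ch:
        h.right = tst_insert(h.right, key, i)
    elif c != END:
        h.mid = tst_insert(h.mid, key, i + 1)
    return h


def tst_match(h, pattern, i=0, prefix=""):
    """Yield stored keys matching pattern; '*' matches any single character."""
    if h is None:
        return
    c = pattern[i] if i < len(pattern) else END
    if c == "*" or c < h.ch:
        yield from tst_match(h.left, pattern, i, prefix)
    if c == "*" or c == h.ch:
        if h.ch == END:
            if c == END:          # key and pattern end together: a match
                yield prefix
        else:
            yield from tst_match(h.mid, pattern, i + 1, prefix + h.ch)
    if c == "*" or c > h.ch:
        yield from tst_match(h.right, pattern, i, prefix)


root = None
for word in ["now", "is", "the", "time", "for", "tie"]:
    root = tst_insert(root, word)

print(list(tst_match(root, "t*e")))  # → ['the', 'tie']
print(list(tst_match(root, "**")))   # → ['is']
```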

Text-String-Index

• Recall String Index built with BST with string pointers into a large text

• Consider each position in text to be start of a string key that runs to the end of the text

• Build a symbol table with these keys

• Keys are all different (lengths alone suffice)

• Most are very long

• Suffix Tree = search tree for this

Text-String-Index

• BSTs are simple and work well for suffix trees

• Not likely to be a worst-case BST

• Patricia tries designed to do this!

• Need to have bit-level access

• Fast on misses

• TSTs

• Simple, take advantage of byte ops

• Can solve more complex problems

• Can change == to mean “prefix”

Text-String-Index

• If text is static, why not use Binary Search?

• Fast

• No need to support insert/delete

• Uses less memory (fewer links/pointers)

• But TSTs have some advantages

• Never retrace steps

• Support other operations

• Can also build FSM...

• But better for linear search of new text
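
For the static-text case, the binary search option can be sketched as a sorted array of suffix start positions (commonly called a suffix array, a name the slides do not use). `build_index` and `find` are hypothetical names, and the construction is deliberately naive: it sorts whole suffixes, which is fine for small texts only.

```python
def build_index(text):
    """Sort all suffix start positions by the suffix they begin (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(text, index, pattern):
    """Binary search for a suffix that starts with pattern; return its position or -1."""
    lo, hi = 0, len(index)
    while lo < hi:
        mid = (lo + hi) // 2
        start = index[mid]
        # Compare only the first len(pattern) characters of the suffix.
        if text[start:start + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(index) and text.startswith(pattern, index[lo]):
        return index[lo]
    return -1


text = "now is the time for all good people"
index = build_index(text)
print(find(text, index, "time"))  # → 11
print(find(text, index, "xyz"))   # → -1
```

The index is one integer per text position, with no links, which is the memory advantage the slide mentions; the returned position is some occurrence, not necessarily the leftmost one.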

String Search

• If problem is to look for a particular string s in a large text t

• Naïve method:

• Search t linearly for s[0]

• When match found at t[i],

• Match s[j] with t[i+j] for j = 1 to |s|-1

• If all |s| chars match, have a match!

• Else go back to searching t at t[i+1]

• Time?

• |s| times |t| - not good
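
The naïve method reads directly as the sketch below (`naive_search` is a hypothetical name). The nested loops are where the |s| times |t| worst case comes from: after a partial match the scan restarts at t[i+1].

```python
def naive_search(s, t):
    """Return the first index i with t[i:i+len(s)] == s, or -1 if there is none."""
    for i in range(len(t) - len(s) + 1):
        j = 0
        while j < len(s) and t[i + j] == s[j]:
            j += 1
        if j == len(s):   # all |s| characters matched
            return i
        # otherwise fall through and retry from t[i+1]
    return -1


print(naive_search("abraca", "abrabracadabra"))  # → 3
print(naive_search("abraca", "abracadabra"))     # → 0
print(naive_search("abraca", "cadabra"))         # → -1
```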

FSM-based String Search

• Fast way to look for a particular string s in one or more (large) texts:

• Build FSM for search string

• States represent prefix matched

• Transition either extends match or

• Fails to longest suffix of what has been seen that is a prefix of s

• Can also build for multiple search strings

Finite State Machine
a.k.a. Finite State Automaton (FSA)

Set of States S – represented as nodes in graph

Set of input symbols Σ – labels on directed edges

Transition function δ – for state and input, next state

Initial state q0 – where to start

Final set of states F – subset of S for “accept”

[Figure: example FSM with states q0 (start state), q1, q2 and q3, input symbols Σ = {a,b,c,d}, accepting states F = {q1,q2}, and labeled edges as transitions, e.g. δ(q1,c) = q3]

FSM-based String Search

Search for abraca

Build recognizer skeleton

Add suffix-is-prefix links

Add failure links

[Figure: FSM for abraca with a start state, one state per matched prefix (a, ab, abr, abra, abrac, abraca) and abraca as the final state; edges labeled a, b, "not a" and "else" show the suffix-is-prefix and failure links]

Is that all of them?

FSM-based String Search

• Linear time in |s| to build FSM for s

• Linear time (in |t|) to search large text t for all instances of s

• Can’t hope for better than that!

• What about searching for more than one string?

• Build FSM for all the strings!

• Linear time in sum of string lengths to build FSM

• Linear time in |t| to search all of t for all strings
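
For a single search string, one way to build such a machine is the table-driven construction sketched below (a Knuth-Morris-Pratt style DFA, named here as an assumption since the slides do not say which construction they use). State j means "the last j characters read match the first j characters of s". `build_dfa` and `find_all` are hypothetical names; the pattern is assumed non-empty, and taking the alphabet from the pattern and text is a convenience of this sketch.

```python
def build_dfa(pattern, alphabet):
    """dfa[state][c] is the next state; state len(pattern) means a full match."""
    m = len(pattern)
    dfa = [{c: 0 for c in alphabet} for _ in range(m + 1)]
    dfa[0][pattern[0]] = 1
    x = 0  # state reached by the longest proper suffix that is also a prefix
    for j in range(1, m + 1):
        for c in alphabet:
            dfa[j][c] = dfa[x][c]       # mismatch: behave like the fallback state
        if j < m:
            dfa[j][pattern[j]] = j + 1  # match: extend the matched prefix
            x = dfa[x][pattern[j]]
    return dfa


def find_all(pattern, text):
    """Return the start index of every occurrence, reading each text character once."""
    dfa = build_dfa(pattern, set(pattern) | set(text))
    hits, state = [], 0
    for i, c in enumerate(text):
        state = dfa[state][c]
        if state == len(pattern):
            hits.append(i - len(pattern) + 1)
    return hits


print(find_all("abraca", "abrabracadabracabraca"))
print(find_all("aa", "aaaa"))  # → [0, 1, 2]
```

The text is never re-read: each character causes exactly one table lookup. The table costs time proportional to |s| times the alphabet size to build, which is linear in |s| for a fixed alphabet.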

Summary

• Radix search

• Digital Search Trees

• Tries

• Patricia Tries

• Multiway tries and TSTs

• Text string algorithms

• FSMs for fast string matching