dictionary search

Dictionary search Exact string search Paper on Cuckoo Hashing

Upload: veda-cote

Post on 31-Dec-2015



TRANSCRIPT

Page 1: Dictionary search

Dictionary search

Exact string search

Paper on Cuckoo Hashing

Page 2: Dictionary search

Exact String Search

Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support searches for a pattern P over them.

Hashing

Page 3: Dictionary search

Hashing with chaining

Page 4: Dictionary search

Key issue: a good hash function

Basic assumption: Uniform hashing

Avg #keys per slot = n * (1/m) = n/m = α (load factor)

Page 5: Dictionary search

Search cost

Θ(1 + α) on average, i.e. O(1) when m = Θ(n)
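A minimal sketch of hashing with chaining, to make the load factor concrete. The class name `ChainedHashTable` is hypothetical, and Python's built-in `hash` merely stands in for the "good hash function" the slides assume.

```python
class ChainedHashTable:
    """Toy hash table with chaining: m slots, each holding a list (the chain)."""

    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]
        self.n = 0  # number of stored keys

    def _slot(self, key):
        # Assumption: the built-in hash behaves like a uniform hash function.
        return hash(key) % self.m

    def insert(self, key):
        chain = self.slots[self._slot(key)]
        if key not in chain:
            chain.append(key)
            self.n += 1

    def search(self, key):
        # Cost is proportional to the chain length, n/m on average.
        return key in self.slots[self._slot(key)]

    def load_factor(self):
        return self.n / self.m


table = ChainedHashTable(m=8)
for word in ["systile", "syzygetic", "syzygial", "syzygy"]:
    table.insert(word)
```

With 4 keys in 8 slots the load factor is 0.5, so chains stay short and a search touches very few keys.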

Page 6: Dictionary search

In practice

A trivial hash function is:

[formula not preserved in the transcript; the only surviving annotation is "prime"]

Page 7: Dictionary search

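A “provably good” hash is

h_a(K) = (a_0·k_0 + a_1·k_1 + ... + a_r·k_r) mod m

K = k_0 k_1 k_2 ... k_r, each chunk k_i of ≈ log2 m bits, so r ≈ L / log2 m

a = a_0 a_1 a_2 ... a_r, each a_i is selected at random in [0,m)

L = max string len

m = table size (prime; not necessarily: (... mod p) mod m)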
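A small sketch of this family, assuming the key is cut into chunks and combined with random coefficients modulo a prime table size. The names `make_hash`, `M` and `MAX_LEN` are illustrative; one byte is used as one chunk, which fits because M = 257 is a prime just above 2^8.

```python
import random

M = 257        # table size: prime, and just above 2**8 so one byte is one chunk
MAX_LEN = 16   # assumed maximum key length, in bytes


def make_hash(m=M, max_len=MAX_LEN, seed=None):
    """Draw one function h_a from the family by picking the a_i at random."""
    rng = random.Random(seed)
    a = [rng.randrange(m) for _ in range(max_len)]  # each a_i random in [0, m)

    def h(key):
        chunks = key.encode("utf-8")  # k_0, k_1, ..., one byte each
        if len(chunks) > max_len:
            raise ValueError("key longer than max_len")
        # Shorter keys behave as if padded with zero chunks.
        return sum(a_i * k_i for a_i, k_i in zip(a, chunks)) % m

    return h


h = make_hash(seed=1)
slot = h("systile")  # some value in [0, M)
```

Drawing a fresh `a` (a new seed) gives a different function from the same family, which is what the randomness in the guarantee refers to.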

Page 8: Dictionary search

Cuckoo Hashing

2 hash tables, and 2 random choices where an item can be stored

[Figure: keys A, B, C in the top row of cells and E, D in the bottom row]

Page 9: Dictionary search

A running example

[Figure: top row A, B, C; bottom row E, D; a new key F arrives]

Page 10: Dictionary search

A running example

[Figure: F has been stored; top row A, B, F, C; bottom row E, D]

Page 11: Dictionary search

A running example

[Figure: top row A, B, F, C; bottom row E, D; a new key G arrives]

Page 12: Dictionary search

A running example

[Figure: G has been stored; top row E, G, B, F, C; bottom row A, D (A and E have changed rows)]
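The eviction chain of the running example can be sketched as follows. This is a toy version under stated assumptions: the class `CuckooHash`, the salted use of Python's `hash` for the two hash functions, the `max_kicks` bound and the grow-and-rehash fallback are all illustrative choices, not taken from the slides.

```python
import random


class CuckooHash:
    """Toy cuckoo hash set: 2 tables, each key has one candidate cell per table."""

    def __init__(self, size=11, max_kicks=32, seed=0):
        self.size = size
        self.max_kicks = max_kicks
        self.rng = random.Random(seed)
        self.tables = [[None] * size, [None] * size]
        self._new_salts()

    def _new_salts(self):
        # Two salts stand in for two independent hash functions.
        self.salts = [self.rng.getrandbits(32), self.rng.getrandbits(32)]

    def _pos(self, i, key):
        return hash((self.salts[i], key)) % self.size

    def contains(self, key):
        # A lookup probes exactly two cells.
        return any(self.tables[i][self._pos(i, key)] == key for i in (0, 1))

    def insert(self, key):
        if self.contains(key):
            return
        i = 0
        for _ in range(self.max_kicks):
            p = self._pos(i, key)
            # Place the key and pick up whatever was stored there.
            key, self.tables[i][p] = self.tables[i][p], key
            if key is None:
                return
            i = 1 - i  # the evicted key moves to its cell in the other table
        self._rehash(key)  # too many evictions: rebuild with new functions

    def _rehash(self, pending):
        keys = [k for t in self.tables for k in t if k is not None] + [pending]
        self.size = 2 * self.size + 1
        self.tables = [[None] * self.size, [None] * self.size]
        self._new_salts()
        for k in keys:
            self.insert(k)


cuckoo = CuckooHash()
for k in "ABCDEFG":
    cuckoo.insert(k)
```

Whatever evictions happen during insertion, every stored key ends up in one of its two candidate cells, which is why `contains` never needs more than two probes.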

Page 13: Dictionary search

Cuckoo Hashing Examples

A B C

E D F

G

Random (bipartite) graph: node=cell, edge=key

Page 14: Dictionary search

Natural Extensions

More than 2 hashes (choices) per key.

Very different: hypergraphs instead of graphs.

Higher memory utilization

3 choices: 90+% in experiments

4 choices: about 97%

2 hashes + bins of B-size.

Balanced allocation and tightly O(1)-size bins

Insertion sees a tree of possible evict+ins paths

but more insert time (and random access)

more memory... but more local

Page 15: Dictionary search

Dictionary search

Making one-side errors

Paper on Bloom Filter

Page 16: Dictionary search

Crawling

How to keep track of the URLs visited by a crawler?

URLs are long

Check should be very fast

No care about small errors (≈ page not crawled)

Bloom Filter over crawled URLs
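A minimal Bloom filter sketch for the crawler scenario. The class `BloomFilter` and the way the k positions are derived (two values taken from one SHA-256 digest and combined as h1 + i·h2) are assumptions made for illustration; the slides do not specify the hash functions.

```python
import hashlib


class BloomFilter:
    """Bit array of m bits; each item sets/tests k positions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        # Assumption: k positions simulated from two hash values.
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


seen = BloomFilter(m=8000, k=5)
crawled = [
    "http://checkmate.com/All_Natural/Applied.html",
    "http://checkmate.com/All_Natural/Aroma.html",
    "http://checkmate.com/All_Natural/Ayate.html",
]
for url in crawled:
    seen.add(url)
```

The filter stores only m bits regardless of URL length, and a membership check costs k bit probes. A wrong "yes" only means a page is skipped by the crawler.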

Page 17: Dictionary search

Searching with errors...

Page 18: Dictionary search
Page 19: Dictionary search
Page 20: Dictionary search

Problem: false positives

Page 21: Dictionary search


Page 22: Dictionary search

Not perfectly true but...

Page 23: Dictionary search

[Plot: false positive rate (y-axis, 0 to 0.1) versus number of hash functions k (x-axis, 0 to 10), for m/n = 8]

m/n = 8: Opt k = 8 ln 2 = 5.54...

We do have an explicit formula for the optimal k: k = (m/n) ln 2
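The curve can be recomputed from the usual approximation for the false positive rate, (1 - e^(-kn/m))^k, which is minimized at k = (m/n) ln 2. A short sketch; the function names are illustrative.

```python
import math


def false_positive_rate(bits_per_key, k):
    """Approximate rate (1 - e^(-k n/m))^k, with bits_per_key = m/n."""
    return (1.0 - math.exp(-k / bits_per_key)) ** k


def optimal_k(bits_per_key):
    """Real-valued k minimizing the approximation: (m/n) ln 2."""
    return bits_per_key * math.log(2)


# Tabulate the curve for m/n = 8 and k = 1..10.
rates = {k: false_positive_rate(8, k) for k in range(1, 11)}
best_integer_k = min(rates, key=rates.get)
```

For m/n = 8 the real-valued optimum is about 5.55, and among integer values k = 6 gives the lowest rate, roughly 2%.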

Page 24: Dictionary search
Page 25: Dictionary search
Page 26: Dictionary search

Dictionary search

Prefix-string search

Reading 3.1 and 5.2

Page 27: Dictionary search

Prefix-string Search

Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix searches for a pattern P over them.

Page 28: Dictionary search

Trie: speeding-up searches

[Figure: compacted trie over the dictionary strings, with numeric node labels; surviving edge labels: s, y, z, stile, zyg, etic, ial, y, aibelyite, czecin, omo]

Pro: O(p) search time

Cons: edge + node labels and tree structure
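A minimal sketch of prefix search on a trie. For brevity this is an uncompacted trie (one character per edge), not the compacted trie of the figure; the class and method names are illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.is_end = False  # True if a dictionary string ends here


class Trie:
    def __init__(self, strings=()):
        self.root = TrieNode()
        for s in strings:
            self.insert(s)

    def insert(self, s):
        node = self.root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def with_prefix(self, p):
        """Return all stored strings that start with p, in sorted order."""
        node = self.root
        for ch in p:  # O(p) descent to the node spelling p
            node = node.children.get(ch)
            if node is None:
                return []
        out, stack = [], [(node, p)]
        while stack:  # collect every string in the subtree
            node, s = stack.pop()
            if node.is_end:
                out.append(s)
            for ch, child in node.children.items():
                stack.append((child, s + ch))
        return sorted(out)


trie = Trie(["systile", "syzygetic", "syzygial", "syzygy"])
matches = trie.with_prefix("syzyg")
```

Locating the node for P takes O(p) steps; the remaining cost is proportional to the size of the subtree that holds the answers.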

Page 29: Dictionary search

Front-coding: squeezing strings

http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html

http://checkmate.com/All/Natural/Washcloth.html...

0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html

3345%

0 http://checkmate.com/All/Natural/Washcloth.html...

…. systile syzygetic syzygial syzygy …. (shared-prefix lengths: 2 5 5)

Gzip may be much better...
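Front coding can be sketched in a few lines: each string is stored as the length of the prefix it shares with the previous string, plus the remaining suffix. The function names are illustrative.

```python
def front_encode(sorted_strings):
    """Encode a sorted list as (shared_prefix_length, suffix) pairs."""
    out, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        limit = min(len(prev), len(s))
        while lcp < limit and prev[lcp] == s[lcp]:
            lcp += 1
        out.append((lcp, s[lcp:]))
        prev = s
    return out


def front_decode(pairs):
    """Rebuild the strings; each one needs the previously decoded string."""
    out, prev = [], ""
    for lcp, suffix in pairs:
        prev = prev[:lcp] + suffix
        out.append(prev)
    return out


words = ["systile", "syzygetic", "syzygial", "syzygy"]
encoded = front_encode(words)
# encoded == [(0, "systile"), (2, "zygetic"), (5, "ial"), (5, "y")]
```

Decoding is inherently sequential, since every entry depends on the one before it. That is the reason for restarting the encoding at regular intervals when random access is needed.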

Page 30: Dictionary search

2-level indexing

Internal Memory: CT on a sample (systile, szaibelyite)

Disk: front-coded buckets

…. 0 systile 2 zygetic 5 ial 5 y 0 szaibelyite 2 czecin 2 omo ….

2 advantages:

• Search ≈ typically 1 I/O

• Space ≈ Front-coding over buckets

A disadvantage:

• Trade-off ≈ speed vs space (because of bucket size)
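A rough sketch of the 2-level scheme under simplifying assumptions: the in-memory level is a sorted list of bucket heads searched with `bisect` (standing in for the trie on the sample), and the "disk" is a plain Python list of front-coded buckets. The word list is reassembled from the fragments on the slide, and all function names are illustrative.

```python
import bisect


def lcp_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def build_index(sorted_strings, bucket_size):
    """Return (heads kept in memory, front-coded buckets kept 'on disk')."""
    heads, buckets = [], []
    for i in range(0, len(sorted_strings), bucket_size):
        group = sorted_strings[i:i + bucket_size]
        heads.append(group[0])
        encoded, prev = [], ""
        for s in group:  # front coding restarts in every bucket
            k = lcp_len(prev, s)
            encoded.append((k, s[k:]))
            prev = s
        buckets.append(encoded)
    return heads, buckets


def decode_bucket(encoded):
    out, prev = [], ""
    for k, suffix in encoded:
        prev = prev[:k] + suffix
        out.append(prev)
    return out


def prefix_search(heads, buckets, p):
    # Last bucket whose head is <= p: the first place a match can occur.
    b = max(bisect.bisect_right(heads, p) - 1, 0)
    results = []
    while b < len(buckets):
        strings = decode_bucket(buckets[b])  # one simulated I/O per bucket
        results.extend(s for s in strings if s.startswith(p))
        last = strings[-1]
        if last >= p and not last.startswith(p):
            break  # everything after this bucket is larger and cannot match
        b += 1
    return results


words = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
heads, buckets = build_index(words, bucket_size=4)
```

Larger buckets shrink the in-memory sample and compress better, but each search then decodes more strings per bucket, which is the speed vs space trade-off.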