hash tables - bguds192/wiki.files/08-hash.pdf · hash tables. distributed storage: consistent...

64
Hash tables Hash tables

Upload: others

Post on 17-Aug-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Hash tables

Hash tables

Page 2: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Dictionary

Definition

A dictionary is a data structure that stores a set of elementswhere each element has a unique key, and supports thefollowing operations:

Search(S , k) Return the element whose key is k .

Insert(S , x) Add x to S .

Delete(S , x) Remove x from S

Assume that the keys are from U = 0, . . . , u − 1.

Hash tables

Page 3: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Direct addressing

S is stored in an array T [0..u − 1]. The entry T [k]contains a pointer to the element with key k , if suchelement exists, and NULL otherwise.

Search(T , k): return T [k].

Insert(T , x): T [x .key]← x .

Delete(T , x): T [x .key]← NULL.

What is the problem with this structure?0123456789

23

5

8

keysatellitedata

1

94

07

6

25

38

U

S

Hash tables

Page 4: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Hashing

In order to reduce the space complexity of directaddressing, map the keys to a smaller range0, . . .m − 1 using a hash function.

There is a problem of collisions (two or more keys thatare mapped to the same value).

There are several methods to handle collisions.

ChainingOpen addressing

Hash tables

Page 5: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Hash table with chaining

Let h : U → 0, . . . ,m − 1 be a hash function (m < u).

S is stored in a table T [0..m − 1] of linked lists. Theelement x ∈ S is stored in the list T [h(x .key)].

Search(T , k): Search the list T [h(k)].

Insert(T , x): Insert x at the head of T [h(x .key)].

Delete(T , x): Delete x from the list containing x .

19 9

6 26S=6,9,19,26,30

m=5, h(x)=x mod 5

30

Hash tables

Page 6: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Analysis — Worst case

The worst case running times of the operations are:

Search: Θ(n).

Insert: Θ(1).

Delete: Θ(n) (can be reduced to Θ(1) using doublylinked list).

Hash tables

Page 7: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Analysis

Assumption of simple uniform hashing: any element isequally likely to hash into any of the m slots,independently of where other elements have hashed into.

The above assumption is true when the keys are chosenuniformly and independently at random (withrepetitions), and the hash function satisfies|k ∈ U : h(k) = i| = u/m for every i ∈ 0, . . . ,m− 1.We want to analyze the performance of hashing under theassumption of simple uniform hashing. This is the ballsinto bins problem.

Suppose we randomly place n balls into m bins. Let X bethe number of balls in bin 1.

The time complexity of a random search in a hash table isΘ(1 + X ).

Hash tables

Page 8: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

The expectation of X

Each ball has probability 1/m to be in bin 1.

The random variable X has binomial distribution withparameters n and 1/m.

Therefore, E [X ] = n/m.

Hash tables

Page 9: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

The distribution of X

Claim

Pr[X = r ] ≈ e−ααr

r !, where α = n/m (i.e., X has approximately

Poisson distribution).

Example

If α = 1,

Pr[X = 0] ≈ 0.368 Pr[X = 4] ≈ 0.015

Pr[X = 1] ≈ 0.368 Pr[X = 5] ≈ 0.003

Pr[X = 2] ≈ 0.184 Pr[X = 6] ≈ 0.0005

Pr[X = 3] ≈ 0.061 Pr[X = 7] ≈ 0.00007

Hash tables

Page 10: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

The distribution of X

Claim

Pr[X = r ] ≈ e−ααr

r !, where α = n/m (i.e., X has approximately

Poisson distribution).

Proof.

Pr[X = r ] =

(n

r

)(1

m

)r (1− 1

m

)n−r

=n(n − 1) · · · (n − r + 1)

r !

1

mr

(1− 1

m

)n−r

If m and n are large, n(n − 1) · · · (n − r + 1) ≈ nr and(1− 1

m

)n−r ≈ e−n/m. Thus, Pr[X = r ] ≈ e−n/m(n/m)r

r !.

Hash tables

Page 11: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

The maximum number of balls in a bin

Suppose that n = m = 106. Most bins contain 0–3 balls.

The probability that a specific bin contains at least 8 ballsis ≈ 0.000009.

However, it is very likely that some bin will contain 8balls.

Theorem

If m = Θ(n) then the size of the largest bin isΘ(log n/ log log n) with probability at least 1− 1/nΘ(1).

Hash tables

Page 12: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Rehashing

We wish to maintain n = O(m) in order to have Θ(1)search time.

This can be achieved by rehashing. Suppose we wantn ≤ m. When the table has m elements and a newelement is inserted, create a new table of size 2m andcopy all elements into the new table.

19 34

6 21

h(x)=x mod 5

30

34

21

6

30

h'(x)=x mod 10

19

Hash tables

Page 13: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Rehashing

We wish to maintain n = O(m) in order to have Θ(1)search time.

This can be achieved by rehashing. Suppose we wantn ≤ m. When the table has m elements and a newelement is inserted, create a new table of size 2m andcopy all elements into the new table.The cost of rehashing is Θ(n).

Hash tables

Page 14: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Rehashing

In order to keep the space usage Θ(n), we needn = Ω(m).

We can keep α ∈ [ 14, 1]. If an insert would make α > 1,

create a new table of size 2m. If a delete would makeα < 1

4, create a new table of size m/2.

Hash tables

Page 15: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Universal hash functions

Random keys + fixed hash function (e.g. mod)⇒ The hashed keys are random numbers.

A fixed set of keys + random hash function (selectedfrom a universal set)⇒ The hashed keys are semi-random numbers.

Hash tables

Page 16: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Universal hash functions

Definition

A collection H of hash functions is a universal if for every pairof distinct keys x , y ∈ U , Prh∈H[h(x) = h(y)] ≤ 1

m.

Example

Let p be a prime number larger than u.fa,b(x) = ((ax + b) mod p) mod mHp,m = fa,b|a ∈ 1, 2, . . . , p − 1, b ∈ 0, 1, . . . , p − 1

Page 17: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Universal hash functions

Theorem

Suppose that H is a universal collection of hash functions.If a hash table for S is built using a randomly chosen h ∈ H,then for every k ∈ U, the expected time of Search(S , k) isΘ(1 + n/m).

Proof.

Let X = length of T [h(k)].X =

∑y∈S Iy where Iy = 1 if h(y .key) = h(k) and Iy = 0

otherwise.

E [X ] = E

[∑y∈S

Iy

]=∑y∈S

E [Iy ] =∑y∈S

Prh∈H

[h(y .key) = h(k)]

If k /∈ S , E [X ] ≤ n · 1m

. Otherwise, E [X ] ≤ 1 + (n − 1) 1m

.

Hash tables

Page 18: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Time complexity of chaining

Under the assumption of simple uniform hashing, theexpected time of a search is Θ(1 + α) time.

If α = Θ(1), and under the assumption of simple uniformhashing, the worst case time of a search isΘ(log n/ log log n), with probability at least 1− 1/nΘ(1).

If the hash function is chosen from a universal collectionat random, the expected time of a search is Θ(1 + α).

The worst case time of insert is Θ(1) if there is norehashing.

Hash tables

Page 19: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Interpreting keys as natural numbers

How can we convert floats or ASCII strings to naturalnumbers?

An ASCII string can be interpreted as a number in base256.

Example

For the string ABXY, the ASCII values of the characters areA = 65, B = 66, X = 88, Y = 89. So CLRS is(65 ·2563)+(66 ·2562)+(88 ·2561)+(89 ·2560) = 1094867033.

Hash tables

Page 20: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Horner’s rule

Horner’s rule:

adxd + ad−1x

d−1 + · · ·+ a1x1 + a0 =

(· · · ((adx + ad−1)x + ad−2)x + · · · )x + a0.

Example:

65 · 2563 + 66 · 2562 + 88 · 2561 + 89 · 2560 =

((65 · 256 + 66) · 256 + 88) · 256 + 89

If d is large the value of y is too big.

Solution: evaluate the polynomial modulo p.

(1) y ← ad(2) for i = d − 1 to 0(3) y ← ai + xy mod p

Hash tables

Page 21: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Rabin-Karp pattern matching algorithm

Suppose we are given strings P and T , and we want tofind all occurences of P in T .The Rabin-Karp algorithm is as follows:

Compute h(P)for every substring T ′ of T of length |P |

if h(T ′) = h(P) check whether T ′ = P .The values h(T ′) for all T ′ can be computed in Θ(|T |)time using rolling hash.

Example

Let T = CABXYZ, P = ABXY. Suppose that the hashfunction is h(X ) = num(X ) mod p.

h(CABX ) = (67 · 2563 + 65 · 2562 + 66 · 256 + 88) mod p

h(ABXY ) = (65 · 2563 + 66 · 2562 + 88 · 256 + 89) mod p

= (h(CABX )− 67 · 2563) · 256 + 89 mod pHash tables

Page 22: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Applications

Data deduplication: Suppose that we have many files, andsome files have duplicates. In order to save storage space,we want to store only one instance of each distinct file.

Distributed storage: Suppose we have many files, and wewant to store them on several servers.

Hash tables

Page 23: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Distributed storage

Let m denote the number of servers.

The simple solution is to use a hash functionh : U → 1, . . . ,m, and assign file x to server h(x).

The problem with this solution is that if we add a server,we need to do rehashing which will move most filesbetween servers.

Hash tables

Page 24: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Distributed storage: Consistent hashing

Suppose that the server have identifiers s1, . . . , sm.

Let h : U → [0, 1] be a hash function.

For each server i associate a point h(si) on the unit circle.

For each file f , assign f to the server whose point is thefirst point encountered when traversing the unit cycleanti-clockwise starting from h(f ).

0

server 1

server 3

server 2

Hash tables

Page 25: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Distributed storage: Consistent hashing

Suppose that the server have identifiers s1, . . . , sm.

Let h : U → [0, 1] be a hash function.

For each server i associate a point h(si) on the unit circle.

For each file f , assign f to the server whose point is thefirst point encountered when traversing the unit cycleanti-clockwise starting from h(f ).

0

server 1

server 3

server 2

Hash tables

Page 26: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Distributed storage: Consistent hashing

When a new server m + 1 is added, let i be the serverwhose point is the first server point after h(sm+1).

We only need to reassign some of the files that wereassigned to server i .

The expected number of files reassignments is n/(m + 1).

0

server 1

server 3

server 2

server 4

Hash tables

Page 27: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Linear probing

In the following, we assume that the elementsin the hash table are keys with no satelliteinformation.

To insert element k , try inserting to T [h(k)].If T [h(k)] is not empty, try T [h(k) + 1 mod m],then try T [h(k) + 2 mod m] etc.

Insert(T , k)

(1) j ← h(k)(2) for i = 0 to m − 1(3) if T [j ] = NULL(4) T [j ]← k(5) return(6) j ← j + 1 mod m(7) error “hash table overflow”

0123456789

h(k)=k mod 10

Hash tables

Page 28: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Linear probing

In the following, we assume that the elementsin the hash table are keys with no satelliteinformation.

To insert element k , try inserting to T [h(k)].If T [h(k)] is not empty, try T [h(k) + 1 mod m],then try T [h(k) + 2 mod m] etc.

Insert(T , k)

(1) j ← h(k)(2) for i = 0 to m − 1(3) if T [j ] = NULL(4) T [j ]← k(5) return(6) j ← j + 1 mod m(7) error “hash table overflow”

0123456789

14

insert(T,14)

Hash tables

Page 29: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Linear probing

In the following, we assume that the elementsin the hash table are keys with no satelliteinformation.

To insert element k , try inserting to T [h(k)].If T [h(k)] is not empty, try T [h(k) + 1 mod m],then try T [h(k) + 2 mod m] etc.

Insert(T , k)

(1) j ← h(k)(2) for i = 0 to m − 1(3) if T [j ] = NULL(4) T [j ]← k(5) return(6) j ← j + 1 mod m(7) error “hash table overflow”

0123456789

14

20

insert(T,20)

Hash tables

Page 30: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Linear probing

In the following, we assume that the elementsin the hash table are keys with no satelliteinformation.

To insert element k , try inserting to T [h(k)].If T [h(k)] is not empty, try T [h(k) + 1 mod m],then try T [h(k) + 2 mod m] etc.

Insert(T , k)

(1) j ← h(k)(2) for i = 0 to m − 1(3) if T [j ] = NULL(4) T [j ]← k(5) return(6) j ← j + 1 mod m(7) error “hash table overflow”

0123456789

14

20

34

insert(T,34)

Hash tables

Page 31: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Linear probing

In the following, we assume that the elementsin the hash table are keys with no satelliteinformation.

To insert element k , try inserting to T [h(k)].If T [h(k)] is not empty, try T [h(k) + 1 mod m],then try T [h(k) + 2 mod m] etc.

Insert(T , k)

(1) j ← h(k)(2) for i = 0 to m − 1(3) if T [j ] = NULL(4) T [j ]← k(5) return(6) j ← j + 1 mod m(7) error “hash table overflow”

0123456789

14

20

34

19insert(T,19)

Hash tables

Page 32: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Linear probing

In the following, we assume that the elementsin the hash table are keys with no satelliteinformation.

To insert element k , try inserting to T [h(k)].If T [h(k)] is not empty, try T [h(k) + 1 mod m],then try T [h(k) + 2 mod m] etc.

Insert(T , k)

(1) j ← h(k)(2) for i = 0 to m − 1(3) if T [j ] = NULL(4) T [j ]← k(5) return(6) j ← j + 1 mod m(7) error “hash table overflow”

0123456789

14

20

34

19

55

insert(T,55)

Hash tables

Page 33: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Linear probing

In the following, we assume that the elementsin the hash table are keys with no satelliteinformation.

To insert element k , try inserting to T [h(k)].If T [h(k)] is not empty, try T [h(k) + 1 mod m],then try T [h(k) + 2 mod m] etc.

Insert(T , k)

(1) j ← h(k)(2) for i = 0 to m − 1(3) if T [j ] = NULL(4) T [j ]← k(5) return(6) j ← j + 1 mod m(7) error “hash table overflow”

0123456789

14

20

34

19

55

49

insert(T,49)

Hash tables

Page 34: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Searching

Search(T , k)

(1) j ← h(k)(2) for i = 0 to m − 1(3) if T [j ] = k(4) return j(5) if T [j ] = NULL(6) return NULL(7) j ← j + 1 mod m(8) return NULL

0123456789

14

20

34

19

55

49

search(T,55)Hash tables

Page 35: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Searching

Search(T , k)

(1) j ← h(k)(2) for i = 0 to m − 1(3) if T [j ] = k(4) return j(5) if T [j ] = NULL(6) return NULL(7) j ← j + 1 mod m(8) return NULL

0123456789

14

20

34

19

55

49

search(T,25)Hash tables

Page 36: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Deletion

Delete method 1: To delete element k , store inT [h(k)] a special value DELETED.

Change line (3) in Insert to:

if T [j ] = NULL OR T [j ] = DELETED

0123456789

14

20

19

55

49

delete 34

D

45

Hash tables

Page 37: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Deletion

Delete method 2: Erase k from the table (re-place it by NULL) and also erase the consecutiveblock of elements following k . Then, reinsertthe latter elements to the table.

0123456789

14

20

34

19

55

49

delete 34

45

Hash tables

Page 38: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Deletion

Delete method 2: Erase k from the table (re-place it by NULL) and also erase the consecutiveblock of elements following k . Then, reinsertthe latter elements to the table.

0123456789

14

20

19

49

55

Hash tables

Page 39: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Deletion

Delete method 2: Erase k from the table (re-place it by NULL) and also erase the consecutiveblock of elements following k . Then, reinsertthe latter elements to the table.

0123456789

14

20

19

49

5545

Hash tables

Page 40: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Open addressing

Open addressing is a generalization oflinear probing.

Leth : U × 0, . . .m − 1 → 0, . . .m − 1be a hash function such thath(k , 0), h(k , 1), . . . h(k ,m − 1) is apermutation of 0, . . .m − 1 for everyk ∈ U .

The slots examined during search/insertare h(k , 0), then h(k , 1), h(k , 2) etc.

In the example on the right,h(k , i) = (h1(k) + ih2(k)) mod 13 whereh1(k) = k mod 13h2(k) = 1 + (k mod 11)

0123456789

101112

79

6998

72

50

14

insert(T,14)

h(14,0)

h(14,1)

h(14,2)

Hash tables

Page 41: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Open addressing

Insertion is the same as in linear probing:Insert(T , k)

(1) for i = 0 to m − 1(2) j ← h(k , i)(3) if T [j ] = NULL OR T [j ] = DELETED(4) T [j ]← k(5) return(6) error “hash table overflow”

Deletion is done using delete method 1 defined above(using special value DELETED).

Hash tables

Page 42: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Double hashing

In the double hashing method,

h(k , i) = (h1(k) + ih2(k)) mod m

for some hash functions h1 and h2.The value h2(k) must be relatively prime to m for theentire hash table to be searched. This can be ensured byeither

Taking m to be a power of 2, and the image of h2

contains only odd numbers.Taking m to be a prime number, and the image of h2

contains integers from 1, . . . ,m − 1.For example,

h1(k) = k mod m

h2(k) = 1 + (k mod m′)

where m is prime and m′ < m.Hash tables

Page 43: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Time complexity of open addressing

Assume uniform hashing: the probe sequence of each keyis equally likely to be any of the m! permutations of0, . . .m − 1.Assuming uniform hashing and no deletions, the expectednumber of probes in a search is

At most 11−α for unsuccessful search.

At most 1α ln 1

1−α for successful search.

Hash tables

Page 44: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Comparison of hash table methods – Search

8 10 12 14 16 18 20 220

50

100

150

200

250

300

log n

Clo

ck C

ycle

s

Successful Lookup

CuckooTwo−Way ChainingChained HashingLinear Probing

Hash tables

Page 45: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Comparison of hash table methods – Search

8 10 12 14 16 18 20 220

50

100

150

200

250

300

log n

Clo

ck C

ycle

s

Unsuccessful Lookup

CuckooTwo−Way ChainingChained HashingLinear Probing

Hash tables

Page 46: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Comparison of hash table methods – Insert

8 10 12 14 16 18 20 220

50

100

150

200

250

300

350

400

450

log n

Clo

ck C

ycle

s

Insert

CuckooTwo−Way ChainingChained HashingLinear Probing

Hash tables

Page 47: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Comparison of hash table methods – Delete

8 10 12 14 16 18 20 220

50

100

150

200

250

300

350

log n

Clo

ck C

ycle

s

Delete

CuckooTwo−Way ChainingChained HashingLinear Probing

Hash tables

Page 48: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Bloom Filter

Bloom Filter

Page 49: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Blocking malicious web pages

Suppose we want to implement blocking of malicious webpages in a web browser.

Assume we have a list of 10,000,000 malicious pages.

We can store all pages in a hash table, but this requires alarge amount of memory.

Bloom filter is an implementation of static dictionary thatuses small amount of memory.

For a query key that is in the set, the Bloom filter alwaysreturns a positive answer.

For a query key that is not in the set, the Bloom filtercan return either a negative answer, or a positive answer(false positive).

Bloom Filter

Page 50: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Blocking malicious web pages

Suppose that the list of malicious pages is stored in aBloom filter.

For a query URL, if the answer is negative, we know thatthe URL is not a malicious page.

If the answer is positive, we do not know the correctanswer. In this case, we get the answer from a server.

We want a small probability for a false positive.

Bloom Filter

Page 51: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Bloom filter — simple version

Suppose we want to store a set S .

We use an array A of size m initialized to 0.

We then choose a hash function h, and set A[h(x)]← 1for every x ∈ S .

For a query y , if A[h(y)] = 0, we know that y /∈ S .

If A[h(y)] = 1, either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

0

Bloom Filter

Page 52: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Bloom filter — simple version

Suppose we want to store a set S .

We use an array A of size m initialized to 0.

We then choose a hash function h, and set A[h(x)]← 1for every x ∈ S .

For a query y , if A[h(y)] = 0, we know that y /∈ S .

If A[h(y)] = 1, either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

01

Bloom Filter

Page 53: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Bloom filter — simple version

Suppose we want to store a set S .

We use an array A of size m initialized to 0.

We then choose a hash function h, and set A[h(x)]← 1for every x ∈ S .

For a query y , if A[h(y)] = 0, we know that y /∈ S .

If A[h(y)] = 1, either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

011

Bloom Filter

Page 54: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Bloom filter — simple version

Suppose we want to store a set S .

We use an array A of size m initialized to 0.

We then choose a hash function h, and set A[h(x)]← 1for every x ∈ S .

For a query y , if A[h(y)] = 0, we know that y /∈ S .

If A[h(y)] = 1, either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

011

Bloom Filter

Page 55: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Bloom filter — simple version

Suppose we want to store a set S .

We use an array A of size m initialized to 0.

We then choose a hash function h, and set A[h(x)]← 1for every x ∈ S .

For a query y , if A[h(y)] = 0, we know that y /∈ S .

If A[h(y)] = 1, either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

011

Bloom Filter

Page 56: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Probability of false positive

Assume simple uniform hashing: any element is equallylikely to hash into any of the m slots, independently ofwhere other elements have hashed into.

For fixed i , the probability that A[i ] = 0 isp = (1− 1/m)n.

For y /∈ S , the probability that A[h(y)] = 1 is 1− p.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

011

Bloom Filter

Page 57: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Reducing the false positive probability

We choose k hash functions h1, h2, . . . , hk , and setA[hj(x)]← 1 for j = 1, 2, . . . k , for every x ∈ S .

For a query y , if A[hj(y)] = 0 for some j , we know thaty /∈ S .

If A[hj(y)] = 1 for all j , either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

0

Bloom Filter

Page 58: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Reducing the false positive probability

We choose k hash functions h1, h2, . . . , hk , and setA[hj(x)]← 1 for j = 1, 2, . . . k , for every x ∈ S .

For a query y , if A[hj(y)] = 0 for some j , we know thaty /∈ S .

If A[hj(y)] = 1 for all j , either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

01 11

Bloom Filter

Page 59: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Reducing the false positive probability

We choose k hash functions h1, h2, . . . , hk , and setA[hj(x)]← 1 for j = 1, 2, . . . k , for every x ∈ S .

For a query y , if A[hj(y)] = 0 for some j , we know thaty /∈ S .

If A[hj(y)] = 1 for all j , either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

01 1111

Bloom Filter

Page 60: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Reducing the false positive probability

We choose k hash functions h1, h2, . . . , hk , and setA[hj(x)]← 1 for j = 1, 2, . . . k , for every x ∈ S .

For a query y , if A[hj(y)] = 0 for some j , we know thaty /∈ S .

If A[hj(y)] = 1 for all j , either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

01 1111

Bloom Filter

Page 61: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Reducing the false positive probability

We choose k hash functions h1, h2, . . . , hk , and setA[hj(x)]← 1 for j = 1, 2, . . . k , for every x ∈ S .

For a query y , if A[hj(y)] = 0 for some j , we know thaty /∈ S .

If A[hj(y)] = 1 for all j , either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

01 1111

Bloom Filter

Page 62: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Probability of false positive

For fixed i , the probability that A[i ] = 0 isp = (1− 1/m)kn.

For y /∈ S , the probability that A[h(y)] = 1 is (1− p)k .

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

01 1111

Bloom Filter

Page 63: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Optimizing the probability of false positive

For the previous example (n = 2, m = 10):

For k = 1, p = 0.81, (1− p)1 = 0.19.

For k = 2, p ≈ 0.656, (1− p)2 ≈ 0.118.

For k = 3, p ≈ 0.531, (1− p)3 ≈ 0.103.

For k = 4, p ≈ 0.430, (1− p)4 ≈ 0.105.

For k = 5, p ≈ 0.349, (1− p)5 ≈ 0.117.

Bloom Filter

Page 64: Hash tables - BGUds192/wiki.files/08-hash.pdf · Hash tables. Distributed storage: Consistent hashing When a new server m + 1 is added, let i be the server whose point is the rst

Optimizing the probability of false positive

Let f (k) be the probability of false positive.

p = (1− 1/m)kn ≈ e−kn/m

f (k) = (1− p)k ≈ (1− e−kn/m)k = ek·ln(1−e−kn/m)

Minimizing f (k) is equivalent to minimizingg(k) = k · ln(1− e−kn/m).dgdk

= ln(1− e−kn/m) + knm

e−kn/m

1−e−kn/m .

The minimum of g(k) is when dg/dk = 0 =⇒ k = mn

ln 2.

The minimum of f (k) is ≈ 0.6185m/n.

Example: For m = 8n, mn

ln 2 ≈ 5.55. f (5) ≈ 0.0217,f (6) ≈ 0.0216.

Bloom Filter