Compact data structures: Bloom filters

Luca Becchetti

“Sapienza” Università di Roma – Rome, Italy

April 7, 2010



1 Dictionaries and Bloom filters

2 The maths of Bloom filters

3 Applications of Bloom filters


Dictionaries

A dynamic set S of objects from a discrete universe U, on which (at least) the following operations are possible:

Item insertion

Item deletion

Set membership: decide whether item x ∈ S

Typically, it is assumed that each element in S is uniquely identified by a key. Let obj(k) be the object with key k:

Operations

insert(x, S): insert item x

delete(k, S): delete item whose key is k

retrieve(k, S): retrieve obj(k)

This is a minimal set of operations. Any database implements a (greatly augmented) dictionary


Testing for membership

Dictionaries are often very large in practice

Any of the operations above potentially involves access to secondary storage

Set membership

Retrieval (deletion) can be restated as follows:

if obj(k) ∈ S then retrieve(k, S) (delete(k, S))

Set membership ismember(k, S): if false then obj(k) ∉ S

Why this: membership can be tested efficiently using compact data structures

The check can often be done in main memory

No need to access secondary storage if false


Example: spell-checker

Provide first level of spell checking for a text editor

Must quickly report spelling mistakes to user

Exact check

Need efficient data structure

Trees are typically used

Terms correspond to nodes (typically leaves) of the tree

Thesaurus on the order of $10^5$ to $10^6$ terms

May be too large for quick response times

Idea: trade accuracy for efficiency


Bloom filters

Used to provide a compact summary of a set of keys

Key k hashed t times on [m] = {0, . . . , m − 1} using t “independent” hash functions

Binary array B of size m (m typically a prime)

For the moment: only insertions and set membership

Figure: Bloom filter. Key k is hashed to positions h1(k), h2(k), ..., ht(k) of the array B (indices 0 to m − 1), and each of these positions is set to 1.


Use of Bloom filters (object retrieval)

Figure: timeline of a retrieval, with the Bloom filter in main memory in front of the database. Labels: retrieve(k), ismember(k), true, obj(k); steps numbered (1) to (4).

Potential savings for retrieval (insertion/deletion)

- (3) and (4) do not occur if ismember(k) returns false
- Bloom filter stored in main memory
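The retrieval flow above can be sketched in a few lines of Python. This is only an illustration: `GuardedStore`, `ExactSummary` and the access counter are hypothetical names, the "database" is a plain in-memory dict, and an exact key set stands in for the Bloom filter so that the control flow can be read on its own. The sketch assumes that steps (3) and (4) are the database access and the return of obj(k).

```python
from typing import Dict, Optional


class ExactSummary:
    """Placeholder for the Bloom filter: an exact in-memory key set."""

    def __init__(self) -> None:
        self._keys = set()

    def insert(self, key: str) -> None:
        self._keys.add(key)

    def ismember(self, key: str) -> bool:
        return key in self._keys


class GuardedStore:
    """Dictionary on slow storage, fronted by an in-memory membership test."""

    def __init__(self, summary) -> None:
        self.summary = summary
        self.database: Dict[str, str] = {}   # stand-in for secondary storage
        self.database_accesses = 0           # number of simulated disk reads

    def insert(self, key: str, obj: str) -> None:
        self.summary.insert(key)
        self.database[key] = obj

    def retrieve(self, key: str) -> Optional[str]:
        if not self.summary.ismember(key):   # check in main memory
            return None                      # database is never touched
        self.database_accesses += 1          # database access
        return self.database.get(key)        # None here would be a false positive


store = GuardedStore(ExactSummary())
store.insert("k1", "object 1")
found = store.retrieve("k1")
missing = store.retrieve("k2")
```

Any object offering `insert` and `ismember` could be passed in place of `ExactSummary`; with a real Bloom filter the only behavioural difference would be an occasional wasted database access on a false positive.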


Bloom filters: insertion and set membership

insert(k)

Require: k: object key
for j : 1 . . . t do
    i = h_j(k)
    if B[i] == 0 then
        B[i] = 1
    end if
end for

ismember(k)

Require: k: object key
member = true; j = 1
while member == true && j <= t do
    i = h_j(k)
    if B[i] == 0 then
        member = false
    end if
    j = j + 1
end while
return member

Figure: Bloom filter: insertion and set membership (S is implicit)

Initially, B[i] = 0 for every i

B is a compact summary of keys of elements in S
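A minimal Python sketch of the two procedures follows. The slides do not say how the t hash functions are obtained; deriving them from SHA-256 with a per-function prefix is an assumption made here for illustration, as are the class and method names. One byte per position is used instead of a packed bit array, purely for readability.

```python
import hashlib


class BloomFilter:
    """Bloom filter with m positions and t hash functions (illustrative sketch)."""

    def __init__(self, m: int, t: int) -> None:
        self.m = m
        self.t = t
        self.bits = bytearray(m)          # B[i] = 0 for every i initially

    def _positions(self, key: str):
        # Assumption: h_j(k) is simulated by hashing "j:key" with SHA-256.
        for j in range(self.t):
            digest = hashlib.sha256(f"{j}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, key: str) -> None:
        for i in self._positions(key):
            self.bits[i] = 1

    def ismember(self, key: str) -> bool:
        # all() stops at the first 0, like the while loop in the pseudocode
        return all(self.bits[i] for i in self._positions(key))


bf = BloomFilter(m=1009, t=5)             # 1009 is prime; values are arbitrary
words = ["alpha", "beta", "gamma"]
for w in words:
    bf.insert(w)
```

Every inserted key is reported as a member; a key that was never inserted is usually, but not always, rejected.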


False positives

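- No false negatives but...
- Assume h1(k) = 2k + 1 mod 5, h2(k) = k + 2 mod 5
- ismember(4) returns true → false positive

Figure: Bloom filter with t = 2 and m = 5 after insertion of keys (5, 2, 3). Key 5 sets positions 1 and 2, key 2 sets positions 0 and 4, key 3 sets positions 2 and 0; the resulting array is B = (1, 1, 1, 0, 1) for positions 0 to 4.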
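The example can be replayed directly. The sketch below assumes the second hash function is meant to be h2(k) = k + 2 mod 5; the variable names are illustrative.

```python
M = 5                                    # size of the array B


def h1(k: int) -> int:
    return (2 * k + 1) % M


def h2(k: int) -> int:
    return (k + 2) % M                   # assumed reading of the slide


B = [0] * M
for key in (5, 2, 3):                    # the three inserted keys
    for h in (h1, h2):
        B[h(key)] = 1


def ismember(k: int) -> bool:
    return all(B[h(k)] == 1 for h in (h1, h2))


print(B)              # [1, 1, 1, 0, 1]
print(ismember(4))    # True, although 4 was never inserted
print(ismember(1))    # False, because h1(1) = 3 and B[3] = 0
```

Key 4 maps to positions 4 and 1, which were set by keys 2 and 5 respectively, so the filter cannot tell it apart from an inserted key.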



The mathematics of Bloom filters

Having false positives means that we might access the database even if it contains no element with the searched key

Can be acceptable if P[false positive] is small

Probability of false positives

Assume n elements in the Bloom filter

Assume every hj(·) is “ideal”, i.e., it hashes every item uniformly at random and independently of the others (for the sake of the analysis)

Consider ismember(k), with obj(k) ∉ S

What is P[ismember(k) == true]?

“Small” if m large enough


Fraction of 0’s

Assume ideal h(·)’s

Assume that, after n insertions, fraction of 0’s in B is p

Consider k ∉ B (no object with key k was inserted):

$P[\text{ismember}(k) == \text{true}] = (1 - p)^t$

The fraction of 0’s determines the probability of a false positive

p is itself a random variable that depends on t and m


Fraction of 0’s cont.

The $B_i$’s are random variables that depend on the input and the hash functions

After n insertions we have:

$$P[B_i = 0] = \left(1 - \frac{1}{m}\right)^{tn}$$

$$E[p] = \frac{1}{m} \sum_{i=0}^{m-1} P[B_i = 0] = \left(1 - \frac{1}{m}\right)^{tn} \approx e^{-tn/m}$$

If X = number of 0’s then X = mp and E[X] = m E[p]

Theorem ([Mitzenmacher, 2002])

Let X denote the number of 0’s in the Bloom filter after n insertions. Then

$$P[\,|p - E[p]| > \varepsilon\,] = P[\,|X - m E[p]| > \varepsilon m\,] \le 2 e^{-2 \varepsilon^2 m^2 / (tn)}$$
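A small simulation can make these quantities concrete. The sketch below models the ideal hash functions with a pseudo-random generator, which is exactly the idealisation used in the analysis and not how a real filter hashes keys; the parameter values, the seed and the choice ε = 0.05 are arbitrary.

```python
import math
import random


def zero_fraction(n: int, m: int, t: int, rng: random.Random) -> float:
    """Fraction p of 0's after n insertions with t ideal hash functions."""
    bits = [0] * m
    for _ in range(n * t):            # each hash value is uniform on [m]
        bits[rng.randrange(m)] = 1
    return bits.count(0) / m


n, m, t = 1000, 8000, 5               # illustrative values
rng = random.Random(0)                # fixed seed for repeatability
samples = [zero_fraction(n, m, t, rng) for _ in range(200)]

mean_p = sum(samples) / len(samples)
exact = (1 - 1 / m) ** (t * n)        # E[p]
approx = math.exp(-t * n / m)         # e^{-tn/m}
worst = max(abs(p - exact) for p in samples)
eps = 0.05
bound = 2 * math.exp(-2 * eps**2 * m**2 / (t * n))

print(f"mean p = {mean_p:.4f}, E[p] = {exact:.4f}, e^(-tn/m) = {approx:.4f}")
print(f"largest |p - E[p]| over the runs = {worst:.4f}")
print(f"theorem bound for eps = {eps}: {bound:.3g}")
```

The observed fractions should cluster tightly around E[p], in line with the concentration result stated in the theorem.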


Fraction of 0’s cont.

The Bi ’s are not statistically independent (why?)

Proof uses an extension of Chernoff bounds

Remarks

Note that p is very close to E[p] with high probability.

Example: if $m \ge 17\sqrt{nt}$, then $p \in [0.9\,E[p],\ 1.1\,E[p]]$ with probability at least 99% → verify

In practice (see further) the condition above, or a similar one, is easy to satisfy

In the rest of this section we assume that $p \approx E[p] \approx e^{-tn/m}$ deterministically

This can be made rigorous at the cost of some complication in the analysis


Choice of m and t

We have seen that with good approximation:

$$P[\text{ismember}(k) == \text{true}] = (1 - p)^t \approx \left(1 - e^{-tn/m}\right)^t$$

We can play with parameters m (size of Bloom filter) and t (number of hash functions)

In the remainder of the analysis, we fix m and minimize the expression $f(t) = (1 - e^{-tn/m})^t$ w.r.t. t (n is given, m is fixed)

We next take $g(t) = \ln f(t) = t \ln(1 - e^{-tn/m})$. Minimizing f(t) is equivalent to minimizing g(t), but the latter is easier


Choice of m and t cont.

We have:

$$\frac{dg}{dt} = \ln\left(1 - e^{-tn/m}\right) + \frac{tn}{m} \cdot \frac{e^{-tn/m}}{1 - e^{-tn/m}}$$

The derivative is 0 when $t = \frac{m}{n} \ln 2$, and this is a global minimum

With this choice: $P[\text{ismember}(k) == \text{true}] \approx f(t) = \frac{1}{2^t} \approx (0.6185)^{m/n}$

Of course, the number t of hash functions has to be an integer
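As a quick check of the stationary point, one can substitute $t^* = \frac{m}{n} \ln 2$ into the derivative and into f. The symbol $t^*$ is introduced here only for this check; the working below verifies that the derivative vanishes and recovers the constant 0.6185, but does not by itself show that the point is a global minimum.

```latex
\begin{align*}
e^{-t^* n/m} &= e^{-\ln 2} = \tfrac{1}{2}, \\
\left.\frac{dg}{dt}\right|_{t = t^*}
  &= \ln\left(1 - \tfrac{1}{2}\right) + \ln 2 \cdot \frac{1/2}{1 - 1/2}
   = -\ln 2 + \ln 2 = 0, \\
f(t^*) &= \left(1 - \tfrac{1}{2}\right)^{t^*} = 2^{-(m/n)\ln 2}
        = \left(2^{-\ln 2}\right)^{m/n} \approx (0.6185)^{m/n}.
\end{align*}
```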


Recap

n is given

For any given m, $t = \frac{m}{n} \ln 2$ ideally, $\left\lceil \frac{m}{n} \ln 2 \right\rceil$ or $\left\lfloor \frac{m}{n} \ln 2 \right\rfloor$ in practice

Bloom filters are highly effective if m = cn, with c a small constant

Example: c = 8, t = 5 or 6 → false positive probability ≈ 0.02

Fixing m: in practice, choose a value a few times higher than the maximum predictable size of your database
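The numbers in the example can be reproduced with the approximation $(1 - e^{-tn/m})^t$. The function names below are illustrative, and n = 1000 is an arbitrary choice, since only the ratio c = m/n matters in this formula.

```python
import math


def false_positive(n: int, m: int, t: int) -> float:
    """Approximate false positive probability (1 - e^{-tn/m})^t."""
    return (1.0 - math.exp(-t * n / m)) ** t


def ideal_t(n: int, m: int) -> float:
    """Real-valued minimiser t = (m/n) ln 2."""
    return (m / n) * math.log(2)


n = 1000                      # arbitrary; only c = m/n matters here
m = 8 * n                     # c = 8
t_star = ideal_t(n, m)
candidates = (math.floor(t_star), math.ceil(t_star))
rates = {t: false_positive(n, m, t) for t in candidates}

for t, rate in rates.items():
    print(f"t = {t}: false positive probability ~ {rate:.4f}")
```

For c = 8 the ideal value of t lies between 5 and 6, and both integer choices give a probability close to 0.02, so the rounding direction makes little difference here.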


Recap cont.

Assume database with n = $10^6$ documents, keys are document digests of size 1 Kbit each → about 125 MBytes for the keys alone

A retrieve operation can be very expensive; caching can only partly mitigate this

Using m = 8n, we have a 1 MB Bloom filter that occupies only a small fraction of main memory

Still missing...

Deletions

Can be implemented at the expense of a moderate increase in memory


Handling deletions

Substitute binary array with counter array (counting Bloom filter)

Figure: Counting Bloom filter with t = 2 and m = 5 after insertion of keys (5, 2, 3), with counter values C = (2, 1, 2, 0, 1) for positions 0 to 4.


Counting Bloom filters: insertion and deletion

insert(k)

Require: k: object key
for j : 1 . . . t do
    i = h_j(k)
    C[i] = C[i] + 1
end for

delete(k)

Require: k: object key
if ismember(k) then
    for j : 1 . . . t do
        i = h_j(k)
        C[i] = C[i] − 1
    end for
end if

Figure: Counting Bloom filters: insertion and deletion (S is implicit)

Possible to prove that 4 bits per counter suffice for most applications [Broder and Mitzenmacher, 2004]

ismember(k) unchanged
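The counting variant can be sketched as follows. The class name is illustrative, the hash functions are passed in explicitly so that the small example with m = 5 can be reused (taking h2(k) = k + 2 mod 5 as an assumption), and the counters are unbounded Python integers, so the 4-bit limit and its overflow handling are not modelled.

```python
from typing import Callable, List, Sequence


class CountingBloomFilter:
    """Counting Bloom filter over integer keys (illustrative sketch)."""

    def __init__(self, m: int, hashes: Sequence[Callable[[int], int]]) -> None:
        self.counters: List[int] = [0] * m
        self.hashes = list(hashes)

    def insert(self, key: int) -> None:
        for h in self.hashes:
            self.counters[h(key)] += 1

    def ismember(self, key: int) -> bool:
        # Same test as before: every position must be non-zero
        return all(self.counters[h(key)] > 0 for h in self.hashes)

    def delete(self, key: int) -> None:
        if self.ismember(key):            # guard from the pseudocode
            for h in self.hashes:
                self.counters[h(key)] -= 1


cbf = CountingBloomFilter(5, [lambda k: (2 * k + 1) % 5, lambda k: (k + 2) % 5])
for key in (5, 2, 3):
    cbf.insert(key)
after_inserts = list(cbf.counters)
print(after_inserts)          # [2, 1, 2, 0, 1]

cbf.delete(5)
print(cbf.counters)           # [2, 0, 1, 0, 1]
print(cbf.ismember(5))        # False
print(cbf.ismember(2), cbf.ismember(3))   # True True
```

Deleting key 5 leaves keys 2 and 3 recognisable because the shared position 2 keeps a non-zero count. Note that the `ismember` guard does not protect against deleting a key that is only a false positive; doing so would decrement counters that belong to other keys.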


Applications

[Broder and Mitzenmacher, 2004]

Database maintenance (since the early 80’s)

Cooperative distributed caching (see also [Fan et al., 2000])

P2P/Overlay networks

Resource routing

Packet routing


Summary cache [Fan et al., 2000]

Internet Cache Protocol (ICP)

Proxies cooperate


Summary cache cont.

On a cache miss, a proxy contacts its neighbour proxies instead of requesting the page from the Web server

ICP traffic can cause great overhead even for few proxies

Idea

Each proxy stores a (counting) Bloom filter of every other proxy’s contents

Keys are the URLs

On a cache miss:

1 Check locally stored Bloom filters for key membership

2 Contact a proxy whose relevant Bloom filter is positive for the key
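The two steps on a cache miss can be sketched as below. This is a simplified illustration and not the protocol of [Fan et al., 2000]: plain (non-counting) filters are used, so cache evictions are not handled; the hash construction, the proxy names, the URLs and the in-process dictionary standing in for remote caches are all assumptions.

```python
import hashlib
from typing import Dict, List, Optional, Set


class Summary:
    """Small Bloom filter summarising the URLs cached by one proxy."""

    def __init__(self, m: int = 1024, t: int = 4) -> None:
        self.m, self.t = m, t
        self.bits = bytearray(m)

    def _positions(self, url: str):
        # Assumption: t hash functions simulated with prefixed SHA-256
        for j in range(self.t):
            digest = hashlib.sha256(f"{j}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, url: str) -> None:
        for i in self._positions(url):
            self.bits[i] = 1

    def ismember(self, url: str) -> bool:
        return all(self.bits[i] for i in self._positions(url))


def candidate_proxies(url: str, summaries: Dict[str, Summary]) -> List[str]:
    """Step 1: consult the locally stored summaries only."""
    return [name for name, s in summaries.items() if s.ismember(url)]


def fetch_from_neighbours(url: str, summaries: Dict[str, Summary],
                          caches: Dict[str, Set[str]]) -> Optional[str]:
    """Step 2: contact only the proxies whose summary is positive."""
    for name in candidate_proxies(url, summaries):
        if url in caches[name]:          # simulated remote lookup
            return name
    return None                          # fall back to the Web server


# Hypothetical neighbour caches and the summaries built from them
caches = {"proxy-a": {"http://example.org/a"}, "proxy-b": {"http://example.org/b"}}
summaries: Dict[str, Summary] = {}
for name, urls in caches.items():
    summaries[name] = Summary()
    for url in urls:
        summaries[name].insert(url)

source = fetch_from_neighbours("http://example.org/a", summaries, caches)
```

A false positive in a summary costs one unnecessary message to a neighbour, while a negative answer from every summary avoids neighbour traffic altogether.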


Questions

Q1

Consider two dictionaries over the same universe of objects (and therefore keys)

Describe how and why Bloom filters make it easy to construct a compact summary of their union

Q2

Dictionary in secondary storage with n items, no insertions/deletions

retrieve(k) costs ∆ (time to access disk)

Access to main memory negligible

70% of requested items not in dictionary

Let T be response time

Design a Bloom filter such that $\Delta / E[T] \ge 2$, i.e., a 100% speed-up


Example: spell-checker

Text editor spell-checker

Must quickly report spelling mistakes to user

Thesaurus contains $10^5$ terms

Average term length: 10 bytes

Design a Bloom filter that performs spell-checking with probability of error 0.01

Solution

Impose that $(0.6185)^{m/n} \le 0.01$ → $\frac{m}{n} \ge 9.59$

$t = \frac{m}{n} \ln 2 \approx 6.65$

We can use a Bloom filter of size ≈ 1 Mbit using 7 hash functions

Note that storing all words requires 1 MByte + data structure
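The sizing steps of the solution can be written as a short calculation. The function name is illustrative; the sketch uses the exact constant $2^{-\ln 2}$ rather than the rounded 0.6185, so the bits-per-key figure comes out marginally below the 9.59 quoted above.

```python
import math


def size_filter(n: int, error: float):
    """Return (bits per key, total bits m, number of hash functions t)."""
    # (2^{-ln 2})^{m/n} <= error  <=>  m/n >= -ln(error) / (ln 2)^2
    bits_per_key = -math.log(error) / (math.log(2) ** 2)
    m = math.ceil(bits_per_key * n)
    t = round(bits_per_key * math.log(2))    # nearest integer to (m/n) ln 2
    return bits_per_key, m, t


bits_per_key, m, t = size_filter(10**5, 0.01)
raw_bytes = 10**5 * 10                        # 10 bytes per term on average

print(f"m/n ~ {bits_per_key:.2f} bits per key")
print(f"m ~ {m / 1e6:.2f} Mbit, t = {t}")
print(f"raw word list: {raw_bytes / 1e6:.0f} MByte = {raw_bytes * 8 / 1e6:.0f} Mbit")
```

Under these assumptions the filter needs a little under 1 Mbit, against roughly 8 Mbit for the raw word list before any data structure overhead.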



Broder, A. and Mitzenmacher, M. (2004). Network applications of Bloom filters: A survey. In Internet Mathematics, A K Peters, Ltd., volume 1.

Fan, L., Cao, P., Almeida, J., and Broder, A. Z. (2000). Summary cache: a scalable wide-area Web cache sharing protocol. IEEE/ACM Transactions on Networking, 8(3):281–293.

Mitzenmacher, M. (2002). Compressed Bloom filters. IEEE/ACM Transactions on Networking, 10(5):604–612.