The Bloom Paradox
Ori Rottenstreich
Joint work with
Yossi Kanizo and Isaac Keslassy
Technion, Israel
• Requirement: A data structure at the user with a fast answer to the query: is a given element in the local cache S?
• Solutions:
o O(n): searching in a list
o O(log(n)): searching in a sorted list
o O(1): but with false positives / negatives
Problem Definition
[Figure: a user queries elements such as x and y. Accessing the local cache S costs 1; accessing the central memory M, which holds all elements, costs 10.]
• False Positive: the element is not in the local cache S, but the data structure answers that it is.
o Results in a redundant access to the local cache: additional cost of 1.
• False Negative: the element is in the local cache S, but the data structure answers that it is not.
o Results in an expensive access to the central memory instead of the local cache: additional cost of 10 - 1 = 9.
Two Possible Errors
• Initialization: An array of bits, all set to zero.
• Insertion: Each element is hashed a fixed number of times, and the corresponding bits are set.
• Query: The element is hashed in the same way, and we check that all the corresponding bits are set.
• False positive rate (probability): nonzero.
• No false negatives.
Bloom Filters (Bloom, 1970)
[Figure: a 12-bit array, initially all zeros, shown during insertions and queries of the elements x, y, z, and w.]
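The three operations on this slide can be sketched in Python as follows. This is only an illustrative sketch: the class name, the parameter names, and the use of salted SHA-256 as the hash family are assumptions, not part of the talk.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array plus several hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # initialization: all bits are zero

    def _positions(self, item):
        # Assumed hash family: SHA-256 salted with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def insert(self, item):
        # Set every bit the element hashes to.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item):
        # True if all hashed bits are set (may be a false positive).
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(num_bits=12, num_hashes=2)
bf.insert("x")
bf.insert("y")
print(bf.query("x"), bf.query("y"))  # inserted elements are always reported
```

An inserted element is always reported, which is the "no false negatives" property; an element that was never inserted may still be reported if other insertions happened to set all of its bits.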
• Cache/Memory Framework
• Packet Classification
• Intrusion Detection
• Routing
• Accounting
• Beyond networking: Spell Checking, DNA Classification
• Can be found in:
o Google's web browser Chrome
o Google's database system BigTable
o Facebook's distributed storage system Cassandra
o Mellanox's IB Switch System
Bloom Filters are Widely Used
Outline
Introduction to Bloom Filters
The Bloom Paradox
The Variable-Increment Counting Bloom Filter
The Bloom Paradox
Sometimes, it is better to disregard the Bloom filter results, and in fact not to even query it,
thus making the Bloom filter useless.
• Parameters:
• Extreme case without locality: All elements have an equal probability of belonging to the cache.
o Toy example
Example
• Parameters:
• Let B be the set of elements that the Bloom filter indicates are in S.
o In particular, no false negatives → S ⊆ B
• Intuition:
[Figure: the central memory M with all elements (cost = 10), the local cache S (cost = 1), and the set B reported by the Bloom filter at the user.]
The Bloom Paradox
• Surprise: The Bloom filter indicates the membership of many more elements than the cache contains. Only a small fraction of them are indeed in S.
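• In this example, when the Bloom filter states that an element is in S, it is wrong with high probability.
• Average cost if we listen to the Bloom filter: 1 for the local cache access, plus 10 for the central memory access whenever the filter is wrong.
• Average cost if we don’t: 10, for a direct access to the central memory.
The Bloom filter is useless!
The Bloom Paradox
Don’t listen to the Bloom filter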
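The cost comparison can be worked through numerically. The Python sketch below uses the talk's access costs (1 for the local cache, 10 for the central memory) and assumes, as in the extreme case without locality, that every element is equally likely to be in the cache. The memory size, cache size, and false positive rate are hypothetical values chosen for illustration; they are not the slide's parameters.

```python
def expected_costs(memory_size, cache_size, false_positive_rate,
                   cache_cost=1.0, memory_cost=10.0):
    """Expected access cost for an element the Bloom filter reports as present."""
    # Expected size of B: all cached elements plus the false positives.
    false_positives = false_positive_rate * (memory_size - cache_size)
    positives = cache_size + false_positives
    # Probability that a reported element is really in the cache.
    p_in_cache = cache_size / positives
    # Listening: always pay the cache access, plus the memory access on a miss.
    cost_listen = cache_cost + (1.0 - p_in_cache) * memory_cost
    # Not listening: go straight to the central memory.
    cost_ignore = memory_cost
    return p_in_cache, cost_listen, cost_ignore


# Hypothetical sizes, not the slide's parameters.
p, listen, ignore = expected_costs(10**9, 10**4, 1e-3)
print(f"P(in cache | positive) = {p:.4f}, listen = {listen:.2f}, ignore = {ignore:.2f}")
```

With these assumed sizes the false positives far outnumber the cached elements, so following a positive answer costs more on average than ignoring the filter. Listening only pays off when the probability that a positive answer is correct exceeds 1/10, the ratio of the two access costs.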
Outline
Introduction to Bloom Filters
The Bloom Paradox
The Variable-Increment Counting Bloom Filter
• Bloom filters do not support deletions of elements. Simply resetting bits might cause false negatives.
• The solution: Counting Bloom filters, which store an array of counters instead of bits.
o Insertion: Incrementing counters by one.
o Deletion: Decrementing counters by one.
o Query: Checking that counters are positive.
• The same false positive probability.
• Require too much memory, e.g. 57 bits per element in the slide's example setting.
Counting Bloom Filters (CBFs)
[Figure: counter arrays before and after insertions; inserting x or y adds +1 to each of its hashed counters.]
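A counting Bloom filter along the lines of this slide might look like the Python sketch below. The class and parameter names and the salted SHA-256 hash family are assumptions made for illustration, and counters are plain integers rather than fixed-width fields.

```python
import hashlib


class CountingBloomFilter:
    """Minimal counting Bloom filter: counters instead of bits."""

    def __init__(self, num_counters, num_hashes):
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = [0] * num_counters

    def _positions(self, item):
        # Assumed hash family: SHA-256 salted with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_counters

    def insert(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def delete(self, item):
        # Only meaningful for an item that was inserted earlier.
        for pos in self._positions(item):
            self.counters[pos] -= 1

    def query(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(num_counters=12, num_hashes=2)
cbf.insert("x")
cbf.insert("y")
cbf.delete("x")
print(cbf.query("y"))  # y is still reported after x is deleted
```

Because deletion decrements exactly the counters that insertion incremented, removing x cannot zero a counter that y still depends on, which is the failure mode of resetting bits in a plain Bloom filter.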
• Upon query, we should consider the exact values of the counters and not just whether they are positive.
• Can we design a deterministic scheme that exploits the exact values of the counters?
• Idea: Use variable increments to encode the element identity
Intuition for Variable Increments
• Each hash entry contains a pair of counters:
o c1, fixed increments → number of elements in entry (as in CBF)
o c2, variable increments → weighted sum of elements
o weights from a pre-determined set
Architecture
[Figure: an array of hash entries, each holding a pair of counters c1 and c2.]
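• We use two sets of hash functions:
o The first set uses hash functions whose range is the set of entries, i.e. each one points to an entry.
o The second set uses hash functions whose range is the set of variable increments, i.e. each one points to a weight.
• Insertion: At each entry the element hashes to, the two counters are updated as follows.
o c1 is incremented by one.
o c2 is incremented by a weight from the set, selected by the second set of hash functions.
• Example 1: [Figure: inserting x applies the variable increments +4 and +8 at its two entries; inserting z applies +4 and +13.]
Insertion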
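The insertion step can be sketched in Python as below. The weight set is hypothetical: the values 4, 8, and 13 appear in the slide's example, but the remaining values, the class name, and the salted SHA-256 hashing are assumptions made for illustration.

```python
import hashlib

WEIGHTS = [1, 2, 4, 8, 13]  # hypothetical set of variable increments


def _hash(tag, item, modulus):
    # Assumed hash family: SHA-256 salted with a tag naming the function.
    digest = hashlib.sha256(f"{tag}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus


class VariableIncrementCBF:
    """Each entry holds c1 (element count) and c2 (weighted sum)."""

    def __init__(self, num_entries, num_hashes, weights=WEIGHTS):
        self.num_entries = num_entries
        self.num_hashes = num_hashes
        self.weights = list(weights)
        self.c1 = [0] * num_entries
        self.c2 = [0] * num_entries

    def probes(self, item):
        """Yield (entry index, variable increment) for each hash index."""
        for i in range(self.num_hashes):
            entry = _hash(f"entry{i}", item, self.num_entries)
            weight = self.weights[_hash(f"weight{i}", item, len(self.weights))]
            yield entry, weight

    def insert(self, item):
        for entry, weight in self.probes(item):
            self.c1[entry] += 1       # fixed increment
            self.c2[entry] += weight  # variable increment


f = VariableIncrementCBF(num_entries=9, num_hashes=2)
f.insert("x")
f.insert("z")
print(f.c1)
print(f.c2)
```

The first hash family picks the entry and the second picks the weight, so c1 keeps counting elements as in a counting Bloom filter while c2 accumulates information about which elements they were.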
• Query: an element y whose variable increments at its two entries are 4 and 8.
• We ask whether
o 17 can be a sum of 2 elements from the set including 4
o 30 can be a sum of 3 elements from the set including 8
• No: therefore y is not in the set.
• How should we pick the set of variable increments?
Query
We should use Bh sequences!
[Figure: the query for y probes two entries of the (c1, c2) array, with candidate increments 4 and 8.]
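The question asked at each entry (can the weighted sum be formed from the counted number of weights, one of them being the queried element's weight?) can be sketched in Python as a brute-force check. The function name and the weight set are assumptions; only the counter values 17 and 30, the counts 2 and 3, and the increments 4 and 8 come from the slide.

```python
from itertools import combinations_with_replacement


def may_contain(c1, c2, weight, weights):
    """Can c2 be a sum of c1 values from `weights`, one of them being `weight`?"""
    if c1 < 1:
        return False  # an empty entry cannot hold the element
    rest = c2 - weight
    # The other c1 - 1 elements must account for the remaining sum.
    return any(sum(combo) == rest
               for combo in combinations_with_replacement(weights, c1 - 1))


weights = [1, 2, 4, 8, 13]  # hypothetical set of variable increments
probes = [(2, 17, 4), (3, 30, 8)]  # (c1, c2, increment) at each probed entry
possible = all(may_contain(c1, c2, w, weights) for c1, c2, w in probes)
print(possible)
```

The element can be in the set only if the check succeeds at every probed entry; a single failing entry is enough to exclude it. This enumeration is exponential in c1 and is meant only to make the condition concrete.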
• Definition 1: Let a sequence of positive integers be given. It is a Bh sequence iff all the sums of h of its elements, with repetitions allowed, are distinct.
• Example 2: All the sums of elements of the example sequence are distinct. Therefore, it is a Bh sequence.
• Bh sequences are widely used in error-correcting codes.
Bh Sequences
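Definition 1 can be checked directly by enumerating all sums of h elements. The Python sketch below is a brute-force test; the function name and the example sequences are illustrative assumptions and are not taken from the slide's Example 2.

```python
from itertools import combinations_with_replacement


def is_bh_sequence(values, h):
    """Return True iff all sums of h elements (repetition allowed) are distinct."""
    seen = set()
    for combo in combinations_with_replacement(values, h):
        total = sum(combo)
        if total in seen:
            return False  # two different multisets give the same sum
        seen.add(total)
    return True


print(is_bh_sequence([1, 2, 4, 8, 13], 2))
print(is_bh_sequence([1, 2, 3], 2))
```

For h = 2 the sequence 1, 2, 4, 8, 13 passes because all of its pairwise sums differ, while 1, 2, 3 fails because 1 + 3 = 2 + 2.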
The Bh-CBF Scheme: Query
• Example 3: the set of variable increments is a Bh sequence.
o Query x (candidate increments 1 and 4): the Bh-CBF can determine that x is not in the set.
o Query y (candidate increments 8 and 4): here the counter values necessarily determine which increments were added, and the Bh-CBF can determine that y is not in the set.
o Query z (candidate increments 4 and 13): the Bh-CBF cannot exclude that z is in the set.
[Figure: the array of (c1, c2) entries probed by the queries for x, y, and z.]
• Internet trace (equinix-chicago) with real hash functions.
Experimental Results
• The Bloom Paradox
o Discovery of the Bloom paradox
o Importance of the a priori membership probability
• The Variable-Increment Counting Bloom Filter
o Can extend many variants of the counting Bloom filter
o First time Bh sequences are presented in networking applications
Concluding Remarks
Thank You