it probably works - qcon 2015
TRANSCRIPT
![Page 1: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/1.jpg)
It Probably Works
![Page 2: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/2.jpg)
Tyler McMullenCTO of Fastly@tbmcmullen
![Page 3: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/3.jpg)
FastlyWe’re an awesome CDN.
![Page 4: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/4.jpg)
What is a probabilistic algorithm?
![Page 5: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/5.jpg)
Why bother?
![Page 6: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/6.jpg)
–Hal Abelson and Gerald J. Sussman, SICP
“In testing primality of very large numbers chosen at random, the chance of stumbling upon a value that
fools the Fermat test is less than the chance that cosmic radiation will cause the computer to make
an error in carrying out a 'correct' algorithm. Considering an algorithm to be inadequate for the first reason but not for the second illustrates the
difference between mathematics and engineering.”
![Page 7: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/7.jpg)
Everything is probabilistic
![Page 8: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/8.jpg)
Probabilistic algorithms are not “guessing”
![Page 9: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/9.jpg)
Provably bounded error rates
![Page 10: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/10.jpg)
Don’t use them if someone’s life depends
on it.
![Page 11: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/11.jpg)
Don’t use them if someone’s life depends
on it.
![Page 12: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/12.jpg)
When should I use these?
![Page 13: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/13.jpg)
Ok, now what?
![Page 14: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/14.jpg)
The Count-distinct Problem
![Page 15: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/15.jpg)
The Problem
How many unique words are in a large corpus of text?
![Page 16: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/16.jpg)
The Problem
How many different users visited a popular website in a day?
![Page 17: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/17.jpg)
The Problem
How many unique IPs have connected to a server over the last hour?
![Page 18: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/18.jpg)
The Problem
How many unique URLs have been requested through an HTTP proxy?
![Page 19: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/19.jpg)
The Problem
How many unique URLs have been requested through an entire network of HTTP proxies?
![Page 20: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/20.jpg)
166.208.249.236 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /product/982" 200 20246 103.138.203.165 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /article/490" 200 29870 191.141.247.227 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "HEAD /page/1" 200 20409 150.255.232.154 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /page/1" 200 42999 191.141.247.227 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /article/490" 200 25080 150.255.232.154 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /listing/567" 200 33617 103.138.203.165 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "HEAD /listing/567" 200 29618 191.141.247.227 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "HEAD /page/1" 200 30265 166.208.249.236 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /page/1" 200 5683 244.210.202.222 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "HEAD /article/490" 200 47124 103.138.203.165 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "HEAD /listing/567" 200 48734 103.138.203.165 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /listing/567" 200 27392 191.141.247.227 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /listing/567" 200 15705 150.255.232.154 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /page/1" 200 22587 244.210.202.222 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "HEAD /product/982" 200 30063 244.210.202.222 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /page/1" 200 6041 166.208.249.236 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /product/982" 200 25783 191.141.247.227 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /article/490" 200 1099 244.210.202.222 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /product/982" 200 31494 191.141.247.227 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /listing/567" 200 30389 150.255.232.154 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /article/490" 200 10251 191.141.247.227 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /product/982" 200 19384 150.255.232.154 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "HEAD /product/982" 200 24062 244.210.202.222 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /article/490" 200 19070 191.141.247.227 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /page/648" 200 45159 191.141.247.227 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "HEAD /page/648" 200 5576 166.208.249.236 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /page/648" 200 41869 166.208.249.236 -‐ -‐ [25/Feb/2015:07:20:13 +0000] "GET /listing/567" 200 42414
![Page 21: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/21.jpg)
def count_distinct(stream): seen = set() for item in stream: seen.add(item) return len(seen)
![Page 22: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/22.jpg)
Scale.
![Page 23: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/23.jpg)
Count-distinct across thousands of servers.
![Page 24: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/24.jpg)
def combined_cardinality(seen_sets): combined = set() for seen in seen_sets: combined |= seen return len(combined)
![Page 25: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/25.jpg)
The set grows linearly.
![Page 26: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/26.jpg)
Precision comes at a cost.
![Page 27: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/27.jpg)
![Page 28: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/28.jpg)
![Page 29: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/29.jpg)
“example.com/user/page/1”
![Page 30: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/30.jpg)
1 0 1 1 0 1 0 1
“example.com/user/page/1”
![Page 31: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/31.jpg)
1 0 1 1 0 1 0 1
![Page 32: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/32.jpg)
1 0 1 1 0 1 0 1
P(bit 0 is set) = 0.5
![Page 33: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/33.jpg)
1 0 1 1 0 1 0 1
P(bit 0 is set & bit 1 is set) = 0.25
![Page 34: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/34.jpg)
1 0 1 1 0 1 0 1
P(bit 0 is set & bit 1 is set & bit 2 is set) = 0.125
![Page 35: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/35.jpg)
1 0 1 1 0 1 0 1
P(bit 0 is set) = 0.5 Expected trials = 2
![Page 36: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/36.jpg)
1 0 1 1 0 1 0 1
P(bit 0 is set & bit 1 is set) = 0.25 Expected trials = 4
![Page 37: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/37.jpg)
1 0 1 1 0 1 0 1
P(bit 0 is set & bit 1 is set & bit 2 is set) = 0.125 Expected trials = 8
![Page 38: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/38.jpg)
We expect the maximum number of leading zeros we have seen + 1 to approximate
log2(unique items).
![Page 39: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/39.jpg)
Improve the accuracy of the estimate by
partitioning the input data.
![Page 40: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/40.jpg)
class LogLog(object): def __init__(self, k): self.k = k self.m = 2 ** k self.M = np.zeros(self.m, dtype=np.int) self.alpha = Alpha[k] def insert(self, token): y = hash_fn(token) j = y >> (hash_len -‐ self.k) remaining = y & ((1 << (hash_len -‐ self.k)) -‐ 1) first_set_bit = (64 -‐ self.k) -‐ int(math.log(remaining, 2)) self.M[j] = max(self.M[j], first_set_bit) def cardinality(self): return self.alpha * 2 ** np.mean(self.M)
![Page 41: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/41.jpg)
Unions of HyperLogLogs
![Page 42: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/42.jpg)
HyperLogLog
• Adding an item: O(1)
• Retrieving cardinality: O(1)
• Space: O(log log n)
• Error rate: 2%
![Page 43: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/43.jpg)
HyperLogLog
For 100 million unique items, and an error rate of 2%,
the size of the HyperLogLog is...
1,500 bytes
![Page 44: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/44.jpg)
Reliable Broadcast
![Page 45: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/45.jpg)
The Problem
Reliably broadcast “purge” messages across the world as quickly as possible.
![Page 46: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/46.jpg)
Single source of truth
![Page 47: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/47.jpg)
Single source of failures
![Page 48: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/48.jpg)
Atomic broadcast
![Page 49: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/49.jpg)
Reliable broadcast
![Page 50: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/50.jpg)
![Page 51: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/51.jpg)
![Page 52: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/52.jpg)
Gossip Protocols
![Page 53: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/53.jpg)
![Page 54: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/54.jpg)
![Page 55: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/55.jpg)
“Designed for Scale”
![Page 56: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/56.jpg)
Probabilistic Guarantees
![Page 57: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/57.jpg)
![Page 58: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/58.jpg)
Bimodal Multicast
• Quickly broadcast message to all servers
• Gossip to recover lost messages
![Page 59: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/59.jpg)
![Page 60: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/60.jpg)
![Page 61: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/61.jpg)
![Page 62: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/62.jpg)
![Page 64: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/64.jpg)
![Page 65: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/65.jpg)
One ProblemComputers have limited space
![Page 66: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/66.jpg)
Throw away messages
![Page 67: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/67.jpg)
![Page 68: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/68.jpg)
![Page 70: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/70.jpg)
“with high probability” is fine
![Page 71: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/71.jpg)
Real World
![Page 72: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/72.jpg)
End-to-End Latency
42ms
74ms
83ms
133ms
New York
London
San Jose
Tokyo
0.00
0.05
0.10
0.00
0.05
0.10
0.00
0.05
0.10
0.00
0.05
0.10
0 50 100 150Latency (ms)
Den
sity
Density plot and 95th percentile of purge latency by server location
![Page 73: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/73.jpg)
End-to-End Latency42ms
74ms
83ms
133ms
New York
London
San Jose
Tokyo
0.00
0.05
0.10
0.00
0.05
0.10
0.00
0.05
0.10
0.00
0.05
0.10
0 50 100 150Latency (ms)
Den
sity
Density plot and 95th percentile of purge latency by server location
![Page 74: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/74.jpg)
Packet Loss
![Page 75: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/75.jpg)
Good systems are boring
![Page 76: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/76.jpg)
What was the point again?
![Page 77: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/77.jpg)
We can build things that are otherwise
unrealistic
![Page 78: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/78.jpg)
We can build systems that are more
reliable
![Page 79: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/79.jpg)
You’re already using them.
![Page 80: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/80.jpg)
We’re hiring!
![Page 81: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/81.jpg)
Thanks
@tbmcmullen
![Page 82: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/82.jpg)
What even is this?
![Page 83: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/83.jpg)
Probabilistic Algorithms
![Page 84: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/84.jpg)
Randomized Algorithms
![Page 85: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/85.jpg)
Estimation Algorithms
![Page 86: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/86.jpg)
Probabilistic Algorithms
1. An iota of theory
2. Where are they useful and where are they not?
3. HyperLogLog
4. Locality-sensitive Hashing
5. Bimodal Multicast
![Page 87: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/87.jpg)
“An algorithm that uses randomness to improve its efficiency”
![Page 88: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/88.jpg)
Las Vegas
![Page 89: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/89.jpg)
Monte Carlo
![Page 90: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/90.jpg)
Las Vegasdef find_las_vegas(haystack, needle): length = len(haystack) while True: index = randrange(length) if haystack[index] == needle: return index
![Page 91: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/91.jpg)
Monte Carlodef find_monte_carlo(haystack, needle, k): length = len(haystack) for i in range(k): index = randrange(length) if haystack[index] == needle: return index
![Page 92: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/92.jpg)
– Prabhakar Raghavan (author of Randomized Algorithms)
“For many problems a randomized algorithm is the simplest the
fastest or both.”
![Page 93: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/93.jpg)
Naive Solution
For 100 million unique IPv4 addresses, the size of the hash is...
>400mb
![Page 94: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/94.jpg)
Slightly Less Naive
Add each IP to a bloom filter and keep a counter of the IPs that don’t collide.
![Page 95: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/95.jpg)
Slightly Less Naive
ips_seen = BloomFilter(capacity=expected_size, error_rate=0.03) counter = 0 for line in log_file: ip = extract_ip(line) if items_bloom.add(ip): counter += 1 print "Unique IPs:", counter
![Page 96: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/96.jpg)
Slightly Less Naive
• Adding an IP: O(1)
• Retrieving cardinality: O(1)
• Space: O(n)
• Error rate: 3%
kind of
![Page 97: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/97.jpg)
Slightly Less Naive
For 100 million unique IPv4 addresses, and an error rate of 3%,
the size of the bloom filter is...
87mb
![Page 98: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/98.jpg)
def insert(self, token): # Get hash of token y = hash_fn(token) # Extract `k` most significant bits of `y` j = y >> (hash_len -‐ self.k) # Extract remaining bits of `y` remaining = y & ((1 << (hash_len -‐ self.k)) -‐ 1) # Find "first" set bit of `remaining` first_set_bit = (64 -‐ self.k) -‐ int(math.log(remaining, 2)) # Update `M[j]` to max of `first_set_bit` # and existing value of `M[j]` self.M[j] = max(self.M[j], first_set_bit)
![Page 99: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/99.jpg)
def cardinality(self): # The mean of `M` estimates `log2(n)` with # an additive bias return self.alpha * 2 ** np.mean(self.M)
![Page 100: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/100.jpg)
The Problem
Find documents that are similar to one specific document.
![Page 101: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/101.jpg)
The Problem
Find images that are similar to one specific image.
![Page 102: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/102.jpg)
The Problem
Find graphs that are correlated to one specific graph.
![Page 103: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/103.jpg)
The Problem
Nearest neighbor search.
![Page 104: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/104.jpg)
The Problem
“Find the n closest points in a d-dimensional space.”
![Page 105: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/105.jpg)
The Problem
You have a bunch of things and you want to figure out which ones are similar.
![Page 106: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/106.jpg)
![Page 107: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/107.jpg)
“There has been a lot of recent work on streaming algorithms, i.e. algorithms that produce an output by making one pass (or a few passes) over the data while using a limited amount of storage space and time. To cite a few examples, ...”
{ "there": 1, "has": 1, "been": 1, “a":4, "lot": 1, "of": 2, "recent":1, ... }
![Page 108: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/108.jpg)
• Cosine similarity
• Jaccard similarity
• Euclidian distance
• etc etc etc
![Page 109: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/109.jpg)
Euclidian Distance
![Page 110: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/110.jpg)
Metric space
![Page 111: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/111.jpg)
kd-trees
![Page 112: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/112.jpg)
Curse of Dimensionality
![Page 113: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/113.jpg)
Locality-sensitive hashing
![Page 114: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/114.jpg)
![Page 115: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/115.jpg)
![Page 116: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/116.jpg)
Locality-Sensitive Hashing for Finding Nearest Neighbors - Slaney and Casey
![Page 117: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/117.jpg)
Random Hyperplanes
![Page 118: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/118.jpg)
{ "there": 1, "has": 1, "been": 1, “a":4, "lot": 1, "of": 2, "recent":1, ... }
LSH
0 1 1 1 0 1 0 0 0 0 1 0 1 0 1 0 ...
![Page 119: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/119.jpg)
Cosine Similarity
LSH
![Page 120: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/120.jpg)
Firewall Partition
![Page 121: It Probably Works - QCon 2015](https://reader031.vdocuments.mx/reader031/viewer/2022030311/58eeed0c1a28abe23e8b45a7/html5/thumbnails/121.jpg)
DDoS
• `