hyperloglog project
![Page 1: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/1.jpg)
Count me once, count me fast!
Probabilistic methods in real-time streaming
(Hyperloglog, Bloom filters)
Kendrick Lo
Insight Data Engineering, NYC
Summer 2016
![Page 2: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/2.jpg)
[Figure: real-time viewing data arriving as a stream of records, each with Ad ID, Unique User ID, Time stamp. How many unique user IDs are there?]
![Page 3: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/3.jpg)
[Figure: real-time viewing data arriving as a stream of records, each with Ad ID, Unique User ID, Time stamp. How many unique user IDs are there?]
● bitmap (for exact counting): 13 MB for 100 million uniques
● hyperloglog: 4 KB for billions of uniques
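The sizes on this slide can be sanity-checked with a little arithmetic. The sketch below assumes the bitmap spends one bit per possible user ID and, purely for illustration, that a 4 KB Hyperloglog holds 4096 one-byte registers; the slides do not state either layout. The 1.04/sqrt(m) term is the standard-error formula usually quoted for HyperLogLog with m registers.

```python
import math

def bitmap_megabytes(num_ids):
    """Size of an exact-counting bitmap that uses one bit per possible ID."""
    return num_ids / 8 / 1_000_000

def hll_standard_error(num_registers):
    """Typical relative error of a HyperLogLog with num_registers registers."""
    return 1.04 / math.sqrt(num_registers)

# 100 million IDs at one bit each: 12.5 MB, which the slide rounds to 13 MB.
bitmap_mb = bitmap_megabytes(100_000_000)

# Assumption for illustration only: 4 KB stored as 4096 one-byte registers.
hll_registers = 4096
hll_error = hll_standard_error(hll_registers)

print(f"bitmap: {bitmap_mb} MB")
print(f"hyperloglog: {hll_registers} bytes, standard error about {hll_error:.4f}")
```

Under these assumptions the error comes out near 1.6%, which is in line with the "less than 2%" figure quoted on the next slide, while the bitmap keeps growing with the number of possible IDs.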
![Page 4: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/4.jpg)
Hyperloglog
Count-distinct problem (a.k.a. cardinality estimation problem)
● counting unique elements in a data stream with repeated elements
● calculates an approximate number
○ typical error purported to be less than 2%
What it can’t do:
● give an exact count
● track frequency of occurrence
● confirm whether a certain element was seen
![Page 5: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/5.jpg)
Hyperloglog - a probabilistic method
General Idea: Count leading zeros in a randomly generated binary number
Given a random number, what is the probability of seeing…?
1 x x x x x x x x… → 0.5 (1 out of every 2)
0 1 x x x x x x x… → 0.25 (1 out of every 4)
0 0 1 x x x x x x… → 0.125 (1 out of every 8)
…
0 0 0 0 0 0 1 x x… → 0.008 (1 out of every 128)
...
![Page 6: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/6.jpg)
Hyperloglog - a probabilistic method
General Idea: Count leading zeros in a randomly generated binary number
Given a random number, what is the probability of seeing…?
1 x x x x x x x x… → 0.5 (1 out of every 2)
0 1 x x x x x x x… → 0.25 (1 out of every 4)
0 0 1 x x x x x x… → 0.125 (1 out of every 8)
…
0 0 0 0 0 0 1 x x… → 0.008 (1 out of every 128)
...
Question:
I have a list of N unique numbers. The one with the longest string of leading zeros is
0 0 0 0 0 0 1 x x…
What is N?
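The question on this slide can be worked through in a few lines. This is only a sketch of the intuition: a pattern that shows up once in every 128 random numbers suggests that roughly 128 distinct numbers were seen. The function names are made up for illustration.

```python
def probability_of_pattern(leading_zeros):
    """Chance that a random binary number starts with exactly
    `leading_zeros` zeros followed by a 1."""
    return 0.5 ** (leading_zeros + 1)

def estimate_n(max_leading_zeros):
    """Rough guess at the number of distinct values seen:
    one over the probability of the rarest pattern observed."""
    return round(1 / probability_of_pattern(max_leading_zeros))

# Reproduce the rows of the table: 0.5, 0.25, 0.125, ..., 0.0078125
for zeros in (0, 1, 2, 6):
    print(zeros, probability_of_pattern(zeros))

# Longest run observed is six zeros, so N is estimated as 128.
print(estimate_n(6))
```

A single maximum is a very noisy estimator, which is why the method goes on to average many such observations.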
![Page 7: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/7.jpg)
Hyperloglog
[Figure: incoming IDs feed a row of leading-zero counts]
5 6 7 4 6 8 ... → (harmonic) MEAN: 6 => 128 unique viewers
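A minimal Hyperloglog along these lines might look like the sketch below. It is a generic textbook-style version, not the Algebird implementation used in the project. The slide simplifies the averaging step: the usual estimator takes a harmonic mean of 2^count across registers and applies a bias-correction constant, and that is what this sketch does. The class name, the SHA-1 hash and the choice of 2^12 registers are all assumptions made for illustration.

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog; p is the number of hash bits used to pick a register."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                    # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant; this closed form is valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")       # 64-bit hash value
        index = x >> (64 - self.p)                  # first p bits pick the register
        rest = x & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # rank = number of leading zeros in `rest`, plus one
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self):
        # Harmonic-mean style estimate over all registers.
        inverse_sum = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / inverse_sum
        # Small-range correction (linear counting) when many registers are empty.
        empty = self.registers.count(0)
        if estimate <= 2.5 * self.m and empty:
            estimate = self.m * math.log(self.m / empty)
        return estimate

hll = HyperLogLog(p=12)
for repeat in range(2):                 # every ID is seen twice
    for i in range(50_000):
        hll.add(f"user-{i}")
print(round(hll.count()))               # close to 50000, not 100000
```

Repeated IDs hash to the same register and the same rank, so duplicates never change the state; only the fixed-size register array is kept in memory.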
![Page 8: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/8.jpg)
Pipeline
Record fields: Ad ID, Unique User ID, Gender, Age segments, Time stamp
● Algebird
● 4 x m4.large
● 1 sec mini-batches
● Pushed 1 billion records with unique user IDs
![Page 9: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/9.jpg)
● Throughput can reach an average of 5M records/min
● Streams of <1M records processed within a minute
![Page 10: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/10.jpg)
![Page 11: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/11.jpg)
● Beyond 1M uniques, delays accumulate when using sets, causing system instability
![Page 12: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/12.jpg)
Extension: counting unique viewers in a subgroup
● Associating segments with user IDs
○ Challenge: Can we avoid database accesses when processing data in real-time?
○ Bloom filter: another fixed-size probabilistic data structure that trades off (tunable) accuracy for size
   e.g. Bloom filter + Hyperloglog count males error: 1.2%
○ needed to overcome challenges in combining aspects of Spark (batch) and Spark Streaming
Sample record: Ad ID | Unique User ID | Gender | Age segment (e.g. 18-34) | Time stamp
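To make the Bloom filter idea concrete, here is a minimal sketch of a fixed-size membership filter that could hold one segment (say, the user IDs known to be male) in memory, so that a streaming job can test incoming user IDs without a database lookup. It is not the implementation used in the project; the class, the double-hashing scheme built on SHA-256, and the sizes are assumptions chosen for illustration.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter with a fixed number of bits and hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Hypothetical segment data: 10,000 user IDs known to belong to the segment.
segment = BloomFilter(num_bits=80_000, num_hashes=5)
for i in range(10_000):
    segment.add(f"user-{i}")

print("user-42" in segment)        # True: members are never missed
```

A filter can return false positives but never false negatives, so a subgroup count built this way can only overcount slightly; making the filter larger lowers that error at the cost of memory.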
![Page 13: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/13.jpg)
About me
Master of Science, Harvard University Computational Science and Engineering (graduated May 2016)
J.D. / MBA, University of Toronto
Bachelor of Applied Science, University of Toronto Engineering Science (Computer)
![Page 14: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/14.jpg)
Thank you for listening!
![Page 15: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/15.jpg)
appendix
![Page 16: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/16.jpg)
[Set structures]
![Page 17: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/17.jpg)
[HLL structures]
![Page 18: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/18.jpg)
Results: error rate in counts
● Error < 2% for subgroups; slightly higher for main group
● Error for intersection calculation (purple) tends to be higher on average
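One plausible reason for the higher intersection error: a common way to get an intersection from Hyperloglog counts is inclusion-exclusion, |A ∩ B| = |A| + |B| - |A ∪ B|, and small errors in three large estimates add up against a smaller result. The slides do not say which method was used, and the numbers below are hypothetical, chosen only to show the effect.

```python
def intersection_estimate(count_a, count_b, count_union):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return count_a + count_b - count_union

# Hypothetical true sizes: |A| = 500,000, |B| = 400,000, |A or B| = 800,000,
# so the true intersection is 100,000.
true_intersection = intersection_estimate(500_000, 400_000, 800_000)

# Hypothetical estimates, each off by only 1% (A and B high, the union low).
estimated = intersection_estimate(505_000, 404_000, 792_000)

relative_error = abs(estimated - true_intersection) / true_intersection
print(true_intersection, estimated, relative_error)
```

In this made-up case three 1% errors turn into a 17% error on the intersection, because the absolute errors are measured against a much smaller number.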
![Page 19: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/19.jpg)
Use cases
● Advertising
○ ad viewership, website views, television viewership, app engagement, etc.
● Any application where you would want to count a large number of unique things fast
○ stock trades, network traffic, Twitter responses, election data, real-time voting data, etc.
● Well suited to real-time analytics
○ intermediate state of HLL structure provides for a running count
○ trivially parallelizable
Sample record: Ad ID | Unique User ID | Gender | Age segment (e.g. 18-34) | Time stamp
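The "trivially parallelizable" point follows from how the registers combine: taking the register-wise maximum of two partial sketches gives the same state as sketching the combined stream. The sketch below illustrates this with hypothetical helper functions and an assumed 2^12 registers; it is not Algebird's API.

```python
import hashlib

P = 12  # assumption: 2**12 registers, chosen only for illustration

def sketch(items, p=P):
    """Build HLL registers (max rank per bucket) from an iterable of items."""
    registers = [0] * (1 << p)
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        index = x >> (64 - p)                        # first p bits pick the register
        rest = x & ((1 << (64 - p)) - 1)             # remaining bits
        rank = (64 - p) - rest.bit_length() + 1      # leading zeros plus one
        registers[index] = max(registers[index], rank)
    return registers

def merge(a, b):
    """Combine two sketches by taking the larger value in each register."""
    return [max(x, y) for x, y in zip(a, b)]

# Two workers see overlapping parts of the stream.
part1 = sketch(f"user-{i}" for i in range(0, 6_000))
part2 = sketch(f"user-{i}" for i in range(4_000, 10_000))
combined = sketch(f"user-{i}" for i in range(10_000))

print(merge(part1, part2) == combined)   # True
```

Because the merge is a plain maximum, partial sketches from different workers or different mini-batches can be combined in any order, and the merged state at any moment serves as the running count.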
![Page 20: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/20.jpg)
Future exploration
● Associating segments with user IDs
○ quantifying incremental error associated with introduction of Bloom filters
● Apache Storm versus Spark
○ Does Storm (a “pure” streaming technology) perform much better?
● Spark DataFrames API
○ seemed to introduce significant delay: would like to quantify this
![Page 21: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/21.jpg)
Bloom Filters
● Experiment with 1 million records
○ Employed 2 Bloom filters (1 MB each), one for each segment (male, 18-34), to store segment data to be matched with incoming user IDs; continued processing with Hyperloglog
○ estimated error for Hyperloglog: 2%; estimated error for Bloom filter: 3%
● Actual error:
○ Bloom filter + Hyperloglog: count males: 1.2%; count 18-34: 0.6%; intersection: 5.9%
○ Hyperloglog only: count males: 1.4%; count 18-34: 0.7%; intersection: 5.6%
● Time to process:
○ Bloom filter + Hyperloglog: 17s (+55%)
○ Hyperloglog only: 11s
![Page 22: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/22.jpg)
Bloom Filters
Source: Wikipedia
![Page 23: Hyperloglog Project](https://reader031.vdocuments.mx/reader031/viewer/2022030215/588a48da1a28abd3088b585f/html5/thumbnails/23.jpg)
Tuning Probabilistic Structures
Hyperloglog (source: Twitter Algebird source code: HyperLogLog.scala)
Bloom Filters (source: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)
e.g. n = 1 M (capacity), p = 0.03 (error)
=> k = 5 (# of hash functions)
=> m = 891 kB
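The Bloom filter numbers above can be reproduced with the standard sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The function name below is made up for illustration, and the sketch assumes the slide's "kB" means 1024 bytes, which is what makes the result match.

```python
import math

def bloom_parameters(capacity, error_rate):
    """Return (bits, hash functions) for a Bloom filter sized to hold
    `capacity` items at the target false-positive `error_rate`."""
    m_bits = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
    k = round(m_bits / capacity * math.log(2))
    return m_bits, k

m_bits, k = bloom_parameters(1_000_000, 0.03)
size_kib = m_bits / 8 / 1024          # bits -> bytes -> units of 1024 bytes

print(k, round(size_kib))             # 5 891
```

Lowering p grows the filter: each factor-of-ten reduction in the false-positive rate costs about 4.8 more bits per stored item.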