hyperloglog project

Count me once, count me fast!

Probabilistic methods in real-time streaming

(Hyperloglog, Bloom filters)

Kendrick Lo

Insight Data Engineering, NYC

Summer 2016

real-time viewing data

[Figure: a stream of records, each with an Ad ID, a Unique User ID, and a Time stamp]

Unique User ID, Unique User ID, Unique User ID, Unique User ID, ... ?



● bitmap (for exact counting): 13 MB for 100 million uniques

● hyperloglog: 4 KB for billions of uniques
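A rough back-of-the-envelope check of these sizes, as a sketch only. It assumes the bitmap spends one bit per possible user ID and that the hyperloglog structure holds 2^12 one-byte registers; both layouts are illustrative assumptions, not details taken from the slides.

```python
def bitmap_bytes(universe_size: int) -> int:
    """Bytes needed for an exact bitmap with one bit per possible ID (rounded up)."""
    return (universe_size + 7) // 8


def hll_bytes(index_bits: int, bytes_per_register: int = 1) -> int:
    """Bytes needed for 2**index_bits registers (register width is an assumption)."""
    return (1 << index_bits) * bytes_per_register


print(bitmap_bytes(100_000_000) / 1e6)  # 12.5 (MB), which the slide rounds to 13 MB
print(hll_bytes(12) / 1024)             # 4.0 (KB)
```

The bitmap grows linearly with the number of possible IDs, while the register array stays the same size however many IDs are seen.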

Hyperloglog

Count-distinct problem (a.k.a. cardinality estimation problem)

● counting unique elements in a data stream with repeated elements

● calculates an approximate number
○ typical error purported to be less than 2%

What it can’t do:

● give an exact count
● track frequency of occurrence
● confirm whether a certain element was seen

Hyperloglog - a probabilistic method

General Idea: Count leading zeros in a randomly generated binary number

Given a random number, what is the probability of seeing…?

1 x x x x x x x x… → 0.5 (1 out of every 2)
0 1 x x x x x x x… → 0.25 (1 out of every 4)
0 0 1 x x x x x x… → 0.125 (1 out of every 8)
…
0 0 0 0 0 0 1 x x… → 0.008 (1 out of every 128)
...
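The probabilities above follow from each bit being 0 or 1 with equal chance: a prefix of k zeros followed by a 1 has probability 2^-(k+1). A small sketch that compares this with a simulation; the 32-bit width, the sample size, and the function names are arbitrary choices for illustration.

```python
import random


def leading_zeros(value: int, width: int = 32) -> int:
    """Number of zero bits before the first 1 in a fixed-width binary number."""
    return width - value.bit_length()


def pattern_probability(zeros: int) -> float:
    """Chance that a uniform random bit string starts with `zeros` zeros, then a 1."""
    return 0.5 ** (zeros + 1)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [leading_zeros(rng.getrandbits(32)) for _ in range(200_000)]

for zeros in (0, 1, 2, 6):
    observed = samples.count(zeros) / len(samples)
    print(zeros, pattern_probability(zeros), round(observed, 4))
```

The observed frequencies should land close to 0.5, 0.25, 0.125, and 1/128.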


Question:

I have a list of N unique numbers. The one with the longest string of leading zeros is

0 0 0 0 0 0 1 x x…

What is N?
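One way to reason about the question: a pattern seen about once in every 128 random numbers suggests that roughly 128 numbers were drawn. A minimal sketch of that single-observation guess follows. Hashing IDs with SHA-1 and keeping 32 bits is an assumption made here so that arbitrary IDs behave like random numbers.

```python
import hashlib


def estimate_from_zeros(max_zeros: int) -> int:
    """Guess N from the longest run of leading zeros seen: 2 ** (zeros + 1)."""
    return 2 ** (max_zeros + 1)


def max_leading_zeros(items, width: int = 32) -> int:
    """Longest run of leading zeros over the hashed items (first 32 bits of SHA-1)."""
    longest = 0
    for item in items:
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:4], "big")
        longest = max(longest, width - h.bit_length())
    return longest


print(estimate_from_zeros(6))  # 128, the answer suggested by the pattern above

ids = [f"user-{i}" for i in range(1000)]
# Repeats hash to the same value, so they cannot change the estimate.
print(estimate_from_zeros(max_leading_zeros(ids)) ==
      estimate_from_zeros(max_leading_zeros(ids * 3)))  # True
```

A single observation like this is very noisy and can only return powers of two, which is why averaging over many observations is needed.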


Hyperloglog

[Figure: a stream of IDs; longest run of leading zeros observed]

ID, ID, ID, ... → 6 => 128 unique viewers

ID, ID, ID, ... → 5 6 7 4 6 8 ... → (harmonic) MEAN: 6
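The averaging step can be sketched as a small HyperLogLog class. This is an illustrative version written for this note, not the Algebird implementation used in the project: the class name, the use of SHA-1 as the hash, and the default of 12 index bits are all assumptions.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative, not the Algebird implementation)."""

    def __init__(self, index_bits: int = 12):
        if index_bits < 7:
            raise ValueError("the alpha constant below assumes at least 128 registers")
        self.index_bits = index_bits
        self.num_registers = 1 << index_bits
        self.registers = [0] * self.num_registers
        # Bias-correction constant for 128 or more registers.
        self.alpha = 0.7213 / (1 + 1.079 / self.num_registers)

    def add(self, item: str) -> None:
        # Assumption: the first 64 bits of SHA-1 stand in for a uniform random hash.
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        rest_bits = 64 - self.index_bits
        index = h >> rest_bits                     # top bits choose a register
        rest = h & ((1 << rest_bits) - 1)          # remaining bits are checked for zeros
        rank = rest_bits - rest.bit_length() + 1   # leading zeros + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self) -> float:
        m = self.num_registers
        # Harmonic-mean style combination of the per-register observations.
        raw = self.alpha * m * m / sum(2.0 ** -r for r in self.registers)
        empty = self.registers.count(0)
        if raw <= 2.5 * m and empty:
            return m * math.log(m / empty)  # small-range correction (linear counting)
        return raw


hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # a repeat leaves the registers unchanged
print(round(hll.count()))  # expected to be near 50000
```

Each register keeps only the largest rank it has seen, so memory stays fixed at the number of registers regardless of how many IDs arrive.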

Pipeline

[Figure: pipeline diagram]

Record fields: Ad ID, Unique User ID, Gender, Age segments, Time stamp

● Algebird
● 4 x m4.large
● 1 sec mini-batches

Pushed 1 billion records with unique user IDs

● Throughput can reach an average of 5M records/min

● Streams of <1M records processed within a minute

● When using sets, delays accumulate after more than 1M uniques, causing system instability

Extension: counting unique viewers in a subgroup

● Associating segments with user IDs
○ Challenge: Can we avoid database accesses when processing data in real-time?
○ Bloom filter: another fixed-size probabilistic data structure that trades off (tunable) accuracy for size
e.g. Bloom filter + Hyperloglog count males error: 1.2%
○ needed to overcome challenges in combining aspects of Spark (batch) and Spark Streaming

Sample record: Ad ID | Unique User ID | Gender | Age segment (e.g. 18-34) | Time stamp
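To make the Bloom filter idea concrete, here is a minimal sketch of one that could hold the user IDs belonging to a segment, so that membership can be checked in memory rather than through a database lookup. The class, the SHA-256 double-hashing scheme, and the sizes in the usage lines are illustrative assumptions, not the project's code.

```python
import hashlib


class BloomFilter:
    """Fixed-size set membership test with false positives but no false negatives."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two 64-bit halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step avoids a zero stride
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


# Hypothetical usage: load the IDs of one segment, then test incoming records.
male_users = BloomFilter(num_bits=8000, num_hashes=5)
for i in range(1000):
    male_users.add(f"user-{i}")

print("user-42" in male_users)  # True: an added ID is always found
```

An ID that was never added is usually reported as absent, but it can occasionally be reported as present; that false-positive rate is the accuracy being traded for size.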

About me

Master of Science, Harvard University Computational Science and Engineering (graduated May 2016)

J.D. / MBA, University of Toronto

Bachelor of Applied Science, University of Toronto Engineering Science (Computer)


Thank you for listening!

appendix

[Set structures]

[HLL structures]

Results: error rate in counts

● Error < 2% for subgroups; slightly higher for main group

● Error for intersection calculation (purple) tends to be higher on average
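One common way to estimate an intersection from cardinality estimates is inclusion-exclusion, |A ∩ B| = |A| + |B| - |A ∪ B|. The slides do not state how the intersection was computed, so this is an assumption. The sketch below uses made-up segment sizes to show how small errors in the three inputs can turn into a larger relative error in the result.

```python
def intersection_estimate(count_a: float, count_b: float, count_union: float) -> float:
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return count_a + count_b - count_union


def relative_error(estimate: float, truth: float) -> float:
    return abs(estimate - truth) / truth


# Hypothetical true sizes, chosen only for illustration.
true_a, true_b, true_union = 500_000, 400_000, 700_000
true_both = intersection_estimate(true_a, true_b, true_union)  # 200000

# Suppose each input is off by 1%, in the directions that hurt the most.
est_both = intersection_estimate(true_a * 1.01, true_b * 1.01, true_union * 0.99)

print(relative_error(est_both, true_both))  # about 0.08, i.e. 8% from 1% inputs
```

Because the result is a difference of larger numbers, absolute errors add up while the quantity being estimated is smaller, which is consistent with the higher intersection error reported above.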

Use cases

● Advertising
○ ad viewership, website views, television viewership, app engagement, etc.

● Any application where you would want to count a large number of unique things fast
○ stock trades, network traffic, Twitter responses, election data, real-time voting data, etc.

● Well suited to real-time analytics
○ intermediate state of HLL structure provides for a running count
○ trivially parallelizable
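The parallelism comes from the fact that two register arrays built on different parts of a stream can be combined by taking the element-wise maximum. A minimal sketch, assuming 2^10 registers and SHA-1 as the hash (both arbitrary choices for illustration):

```python
import hashlib

INDEX_BITS = 10                    # assumed register count: 2**10
NUM_REGISTERS = 1 << INDEX_BITS
REST_BITS = 64 - INDEX_BITS


def build_registers(items):
    """Build HyperLogLog-style registers for one partition of the stream."""
    registers = [0] * NUM_REGISTERS
    for item in items:
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        index = h >> REST_BITS
        rest = h & ((1 << REST_BITS) - 1)
        rank = REST_BITS - rest.bit_length() + 1  # leading zeros + 1
        registers[index] = max(registers[index], rank)
    return registers


def merge(left, right):
    """Combine two partitions: keep the larger value in each register."""
    return [max(a, b) for a, b in zip(left, right)]


part_1 = [f"user-{i}" for i in range(0, 6000)]
part_2 = [f"user-{i}" for i in range(4000, 10000)]  # overlaps with part_1

merged = merge(build_registers(part_1), build_registers(part_2))
print(merged == build_registers(part_1 + part_2))  # True
```

Since the merged registers are identical to those of a single pass over all the data, partitions can be processed on separate workers and combined in any order.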


Future exploration

● Associating segments with user IDs
○ quantifying incremental error associated with introduction of Bloom filters

● Apache Storm versus Spark
○ Does Storm (a “pure” streaming technology) perform much better?

● Spark DataFrames API
○ seemed to introduce significant delay: would like to quantify this

Bloom Filters

● Experiment with 1 million records
○ Employed 2 bloom filters (1 MB each), one for each segment (male, 18-34) to store segment data to be matched with incoming user IDs, continued processing with Hyperloglog
○ estimated error for hyperloglog: 2%; estimated error for bloom filter: 3%

● Actual error:
○ Bloom filter + Hyperloglog: count males: 1.2%; count 18-34: 0.6%; intersection: 5.9%
○ Hyperloglog only: count males: 1.4%; count 18-34: 0.7%; intersection: 5.6%

● Time to process:
○ Bloom filter + Hyperloglog: 17s (+55%)
○ Hyperloglog only: 11s

Bloom Filters

[Figure: Bloom filter illustration. Source: Wikipedia]

Tuning Probabilistic Structures

Hyperloglog (source: Twitter Algebird source code: HyperLogLog.scala)
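For Hyperloglog, the usual tuning knob is the number of registers m = 2^bits, with a standard error of roughly 1.04 / sqrt(m). The sketch below applies that well-known approximation; the function names are chosen here for illustration and are not claimed to match the Algebird API.

```python
import math


def hll_standard_error(index_bits: int) -> float:
    """Approximate relative standard error for 2**index_bits registers."""
    return 1.04 / math.sqrt(1 << index_bits)


def bits_for_error(target_error: float) -> int:
    """Smallest number of index bits whose standard error is at most the target."""
    return math.ceil(2 * math.log2(1.04 / target_error))


print(hll_standard_error(12))  # 0.01625, i.e. about 1.6% with 4096 registers
print(bits_for_error(0.02))    # 12
```

Each extra index bit doubles the memory but reduces the error only by a factor of about 1.41, so halving the error costs four times the space.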

Bloom Filters (source: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)

e.g. n = 1 M (capacity), p = 0.03 (error)

=> k = 5 (# of hash functions)
=> m = 891 kB
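The example figures can be reproduced with the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. A short sketch follows. The function name is hypothetical, and reading "kB" as 1024 bytes is an assumption.

```python
import math


def bloom_parameters(capacity: int, error_rate: float):
    """Return (bits, hash_count) for a target capacity and false-positive rate."""
    bits = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
    hash_count = round(bits / capacity * math.log(2))
    return bits, hash_count


bits, hash_count = bloom_parameters(1_000_000, 0.03)
print(hash_count)                # 5
print(round(bits / 8 / 1024))    # 891 (kB, taking 1 kB = 1024 bytes)
```

Lowering the error rate raises both the size and the number of hash functions, while doubling the capacity at a fixed error rate roughly doubles the size.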
