hyperloglog project

Count me once, count me fast! Probabilistic methods in real-time streaming (Hyperloglog, Bloom filters) Kendrick Lo Insight Data Engineering, NYC Summer 2016

Upload: kendrick-lo

Post on 27-Jan-2017


Category:

Engineering



TRANSCRIPT

Page 1: Hyperloglog Project

Count me once, count me fast!
Probabilistic methods in real-time streaming

(Hyperloglog, Bloom filters)

Kendrick Lo
Insight Data Engineering, NYC
Summer 2016

Page 2: Hyperloglog Project

[Figure: a stream of real-time viewing data. Each record carries an Ad ID, a Unique User ID and a Time stamp.]

Unique User ID, Unique User ID, Unique User ID, Unique User ID, ... ?

Page 3: Hyperloglog Project

[Figure: the same stream of real-time viewing data (Ad ID, Unique User ID, Time stamp per record), with two ways to count the unique user IDs.]

bitmap (for exact counting): 13 MB for 100 million uniques

hyperloglog: 4 KB for billions of uniques
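The two sizes on this slide can be checked with a little arithmetic. The sketch below assumes the bitmap spends one bit per possible user ID, and that the Hyperloglog structure uses 2^12 one-byte registers; the register layout is an assumption chosen because it gives 4 KB, not something the slides state.

```python
def bitmap_bytes(universe_size: int) -> int:
    """Bytes needed for an exact bitmap with one bit per possible ID."""
    return (universe_size + 7) // 8


def hll_bytes(index_bits: int, bytes_per_register: int = 1) -> int:
    """Bytes for a Hyperloglog sketch with 2**index_bits registers.

    One byte per register is an assumption made for this illustration.
    """
    return (1 << index_bits) * bytes_per_register


print(bitmap_bytes(100_000_000) / 1e6)  # 12.5 (MB), which the slide rounds to 13 MB
print(hll_bytes(12))                    # 4096 bytes, i.e. 4 KB
```

The bitmap grows linearly with the number of possible IDs, while the sketch size is fixed up front regardless of how many IDs arrive.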

Page 4: Hyperloglog Project

Hyperloglog

Count-distinct problem (a.k.a. cardinality estimation problem)

● counts unique elements in a data stream with repeated elements
● calculates an approximate number
  ○ typical error is reported to be less than 2%

What it can’t do:

● give an exact count
● track frequency of occurrence
● confirm whether a certain element was seen

Page 5: Hyperloglog Project

Hyperloglog - a probabilistic method

General Idea: Count leading zeros in a randomly generated binary number

Given a random number, what is the probability of seeing…?

1 x x x x x x x x… → 0.5 (1 out of every 2)
0 1 x x x x x x x… → 0.25 (1 out of every 4)
0 0 1 x x x x x x… → 0.125 (1 out of every 8)
…
0 0 0 0 0 0 1 x x… → 0.008 (1 out of every 128)
...
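These probabilities can be checked empirically. The sketch below draws random 32-bit numbers and measures how often each prefix appears; the bit width, sample size, seed and function names are all choices made for this illustration.

```python
import random

BITS = 32  # width of the random numbers (an arbitrary choice for this sketch)


def leading_zeros(value: int, bits: int = BITS) -> int:
    """Number of zero bits before the first 1 in a fixed-width binary number."""
    return bits - value.bit_length()


def prefix_frequency(zeros: int, samples: int = 200_000, seed: int = 0) -> float:
    """Fraction of random numbers that start with exactly `zeros` zeros, then a 1."""
    rng = random.Random(seed)
    hits = sum(leading_zeros(rng.getrandbits(BITS)) == zeros for _ in range(samples))
    return hits / samples


for zeros in (0, 1, 2, 6):
    # Compare the observed frequency with the theoretical 1 / 2**(zeros + 1).
    print(zeros, prefix_frequency(zeros), 1 / 2 ** (zeros + 1))
```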

Page 6: Hyperloglog Project

Hyperloglog - a probabilistic method

General Idea: Count leading zeros in a randomly generated binary number

Given a random number, what is the probability of seeing…?

1 x x x x x x x x… → 0.5 (1 out of every 2)
0 1 x x x x x x x… → 0.25 (1 out of every 4)
0 0 1 x x x x x x… → 0.125 (1 out of every 8)
…
0 0 0 0 0 0 1 x x… → 0.008 (1 out of every 128)
...

Question:

I have a list of N unique numbers. The one with the longest string of leading zeros is

0 0 0 0 0 0 1 x x…

What is N?
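One way to reason about the question, sketched under the assumption that the N numbers are independent and uniformly random. The symbols \rho (position of the first 1 bit) and \hat{N} (the estimate) are introduced here for illustration only.

```latex
\begin{align*}
P(\rho = k) &= 2^{-k}, \qquad \rho = (\text{number of leading zeros}) + 1 \\
\text{longest pattern } 0\,0\,0\,0\,0\,0\,1 &\;\Rightarrow\; \rho_{\max} = 7 \\
\hat{N} &\approx 2^{\rho_{\max}} = 2^{7} = 128
\end{align*}
```

A pattern that shows up about once in every 128 random numbers suggests that roughly 128 numbers were seen. A single maximum is a noisy, order-of-magnitude estimate, which is why averaging over many such observations helps.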

Page 7: Hyperloglog Project

Hyperloglog

[Figure: incoming user IDs are mapped to leading-zero counts.]

5 6 7 4 6 8 ... → (harmonic) MEAN: 6

6 => 128 unique viewers
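The averaging idea can be sketched as a minimal Hyperloglog in Python. This is an illustrative implementation under stated assumptions, not the Algebird code used in the project: the SHA-1 hash, the default of 2^12 registers and the class and method names are choices made here. The slide shows a simplified picture; the published algorithm takes a harmonic mean of 2^register values and applies a bias-correction constant, which is what the sketch does.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal Hyperloglog sketch (illustrative, not production code)."""

    def __init__(self, index_bits: int = 12):
        self.p = index_bits
        self.m = 1 << index_bits            # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant from the Hyperloglog paper (valid for m >= 128).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        # 64-bit hash of the item; SHA-1 is an arbitrary choice for this sketch.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        index = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> float:
        # Harmonic mean of 2**register, scaled by alpha * m**2.
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        empty = self.registers.count(0)
        if raw <= 2.5 * self.m and empty:
            # Small-range correction (linear counting) from the paper.
            return self.m * math.log(self.m / empty)
        return raw


hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # repeated IDs do not change the estimate
print(round(hll.count()))
```

Adding the same ID twice hashes to the same register and the same rank, so duplicates never inflate the count.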

Page 8: Hyperloglog Project

Pipeline

Record: Ad ID | Unique User ID | Gender | Age segments | Time stamp

Algebird

4 x m4.large

1 sec mini-batches

Pushed 1 billion records with unique user IDs

Page 9: Hyperloglog Project

● Throughput can reach an average of 5M records/min

● Streams of <1M records processed within a minute

Page 10: Hyperloglog Project
Page 11: Hyperloglog Project

● When using sets (exact counting), delays accumulate beyond about 1M uniques and make the system unstable

Page 12: Hyperloglog Project

Extension: counting unique viewers in a subgroup

● Associating segments with user IDs
  ○ Challenge: Can we avoid database accesses when processing data in real-time?
  ○ Bloom filter: another fixed-size probabilistic data structure that trades off (tunable) accuracy for size
    e.g. Bloom filter + Hyperloglog, count males: error 1.2%
  ○ needed to overcome challenges in combining aspects of Spark (batch) and Spark Streaming

Sample record: Ad ID | Unique User ID | Gender | Age segment (e.g. 18-34) | Time stamp
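The Bloom filter idea can be sketched as follows: the user IDs belonging to a segment are loaded into a fixed-size bit array once, and each incoming record is then checked against it in memory instead of querying a database. This is a minimal illustration, not the project's implementation; the SHA-256 double-hashing scheme, the sizes and all names are assumptions made here.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two 64-bit hashes (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


# Hypothetical segment data: user IDs known to belong to the "male" segment.
males = BloomFilter(num_bits=73_000, num_hashes=5)
for i in range(10_000):
    males.add(f"user-{i}")

print("user-42" in males)  # True: members are always reported as present
```

A lookup can wrongly report a non-member as present, which is the accuracy that is traded for the fixed size; it never misses a real member.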

Page 13: Hyperloglog Project

About me

Master of Science, Harvard University Computational Science and Engineering (graduated May 2016)

J.D. / MBA, University of Toronto

Bachelor of Applied Science, University of Toronto Engineering Science (Computer)

Page 14: Hyperloglog Project


Thank you for listening!

Page 15: Hyperloglog Project

appendix

Page 16: Hyperloglog Project

[Set structures]

Page 17: Hyperloglog Project

[HLL structures]

Page 18: Hyperloglog Project

Results: error rate in counts

● Error < 2% for subgroups; slightly higher for main group

● Error for intersection calculation (purple) tends to be higher on average
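One possible explanation for the higher intersection error, assuming the intersection is derived by inclusion-exclusion from three separate estimates (the slides do not state how it was computed): small relative errors in large counts become a large relative error in their much smaller difference. The numbers below are made up purely to show the arithmetic.

```python
def intersection_estimate(count_a: int, count_b: int, count_union: int) -> int:
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return count_a + count_b - count_union


# Hypothetical true counts: 500k in A, 300k in B, 700k in the union.
true_value = intersection_estimate(500_000, 300_000, 700_000)

# Each estimate is off by only 1%, but in the unluckiest directions.
worst_case = intersection_estimate(505_000, 303_000, 693_000)

print(true_value, worst_case, (worst_case - true_value) / true_value)
```

In this made-up case, three 1% errors combine into a 15% error on the intersection.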

Page 19: Hyperloglog Project

Use cases

● Advertising
  ○ ad viewership, website views, television viewership, app engagement, etc.
● Any application where you would want to count a large number of unique things fast
  ○ stock trades, network traffic, twitter responses, election data, real-time voting data, etc.
● Well suited to real-time analytics
  ○ intermediate state of HLL structure provides for a running count
  ○ trivially parallelizable

Sample record: Ad ID | Unique User ID | Gender | Age segment (e.g. 18-34) | Time stamp
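The running count and the parallelism both come from how Hyperloglog structures combine: two sketches built with the same hash and the same number of registers merge by taking the larger value in each register. The sketch below shows that merge on plain lists of register values; the function name and the sample values are illustrative.

```python
from functools import reduce
from typing import List


def merge_registers(a: List[int], b: List[int]) -> List[int]:
    """Merge two Hyperloglog register arrays by taking the register-wise maximum."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same number of registers")
    return [max(x, y) for x, y in zip(a, b)]


# Registers from three hypothetical mini-batches or workers.
batches = [[5, 6, 7, 4], [4, 6, 8, 2], [1, 9, 3, 4]]

running = reduce(merge_registers, batches)
print(running)  # [5, 9, 8, 4]
```

Because the maximum is associative and commutative, partial sketches can be merged in any order, on any worker, and the result matches a single sketch that saw the whole stream.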

Page 20: Hyperloglog Project

Future exploration

● Associating segments with user IDs
  ○ quantifying incremental error associated with introduction of Bloom filters
● Apache Storm versus Spark
  ○ Does Storm (a “pure” streaming technology) perform much better?
● Spark DataFrames API
  ○ seemed to introduce significant delay: would like to quantify this

Page 21: Hyperloglog Project

Bloom Filters

● Experiment with 1 million records
  ○ Employed 2 bloom filters (1 MB each), one for each segment (male, 18-34), to store segment data to be matched with incoming user IDs; continued processing with Hyperloglog
  ○ estimated error for hyperloglog: 2%; estimated error for bloom filter: 3%
● Actual error:
  ○ Bloom filter + Hyperloglog: count males: 1.2%; count 18-34: 0.6%; intersection: 5.9%
  ○ Hyperloglog only: count males: 1.4%; count 18-34: 0.7%; intersection: 5.6%
● Time to process:
  ○ Bloom filter + Hyperloglog: 17s (+55%)
  ○ Hyperloglog only: 11s

Page 22: Hyperloglog Project

Bloom Filters

Source: Wikipedia

Page 23: Hyperloglog Project

Tuning Probabilistic Structures

Hyperloglog
(source: Twitter Algebird source code: HyperLogLog.scala)

Bloom Filters
(source: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)

e.g. n = 1 M (capacity), p = 0.03 (error)
=> k = 5 (# of hash functions)
=> m = 891 kB
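The Bloom filter example above follows from the standard sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The sketch below applies them, and also shows the usual Hyperloglog standard-error estimate of 1.04 / sqrt(number of registers). Function names are chosen for this illustration, and reading "kB" as 1024 bytes is an assumption.

```python
import math
from typing import Tuple


def bloom_parameters(n: int, p: float) -> Tuple[int, int]:
    """Return (bits, hash functions) for capacity n and false-positive rate p."""
    m_bits = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = round(m_bits / n * math.log(2))
    return m_bits, k


def hll_standard_error(index_bits: int) -> float:
    """Standard error of a Hyperloglog sketch with 2**index_bits registers."""
    return 1.04 / math.sqrt(2 ** index_bits)


m_bits, k = bloom_parameters(1_000_000, 0.03)
print(k, m_bits / 8 / 1024)     # hash functions, size in units of 1024 bytes
print(hll_standard_error(12))   # error for 4096 registers
```

With n = 1 M and p = 0.03 this gives k = 5 and a size of about 891 kB, matching the slide; 4096 registers give a standard error of about 1.6%, in line with the sub-2% figure quoted earlier.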