A Formal Analysis of Conservative Update Based Approximate Counting

Gil Einziger and Roy Friedman, Technion, Haifa

Post on 04-Jan-2016

Approximate Counting

We wish to count the number of occurrences of various items from a very large domain.

To gain space efficiency, we are willing to tolerate an “approximate count” only.

Bloom Filters

• An array BF of m bits and k hash functions {h1,…,hk}, each mapping objects to positions in [0,…,m-1]
• Adding an object obj to the Bloom filter is done by computing h1(obj),…, hk(obj) and setting the corresponding bits in BF
• Checking for set membership for an object cand is done by computing h1(cand),…, hk(cand) and verifying that all corresponding bits are set

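Example (figure): m=11, k=3

h1(o1)=0, h2(o1)=7, h3(o1)=5: adding o1 sets bits 0, 5 and 7 of BF.

h1(o2)=0, h2(o2)=7, h3(o2)=4: bit 4 is not set, so the membership check for o2 fails (×).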
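The add and membership operations above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `BloomFilter`, the method names, and the way the k hash functions are simulated (salting SHA-256 with the function index) are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Bit array BF of m bits, probed by k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [False] * m

    def _positions(self, obj):
        # Assumption: h1..hk are simulated by salting SHA-256 with the
        # function index i; each value is reduced to the range [0, m-1].
        positions = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{obj}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.m)
        return positions

    def add(self, obj):
        # Set every bit that h1(obj), ..., hk(obj) point at.
        for pos in self._positions(obj):
            self.bits[pos] = True

    def might_contain(self, cand):
        # True only if all corresponding bits are set; false positives
        # are possible, false negatives are not.
        return all(self.bits[pos] for pos in self._positions(cand))


bf = BloomFilter(m=11, k=3)
bf.add("o1")
```

With such a small m the filter saturates quickly; the values m=11 and k=3 are taken from the slide only to keep the example readable.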

Counting Bloom Filters

• A vector of counters (instead of bits)
• A counting Bloom filter supports the operations:
– Increment
  • Increment by 1 all entries that correspond to the results of the k hash functions
– Decrement
  • Decrement by 1 all entries that correspond to the results of the k hash functions
– Estimate (instead of get)
  • Return the minimal value of all corresponding entries

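Example (figure): m=11, k=3, h1(o1)=0, h2(o1)=7, h3(o1)=5

The CBF counters at positions 0, 5 and 7 hold 3, 6 and 8. After Increment(o1) they hold 4, 7 and 9, so Estimate(o1)=4.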
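The three operations can be written directly over a list of counters. The sketch below replays the slide's example; the function names are illustrative, the hash results are passed in as a list of positions, and setting all other counters to 0 is an assumption (the slide does not show them).

```python
def increment(counters, positions):
    """Add 1 to every counter the k hash functions point at."""
    for pos in positions:
        counters[pos] += 1


def decrement(counters, positions):
    """Subtract 1 from every counter the k hash functions point at."""
    for pos in positions:
        counters[pos] -= 1


def estimate(counters, positions):
    """Return the minimal value of all corresponding counters."""
    return min(counters[pos] for pos in positions)


# Slide example: m=11, h1(o1)=0, h2(o1)=7, h3(o1)=5.
counters = [0] * 11  # assumption: counters not shown on the slide are 0
counters[0], counters[5], counters[7] = 3, 6, 8
o1_positions = [0, 7, 5]

increment(counters, o1_positions)
o1_estimate = estimate(counters, o1_positions)
```

After the increment the counters of o1 read 4, 9 and 7 in hash order, and the estimate is their minimum.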

Conservative Update Technique

• Give up the ability to Decrement in favor of accuracy/space efficiency
– During an Increment operation, only update the lowest counters

Example (figure): m=11, k=3, h1(o1)=0, h2(o1)=7, h3(o1)=5. The SBF-MI counters at positions 0, 5 and 7 hold 3, 6 and 8; Increment(o1) only adds to the first entry (3->4).

Empirically shown to improve accuracy! Up to two orders of magnitude for some workloads.
– But not formally understood.
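A conservative update increment differs from the plain counting Bloom filter increment in a single step: only the counters that currently hold the minimum are raised. The sketch below is illustrative; the function names are made up, and treating counters not shown on the slide as 0 is an assumption.

```python
def conservative_increment(counters, positions):
    """Raise only the lowest of the counters the hash functions point at."""
    lowest = min(counters[pos] for pos in positions)
    # set(): if two hash functions share a position, raise it only once.
    for pos in set(positions):
        if counters[pos] == lowest:
            counters[pos] += 1


def estimate(counters, positions):
    """Return the minimal value of all corresponding counters."""
    return min(counters[pos] for pos in positions)


# Slide example: m=11, h1(o1)=0, h2(o1)=7, h3(o1)=5.
counters = [0] * 11  # assumption: counters not shown on the slide are 0
counters[0], counters[5], counters[7] = 3, 6, 8
o1_positions = [0, 7, 5]

conservative_increment(counters, o1_positions)
```

Only the counter at position 0 moves (3->4); the counters holding 6 and 8 are untouched, so other items that share them are not inflated. Because counters are no longer raised together, a matching Decrement cannot be supported.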

Motivation

• Applications:
– Network measurements and heavy hitters.
– Network security: anomaly detection.
– Cache admission policy

Additional applications in other fields: e.g. databases and natural language processing.

TinyLFU - Cache Admission Policy (PDP 2014)

[Figure: frequency vs. rank of the accessed items]

• The access distribution of most content is skewed
▫ Often modeled using Zipf-like functions, power-law, etc.
• A small number of very popular items, for example ~(50% of the weight)
• Long heavy tail, for example ~(50% of the weight)

Eviction and Admission Policies

[Figure: cache, victim, new item, winner. Eviction policy: “One of you guys should leave …” Admission policy: “Is the new item any better than the victim?”]

What is the common answer?

Conservative Update - Intuition

• Conservative Update allows counting just the head items, with high accuracy, so our cache can make educated admission decisions.

[Figure labels: undesired items, desired items]
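The admission question ("is the new item any better than the victim?") can be sketched as a comparison of two frequency estimates. This is a simplified illustration under stated assumptions: the function `admit`, the helper `lookup` and the sample counts are hypothetical, and a real cache would take its estimates from an approximate counting structure rather than from a dictionary.

```python
def admit(estimate, candidate, victim):
    """Admit the new item only if it is estimated to be more frequent
    than the victim chosen by the eviction policy."""
    return estimate(candidate) > estimate(victim)


# Hypothetical approximate counts, standing in for a frequency sketch.
approximate_counts = {"popular": 12, "rare": 1}


def lookup(item):
    # Items never seen before have an estimate of 0.
    return approximate_counts.get(item, 0)


keep_new_item = admit(lookup, candidate="popular", victim="rare")
```

Ties are resolved in favor of the victim here, which is a design choice of this sketch: an item already in the cache is not replaced unless the newcomer looks strictly better.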

Admission Policy Example

[Figure: hit rate vs. cache size, without admission policy and with frequency based admission policy. Annotations: more memory, better cache management.]

The Basic Observation

[Figure: counter values of an example CBF and of the corresponding LCS]

If we can quantify how many items are inserted to each level in the LCS we can bound the error.

A CBF is exactly like

Simple Observations

• It is useful to discuss the number of items that are inserted to each level of the LCS.
• Since all levels are considered the same, the false positive probability of each level is determined only by the number of items inserted to that level.
• A false positive at a higher level implies a false positive at all lower levels.
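To make the second observation concrete, the sketch below uses the standard approximation for the false positive probability of a Bloom filter with m bits and k hash functions after n insertions, (1 - e^(-kn/m))^k. Treating each level as such a filter is an assumption of this example; the slides do not spell out which formula they use, and the function name is illustrative.

```python
import math


def false_positive_probability(n, m, k):
    """Standard Bloom filter approximation: (1 - e^(-k*n/m))^k.

    n: number of items inserted to the level (assumed reading),
    m: number of positions, k: number of hash functions.
    """
    return (1.0 - math.exp(-k * n / m)) ** k


# The probability depends only on n once m and k are fixed,
# and it grows as more items are inserted to the level.
few_items = false_positive_probability(n=100, m=10_000, k=3)
many_items = false_positive_probability(n=5_000, m=10_000, k=3)
```

Since higher levels receive fewer items than lower ones, this function gives a smaller probability for them, which fits the third observation.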

The Model

• Known (constant) distribution
• Large enough sample
– We assume that we can make a ‘characteristic’ histogram.

Formally: for every count, we know how many items are going to appear that number of times.
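A ‘characteristic’ histogram of this kind can be computed from a sample of the workload. The sketch below is illustrative only: both function names are hypothetical, and the "at least i times" helper is one plausible way to relate the histogram to levels, not a definition taken from the slides.

```python
from collections import Counter


def characteristic_histogram(stream):
    """Map each count c to the number of distinct items appearing c times."""
    frequencies = Counter(stream)          # item -> number of appearances
    return Counter(frequencies.values())   # count -> number of items


def items_appearing_at_least(histogram, i):
    # Hypothetical helper: how many items appear i times or more.
    return sum(n for times, n in histogram.items() if times >= i)


stream = ["a", "a", "a", "b", "b", "c", "d"]
histogram = characteristic_histogram(stream)
```

For this sample, one item appears three times, one appears twice and two appear once.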

Lower Bound

Denote A[i] - the number of items that are actually inserted to level i.

• By definition: $A[i] \geq D[i]$
• A min/max argument about the lowest level that could have experienced a false positive yields the following: $P(F_{Estimated} - F_{Real} \geq k) \leq FP(D[k])$

Upper Bound

• Is derived similarly, by upper bounding A[i].
• Requires a few further assumptions.

Technical details in the paper.

Accurate Configuration – Uniform

Accurate Configuration – Zipf 1

Inaccurate Configuration – Uniform

Inaccurate Configuration – Zipf 1

Real Trace – Counting TCP packets

Summary

• A simple analysis of an extensively used approximate counting optimization.
• First to analyze it for general distributions
• Lower and upper bounds on the model
• Good indicator for real workloads.
• An extended version published as tech report.

Thank You