data streaming algorithms for efﬁcient and accurate ...jx/reprints/sigm04_talk.pdf · data...

Data Streaming Algorithms for Efficient and AccurateEstimation of Flow Size Distribution

Abhishek KumarMinho SungJun (Jim) Xu

Networking and Telecommunications GroupCollege of Computing

Georgia Institute of Technology{akumar,mhsung,jx}@cc.gatech.edu

Jia WangAT&T Labs - Research

jiawang@research.att.com

Problem Statement

Problem: To estimate the probability distribution of flow sizes.In other words, for each positive integeri, estimateni, the numberof flows of sizei.

Problem Statement

100000

1 10 100 1000 10000 100000

flow size

Flow Distribution

Problem Statement

Problem: To estimate the probability distribution of flow sizes.In other words, for each positive integeri, estimateni, the numberof flows of sizei.

100000

1 10 100 1000 10000 100000

flow size

Flow Distribution

Definition of Flow: All packets with the same flow-label.The flow-label can be defined as any combination of fields fromthe IP header, e.g <Source IP, source Port, Dest. IP, Dest. Port,Protocol>.

Overview

Motivation

Related work: Inversion of sampled traces

System Model

Estimating total flows and flows of size 1

A holistic approach to estimate the entire distribution

A multiresolution extension of the mechanism

Motivation

• Knowledge of flow-distribution allows us to infer the usagepattern of the network, in terms of:

– The access bandwidth of the user population.

– Application types.

• It also helps in detecting anomalous events such as:

– Incipient worm infections.

– DDoS attacks.

– Route flapping.

• Enables other measurement applications, such as traffic matrixestimation.

Related work: Inverting sampled packet traces.

• Current measurement boxes collect traces via packet sampling.

• The approach is to invert the sampled distribution to obtain theactual distribution [Duffield et al., SIGCOMM’03].

• High estimation errors due to low sampling rates.

• Practical limitations to inverting sampled traffic [Hohn & Veitch,IMC’03].

Solution Architecture – System Model

streamingresult

distributionStreaming

Online

Module Module

Processing

Offline

1. Update

2. Raw

3. Flow

Packet stream

Header Header Header

Solution Architecture — Insertion Module

•Measurement proceeds in epochs (e.g. 100 seconds).

•Maintain an array of counters in fast memory (SRAM).

• For each packet, a counter is chosen via hashing, and incre-mented.

• No attempt to detect or resolve collisions.

• Data collection is lossy (erroneous), but very fast.

• At the end of the epoch, the counter array is paged to disk.

Array of counters

Array of Counters

Processor

Array of counters

Array of Counters

Processor

Packet arrival

Array of counters

Array of Counters

Processor

Choose location by hashing flow label

Array of counters

Array of Counters

Processor

Increment counter

Array of counters

Array of Counters

Processor

Array of counters

Array of Counters

Processor

Array of counters

Array of Counters

Processor

Array of counters

Array of Counters

Processor1

Array of counters

Array of Counters

Processor1

Array of counters

Array of Counters

Processor1

Collision !!3

Array of counters — Implementation

• Efficient Implementation of a Statistics Counter Architecture.[Ramabhadran & Varghese, SIGMETRICS’03]

• Small (7-bit) counter in fast memory.

• Large (32 or 64 bit) counter in slow memory.

• Perfectly fits our requirements.

Solution Architecture — Estimation Module

• The counter array is processed to obtain theCounter Value Dis-tribution. {m0 =# counters with value ’0’,y1 =# counters withvalue ’1’, · · · , yz =# counters with value ’z’.}• Use Bayesian statistics to derive the following quantities from

the counter value distribution:

– The total no. of flows,n.

– The total no. of flows with exactly one packet,n1.

– The flow distributionφ.

The shape of the “Counter Value Distribution”

100000

1 10 100 1000 10000 100000

flow size

Actual flow distributionm=1024Km=512Km=256Km=128K

The distribution of flow sizes and raw counter values (bothx andy axes are in log-scale).m = number of counters.

Estimating the no. of total flows,n, and flows of size one,n1

• Let total number of counters bem.

• The number of flows hashing to any counterc is modeled bythe Poisson random variable with parameterλ = n/m.

• There is a simple estimator for the total number of flows:n = m ln m

• This is a standard result, first used forProbabilistic Countingby Whang et al.[ACM Trans. Database Sys. 1990].

•We extend this result to derive an estimator for the total numberof flows of size 1:

n1 = y1enm (1)

• Extremely accurate estimates (within±1%).

• It is difficult to extend this approach for arbitraryi.

Estimating the entire flow distribution,φ

• Begin with a guess of the flow distribution,φold.

• Based on thisφold, compute the various possible ways of “split-ting” a particular counter value and the respective probabilitiesof such events.

• This allows us to compute a refined estimate of the flow distri-butionφnew.

• Useφold ← φnew for the next iteration.

• Repeating this multiple times allows the estimate to convergeto a local maximum.

• This is an instance ofExpectation maximization.

Estimating the entire flow distribution — an example

• For example, a counter value of 3 could be caused by threeevents:

– 3 = 3 (no hash collision);– 3 = 1 + 2 (a flow of size 1 colliding with a flow of size 2);– 3 = 1 + 1 + 1 (three flows of size 1 hashed to the same

location)

• Suppose the respective probabilities of these three events are0.5, 0.3, and 0.2 respectively, and there are 1000 counters withvalue 3.

• Then we estimate that 500, 300, and 200 counters split in thethree above ways, respectively.

• So we credit 300 * 1 + 200 * 3 = 900 ton1, the count of size 1flows, and credit 300 and 500 ton2 andn3, respectively.

Evaluation — Before and after running the Estimation algorithm

0.0001

100000

1 10 100 1000 10000 100000

flow size

Actual flow distributionRaw counter values

Estimation using our algorithm

Estimation for small flow sizes

100000

flow size

Actual flow distributionRaw counter values

Estimation using our algorithm

Sampling vs. array of counters – Web traffic.

100000

1 10 100 1000

flow size

Actual flow distributionInferred from sampling,N=10

Inferred from sampling,N=100Estimation using our algorithm

Replot with bucket-based smoothing – Web traffic.

100000

1 10 100 1000

flow size

Sampling vs. array of counters – DNS traffic.

100000

1 10 100

flow size

Variability in the number of flows

• The total number of flows can change by orders of magnitude.

• Our mechanism is sensitive to the “load factor”n/m.

100000

1 10 100 1000 10000 100000

flow size

Actual flow distributionm=1024Km=512Km=256Km=128K

0 5 10 15 20

Iteration

m=1024Km=512Km=256Km=128K

• To accommodate all possible values ofn, we extend our workto a multi-resolution version.

The multi-resolution array of counters

mm m mm

2AA 1 rA

2R r−1R rR1R

A r+1A

r−2m2

r−1m2

• The multi-resolution array of counters allows our scheme tooperate for any value ofn, with graceful degradation in accu-racy for large number of flows.

Original and estimated distributions using MRAC.

0.0001

100000

1 10 100 1000 10000 100000

flow size

Actual flow distributionEstimation using our algorithm

(c)Trace “Long” (563,080 flows).

100000

1 10 100 1000 10000

flow size

Actual flow distributionEstimation using our algorithm

(d)Trace “Short” (55,515 flows).

Conclusions

• Data-Streaming based solution for estimatong flow-distribution.

• Lossy data structure + Bayesian statistics = Accurate stream-ing

• Fast but “lossy” data-collection.

• Estimation using EM algorithm.

• An order of magnitude of improvement in estimation accuracyover sampling based solutions.

•Multiresolution version allows a tradeoff between storage costand accuracy.

Future Work

• Reusing EM results from preceeding epochs to speed up thecomputation.

• Evaluating the suitability of the mechanism for other “uncom-mon” distributions.

• Estimating the flow-size distribution of various subpopulations.

• Data-Streaming solutions to other traffic monitoring problems.

Thank You !

data streaming algorithms for efﬁcient and accurate ...jx/reprints/sigm04_talk.pdf · data...

Documents

accurate and efﬁcient lighting for skinned models ·...

kickstarter: fast and accurate computations on streaming...

gigavoxels: ray-guided streaming for efﬁcient and detailed...

accurate rank ordering of error candidates for efﬁcient...

arxiv:1503.02445v3 [cs.cv] 1 apr...

fast, accurate, automatic scaling decisions for ... ·...

reads: a random walk approach for efﬁcient and accurate...

accurate and efﬁcient modeling of 802.15.4 unslotted...

greedyboost: an accurate, efﬁcient and flexible ensemble...

an efficient hybrid scheme for fast and accurate inversion...

dolha - an efﬁcient and exact data structure for streaming

efﬁcient and accurate construction of genetic linkage...

accurate and efficient wireless device fingerprinting...

spectral graph reduction for efﬁcient image and streaming...

kickstarter: fast and accurate computations on streaming...

an efﬁcient and accurate algorithm for the perspecitve-n...

journal of la accurate, data-efﬁcient, unconstrained text...

accurate and efﬁcient sla compliance...

towards efficient and accurate image matting - catch-all...

find the best path: an efﬁcient and accurate classiﬁer...