sampling based range partition for big data analytics + some extras

43
Sampling Based Range Partition for Big Data Analytics + Some Extras Milan Vojnović Microsoft Research Cambridge, United Kingdom Joint work with Charalampos Tsourakakis, Bozidar Radunovic, Zhenming Liu, Fei Xu, Jingren Zhou QUEST Workshop, September 2012

Upload: destiny-mcgowan

Post on 01-Jan-2016

34 views

Category:

Documents


0 download

DESCRIPTION

Sampling Based Range Partition for Big Data Analytics + Some Extras. Milan Vojnović Microsoft Research Cambridge, United Kingdom Joint work with Charalampos Tsourakakis, Bozidar Radunovic, Zhenming Liu, Fei Xu, Jingren Zhou. INQUEST Workshop, September 2012. Big Data Analytics. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Sampling Based Range Partition for Big Data Analytics + Some Extras

Sampling Based Range Partition for Big Data Analytics

+ Some Extras

Milan VojnovićMicrosoft Research

Cambridge, United Kingdom

Joint work with Charalampos Tsourakakis, Bozidar Radunovic, Zhenming Liu, Fei Xu, Jingren Zhou

INQUEST Workshop, September 2012

Page 2: Sampling Based Range Partition for Big Data Analytics + Some Extras

2

Big Data Analytics

• Our goal: innovation in the area of algorithms for large scale computations to move the frontier of the computer science of big data

• Some figures of scale– Peta / Tera bytes of online services data processed daily– 200M tweets per day (Twitter)– 1B of content pieces shared per day (Facebook)– 8,000 Exabytes of global data by 2015 (The Economist)

Page 3: Sampling Based Range Partition for Big Data Analytics + Some Extras

3

Research Agenda

Machine learning OptimizationDatabase

queries

Distributed computing system

Page 4: Sampling Based Range Partition for Big Data Analytics + Some Extras

4

Outline

• Range Partitionwith Fei Xu and Jingren Zhou

• Count Trackingwith Zhenming Liu and Bozidar Radunovic

• Graph Partitioning (def. only)with Charalampos Tsourakakis and Bozidar Radunovic

Page 5: Sampling Based Range Partition for Big Data Analytics + Some Extras

5

Range Partition

• Special interest: balanced range partition

1 23241024

883 1201 23241024

812052 1 2324102412083 52

1-100 101-250 950-1024

. . .

. . .

(120,10)(120,5) (120,4)

1 2 k

1 2 m

Page 6: Sampling Based Range Partition for Big Data Analytics + Some Extras

6

Range Partition Requirements• Given and and desired relative

partition sizes

• -accurate range partition:

with probability at least

𝑄𝑖

𝑖

= number of data items assigned to range

Page 7: Sampling Based Range Partition for Big Data Analytics + Some Extras

7

Two Approaches

• Sampling based methods– Take a sample of data items– Compute partition boundaries using the sample

• Quantile summary methods– At each node compute a local quantile summary– Merge at the coordinator node

Page 8: Sampling Based Range Partition for Big Data Analytics + Some Extras

8

Related Work• Sampling based estimation of histograms

studied by Chaudhuri, Motwani and Narasayya (ACM SIGMOD 1998)

Required sample size:

• Communication cost to draw samples without replacement (Trithapura and Woodruff, 2011) :

For therwise:

Page 9: Sampling Based Range Partition for Big Data Analytics + Some Extras

9

Related Work (cont’d)

• Quantile summaries based approach (Greenwald and Khanna, 2001)

Communication cost =

• Pros– Deterministic guarantee

• Cons– It requires sorting of data items– Largest frequency of an item must be at most

Page 10: Sampling Based Range Partition for Big Data Analytics + Some Extras

10

Problem

• Range partition data while making one pass through data with minimal communication between the coordinator and sites

Page 11: Sampling Based Range Partition for Big Data Analytics + Some Extras

11

Sampling Based Method

1

2

k

coordinator

• Collect samples and partition using the samples

.

.

.• Pros

– simplicity, scalability• Cons

– how many samples to take from each site?

data size imbalance: number of data input records per machine may differ from one machine to another

Page 12: Sampling Based Range Partition for Big Data Analytics + Some Extras

12

Data Sizes Imbalance

Dataset Records Bytes Sites

DataSet-1 62M 150G 262

DataSet-2 37M 25G 80

DataSet-3 13M 0.26G 1

DataSet-4 7M 1.2T 301

DataSet-5 106M 7T 5652

Page 13: Sampling Based Range Partition for Big Data Analytics + Some Extras

13

Origins of Data Sizes Imbalance

• JOINSELECTFROM A INNER JOIN B ON A.KEY==B.KEYORDER BY COL

• Lookup TableIf the record value of column X is inthe lookup table, then return the row

• UNPIVOTInput: Col 1 Col 2

1 2, 32 3, 9, 8, 13…

Output: (1,2), (1,3), (2,3), (2,9), …

Page 14: Sampling Based Range Partition for Big Data Analytics + Some Extras

14

Weighted Sampling Scheme

• SAMPLE: Each site reports a random sample of t/k data items and the total number of items

• MERGE: Summary created by adding each data item from site for times

• PARTITION: Use the summary to determine partition boundaries

Note: the total number of data items reported by a site only once available – the site made one pass through local data

Page 15: Sampling Based Range Partition for Big Data Analytics + Some Extras

15

SAMPLE

(𝑛1 ,𝑆1)

(𝑛2 ,𝑆2)

(𝑛𝑘 ,𝑆𝑘)

1

2

k

coordinator

𝑆 𝑖={𝑎1𝑖 ,𝑎2

𝑖 ,…,𝑎𝑡𝑖 }

.

.

.

Page 16: Sampling Based Range Partition for Big Data Analytics + Some Extras

16

MERGE

coordinator

.

.

.

𝑛1 ,𝑆1={𝑎11 ,𝑎2

1 ,…,𝑎𝑡1}

𝑛2 ,𝑆2={𝑎12 ,𝑎2

2 ,… ,𝑎𝑡2 }

𝑛𝑘 ,𝑆𝑘={𝑎1𝑘 ,𝑎2

𝑘 ,… ,𝑎𝑡𝑘 }

𝑆={…,𝑎2𝑖 ,𝑎2

𝑖 ,…,𝑎2𝑖 ,…}

𝑛𝑖 ,𝑆 𝑖={𝑎1𝑖 ,𝑎2

𝑖 ,…,𝑎𝑡𝑖 }

.

.

.

replicas

Page 17: Sampling Based Range Partition for Big Data Analytics + Some Extras

17

PARTITION

coordinator

0

1

Range 1 2 3 4 5

Empirical CDF of data summary

Page 18: Sampling Based Range Partition for Big Data Analytics + Some Extras

18

Sufficient Sample Size

• Assume For sample size

-accurate range partition w. p.

• largest frequency of a data value

Page 19: Sampling Based Range Partition for Big Data Analytics + Some Extras

19

Constant Factor Imbalance

• Suppose that for some

• Then

Page 20: Sampling Based Range Partition for Big Data Analytics + Some Extras

20

Proof Outline

• Large deviation analysis of the error exponent:

𝜈 𝑗

Page 21: Sampling Based Range Partition for Big Data Analytics + Some Extras

21

Performance

• DataSet-1

• 100K data records per range,

Page 22: Sampling Based Range Partition for Big Data Analytics + Some Extras

22

Performance (cont’d)

𝑎=1 𝑎=2 𝑎=4

Page 23: Sampling Based Range Partition for Big Data Analytics + Some Extras

23

Summary for Range Partitioning

• Novel weighted sampling scheme

• Provable performance guarantees

• Simple and practical– Coder transfer to Cosmos

• More info: Sampling Based Range Partition Methods for Big Data Analytics, V., Xu, Zhou, MSR-TR-2012-18, Mar 2012

Page 24: Sampling Based Range Partition for Big Data Analytics + Some Extras

24

Outline

• Range Partitionwith Fei Xu and Jingren Zhou

• Count Trackingwith Zhenming Liu and Bozidar Radunovic

• Graph Partitioning (def. only)with Charalampos Tsourakakis and Bozidar Radunovic

Page 25: Sampling Based Range Partition for Big Data Analytics + Some Extras

25

SUM Tracking ProblemMaintain estimate

1 2 3 k

SUM:

:

𝑋 1𝑋 2𝑋 3 𝑋 4

𝑋 5

Page 26: Sampling Based Range Partition for Big Data Analytics + Some Extras

26

SUM Tracking

𝑆𝑡

𝑡

(1−𝜖)𝑆𝑡

(1+𝜖 )𝑆𝑡

Page 27: Sampling Based Range Partition for Big Data Analytics + Some Extras

27

Applications

• Ex 1: database queries

SELECT SUM(AdBids)from Ads

• Ex 2: iterative solving

input data

Page 28: Sampling Based Range Partition for Big Data Analytics + Some Extras

28

State of the Art

• Count tracking [Huang, Yi and Zhang, 2011]

– Worst-case input, monotonic sum– Expected total communication:

messages

• Lower bound for worst case input [Arackaparambil, Brody and Chakrabarti, 2009]

– Expected total communication messages

Page 29: Sampling Based Range Partition for Big Data Analytics + Some Extras

29

The Challenge • Q: What are communication cost efficient

algorithms for the sum tracking problem with random input streams? – Random permutation– Random i.i.d.– Fractional Brownian motion

Page 30: Sampling Based Range Partition for Big Data Analytics + Some Extras

30

Communication Complexity Bounds

• Lower bound:

• Upper bound:

Sublinear, “price of non-monotonicity”:

Page 31: Sampling Based Range Partition for Big Data Analytics + Some Extras

31

Communication Complexity BoundsUnknown Drift Case

• Input: i.i.d. Bernoulli : unknown drift parameter

Expected total communication:

messages

• Generalizes monotonic case to constant drift case

Page 32: Sampling Based Range Partition for Big Data Analytics + Some Extras

32

Our Tracker Algorithm• Each site reports to the coordinator upon receiving a

value update with probability

• Sync all whenever the coordinator receives an update from a site

XiMi = 1Sk

S1S

S site

site

coordinator

S, Sk

S, S1S = S1 + … + Sk

Page 33: Sampling Based Range Partition for Big Data Analytics + Some Extras

33

Two Applications

• Second Frequency Moment

• Bayesian Linear Regression

Page 34: Sampling Based Range Partition for Big Data Analytics + Some Extras

34

App 1: Second Frequency Moment

• Input:

• Counter of value :

• Second frequency moment:

• Goal: track within relative accuracy

Page 35: Sampling Based Range Partition for Big Data Analytics + Some Extras

35

𝑆𝑡𝑖=1𝑠1∑𝑗

𝑆𝑡𝑖 , 𝑗

AMS Sketch

𝑆𝑡1

𝑆𝑡𝑠2

⋯ ⋯

𝑖

11

𝑠2

𝑠1𝑗

𝑆𝑡𝑖 , 𝑗

⋯⋯

𝑆𝑡=median(𝑆𝑡𝑖 )

{0,1} valued hash

• For and ,

within w. p.

Page 36: Sampling Based Range Partition for Big Data Analytics + Some Extras

36

App 1: Second Frequency Moment (cont’d)

• Sum tracking:

• Expected total communication:

Page 37: Sampling Based Range Partition for Big Data Analytics + Some Extras

37

App 2: Bayesian Linear Regression

• Feature vector , output • • Prior osterior

𝑥𝑡

𝑦 𝑡

Page 38: Sampling Based Range Partition for Big Data Analytics + Some Extras

38

App 2: Bayesian Linear Regression (cont’d)

• Posterior mean and precision:

• Sum tracking:

• Under random permutation input, the expected communication cost =

Page 39: Sampling Based Range Partition for Big Data Analytics + Some Extras

39

Summary for Sum Tracking

• Studied the sum tracking problem with non-monotonic distributed streams under random permutation, random i. i. d. and fractional Brownian motion

• Proposed a novel algorithm with nearly optimal communication complexity

• Details: ACM PODS 2012

Page 40: Sampling Based Range Partition for Big Data Analytics + Some Extras

40

Outline

• Range Partitionwith Fei Xu and Jingren Zhou

• Count Trackingwith Zhenming Liu and Bozidar Radunovic

• Graph Partitioning (def. only)with Charalampos Tsourakakis and Bozidar Radunovic

Page 41: Sampling Based Range Partition for Big Data Analytics + Some Extras

41

• Partition a graph with two objectives– Sparsely connected

components– Balanced number of

vertices per component

• Applications– Parallel processing– Community detection

Problem

Page 42: Sampling Based Range Partition for Big Data Analytics + Some Extras

42

Problem (cont’d)

1 2 3 k

• Requirements– Streaming algorithm– Single pass /

incremental– Efficient computing

• Desired– Approximation

guarantees– Average-case efficient

Page 43: Sampling Based Range Partition for Big Data Analytics + Some Extras

43

Summary for Graph Partitioning

• Designed a streaming algorithm whose average-case performance appears superior to any of previously proposed online heuristics

• Provable approximation guarantees

• More details available soon