handing data skew in mapreduce · 2013. 9. 12. · technische universitat munc¨ hen¨ summary:...

46
Technische Universit¨ at M ¨ unchen Handing Data Skew in MapReduce B. Gufler , N. Augsten , A. Reiser , A. Kemper Technische Universit¨ at M ¨ unchen Free University of Bolzano-Bozen May 8, 2011

Upload: others

Post on 15-Oct-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Handing Data Skew in MapReduce

B. Gufler♠, N. Augsten♣, A. Reiser♠, A. Kemper♠

♠Technische Universitat Munchen ♣Free University of Bolzano-Bozen

May 8, 2011

Page 2: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Cloud Data Processing with MapReduce

MapReduce for BusinessI usage statistics, index building, . . .I simple reducer tasks

MapReduce for eScience: new challenges

1. heavy data skew

2. complex reducer tasks

; How to deal with these problems?

1

Page 3: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Cloud Data Processing with MapReduce

MapReduce for BusinessI usage statistics, index building, . . .I simple reducer tasks

MapReduce for eScience: new challenges

1. heavy data skew

2. complex reducer tasks

; How to deal with these problems?

1

Page 4: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Cloud Data Processing with MapReduce

MapReduce for BusinessI usage statistics, index building, . . .I simple reducer tasks

MapReduce for eScience: new challenges

1. heavy data skew

2. complex reducer tasks

; How to deal with these problems?

1

Page 5: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Cloud Data Processing with MapReduce

MapReduce for BusinessI usage statistics, index building, . . .I simple reducer tasks

MapReduce for eScience: new challenges

1. heavy data skew

2. complex reducer tasks

; How to deal with these problems?

1

Page 6: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Data Skew

I values of an attribute not evenly distributedI very common in scientific dataI causes load imbalance if attribute is used for partitioning

2

Page 7: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Example: Node Masses in Millennium Simulation

0

100

200

300

400nodes (∗1M)

node mass1–

50

434M

51–5

00

295M

501–

5k

29M

5k–5

0k

2.5M

50k–

500k

143k

500k

–5M

1.8k

3

Page 8: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Complex Reducer Tasks

I complex scientific analysis algorithmsI often polynomial or even exponential complexity

I amplifies load imbalance

Map

Map

Map

Reduce

Reduce

6 · 22 = 24 42 + 82 = 80

4

Page 9: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Complex Reducer Tasks

I complex scientific analysis algorithmsI often polynomial or even exponential complexity

I amplifies load imbalance

Map

Map

Map

Reduce

Reduce

6 · 22 = 24

42 + 82 = 80

4

Page 10: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Complex Reducer Tasks

I complex scientific analysis algorithmsI often polynomial or even exponential complexity

I amplifies load imbalance

Map

Map

Map

Reduce

Reduce

6 · 22 = 24 42 + 82 = 80

4

Page 11: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Example: Frequent Subtree Mining

I find common substructures in(large parts of) a forest

I exponential complexity forunordered subtrees

aT1

b c

d a

dT2

a

c

a

bT3

a

d

c

d

5

Page 12: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Example: Frequent Subtree Mining

I find common substructures in(large parts of) a forest

I exponential complexity forunordered subtrees

aT1

b c

d a

dT2

a

c

a

bT3

a

d

c

d

5

Page 13: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Example: Frequent Subtree Mining

I find common substructures in(large parts of) a forest

I exponential complexity forunordered subtrees

aT1

b c

d a

dT2

a

c

a

bT3

a

d

c

d

5

Page 14: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Example: Frequent Subtree Mining

I find common substructures in(large parts of) a forest

I exponential complexity forunordered subtrees

aT1

b c

d a

dT2

a

c

a

bT3

a

d

c

d

5

Page 15: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Summary: MapReduce for eScience

ChallengesI load imbalance due to data skewI complex analysis amplifies execution time differencesI example: frequent subtree mining (PathJoin algorithm)

I 16 GB data set (subset of the Millennium simulation)I cluster of 16 nodesI total execution time: 8 hoursI execution time difference of reducers: up to 6 hours

6

Page 16: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Agenda

I MotivationI Load Balancing in MapReduce: Big PictureI Estimation of Partition CostI Assignment of Partitions to ReducersI Number of PartitionsI Evaluation

7

Page 17: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Data Partitioning in MapReduce

Map

cluster: all itemswith the samekey

partition: all keyswith the samehash values

hash function fordata partitioning

8

Page 18: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Load Balancing: Current MapReduceP

0P

1

P0

P1

P0

P1

Map

Red

uce

P0 P1 cont

rolle

r

I hash partitioning on each mapperI static 1:1 assignment of partitions to

reducersI balances number of keys per

reducer instead of workloadI no flexibility

9

Page 19: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Improved Load Balancing

I create more partitions than there are reducersI cost-based assignment of partitions to reducers

P0

P1

P2

P0

P1

P2

P0

P1

P2

Map

Red

uce

P0+P2 P1 cont

rolle

r

I flexibility for load balancingI challenges

1. compute number ofpartitions

2. estimate partition cost3. assign partitions to

reducers

10

Page 20: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Estimate Partition Cost

partition cost depends on1. number and size of the clusters within the partition

I locally monitored on each mapperI centrally aggregated on one controller

2. complexity of the reducer algorithmI specified by the user

; estimate cluster cost based on this information

; sum up cluster costs to obtain partition cost

11

Page 21: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Estimate Partition Cost: Monitoring

I tuple countI monitor local count on every mapperI sum up on controller

I cluster countI create Bloom filter on every mapperI employ Linear Counting on controller

12

Page 22: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Estimate Partition Cost: MonitoringM

appe

r1

tuples: 17

clusters: [10011]

P0

Map

per2

tuples: 31

clusters: [01011]

P0

13

Page 23: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Estimate Partition Cost: MonitoringM

appe

r1

tuples: 17

clusters: [10011]

P0

Map

per2

tuples: 31

clusters: [01011]

P0

Agg

rega

te

13

Page 24: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Estimate Partition Cost: MonitoringM

appe

r1

tuples: 17

clusters: [10011]

P0

Map

per2

tuples: 31

clusters: [01011]

P0

Agg

rega

te

13

Page 25: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Estimate Partition Cost: MonitoringM

appe

r1

tuples: 17

clusters: [10011]

P0

Map

per2

tuples: 31

clusters: [01011]

P0

Agg

rega

tetuples: 48∑

13

Page 26: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Estimate Partition Cost: MonitoringM

appe

r1

tuples: 17

clusters: [10011]

P0

Map

per2

tuples: 31

clusters: [01011]

P0

Agg

rega

tetuples: 48∑|

13

Page 27: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Estimate Partition Cost: MonitoringM

appe

r1

tuples: 17

clusters: [10011]

P0

Map

per2

tuples: 31

clusters: [01011]

P0

Agg

rega

tetuples: 48∑| lc

K. Whang, B. Vander-Zanden, H. Taylor: A Linear-Time Probabilistic Counting Algorithm for Databa-se Applications, ACM TODS 15(2), 1990

13

Page 28: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Estimate Partition Cost: MonitoringM

appe

r1

tuples: 17

clusters: [10011]

P0

Map

per2

tuples: 31

clusters: [01011]

P0

Agg

rega

tetuples: 48

clusters: 8

∑| lc

K. Whang, B. Vander-Zanden, H. Taylor: A Linear-Time Probabilistic Counting Algorithm for Databa-se Applications, ACM TODS 15(2), 1990

13

Page 29: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Estimate Partition Cost: Calculation

Agg

rega

te tuples: 48

clusters: 8

reducer complexity: quadratic(e. g., pairwise comparison)

I average cluster size: 488 = 6

I cluster cost: 62 = 36I partition cost: 8 · 36 = 288

14

Page 30: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Estimate Partition Cost: Calculation

Agg

rega

te tuples: 48

clusters: 8

reducer complexity: quadratic(e. g., pairwise comparison)

I average cluster size: 488 = 6

I cluster cost: 62 = 36I partition cost: 8 · 36 = 288

14

Page 31: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Estimate Partition Cost: Calculation

Agg

rega

te tuples: 48

clusters: 8

reducer complexity: quadratic(e. g., pairwise comparison)

I average cluster size: 488 = 6

I cluster cost: 62 = 36

I partition cost: 8 · 36 = 288

14

Page 32: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Estimate Partition Cost: Calculation

Agg

rega

te tuples: 48

clusters: 8

reducer complexity: quadratic(e. g., pairwise comparison)

I average cluster size: 488 = 6

I cluster cost: 62 = 36I partition cost: 8 · 36 = 288

14

Page 33: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Assign Partitions to Reducers

I assign partitions to reducers balancing the partition costI bin packing problem, but: NP-hardI rely on a heuristic instead

50 40 20 15 3

15

Page 34: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Assign Partitions to Reducers

I assign partitions to reducers balancing the partition costI bin packing problem, but: NP-hardI rely on a heuristic instead

50 40 20 15 3

reducer 1 reducer 20 0

15

Page 35: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Assign Partitions to Reducers

I assign partitions to reducers balancing the partition costI bin packing problem, but: NP-hardI rely on a heuristic instead

50 40 20 15 3

reducer 1 reducer 2050

15

Page 36: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Assign Partitions to Reducers

I assign partitions to reducers balancing the partition costI bin packing problem, but: NP-hardI rely on a heuristic instead

50 40 20 15 3

reducer 1 reducer 250 40

15

Page 37: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Assign Partitions to Reducers

I assign partitions to reducers balancing the partition costI bin packing problem, but: NP-hardI rely on a heuristic instead

50 40 20 15 3

reducer 1 reducer 250 60

15

Page 38: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Assign Partitions to Reducers

I assign partitions to reducers balancing the partition costI bin packing problem, but: NP-hardI rely on a heuristic instead

50 40 20 15 3

reducer 1 reducer 26065

15

Page 39: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Assign Partitions to Reducers

I assign partitions to reducers balancing the partition costI bin packing problem, but: NP-hardI rely on a heuristic instead

50 40 20 15 3

reducer 1 reducer 265 63

15

Page 40: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

How many Partitions?

I static approachI number of partitions fixed beforehand by controllerI matching partitions created on every mapper

I dynamic approachI only initial number of partitions fixedI mappers may split some partitions further

16

Page 41: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

How many Partitions?

I static approachI number of partitions fixed beforehand by controllerI matching partitions created on every mapper

I dynamic approachI only initial number of partitions fixedI mappers may split some partitions further

16

Page 42: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Evaluation

I load balancing qualityI influence on execution time

Synthetic Data SetsI 520M tuplesI 2 000 clustersI Zipf distributions

Millennium Data SetI 760M tuplesI ≈ 32 000 clusters

17

Page 43: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Evaluation: Load Balancing

I measure standard deviation of partition cost per reducerI small is good: small differences between reducers

01020304050

10 20 25 50

stdev [% of mean]

reducers

Synthetic Data, Moderate Skew

01020304050

10 20 25 50

stdev [% of mean]

reducers

Millennium Data

Standard MapReduce Fine Partitioning

18

Page 44: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Evaluation: Execution Time

I measure execution time per reducerI max time: overall time of reduce job

05

101520

10 20 25 50

Synthetic Execution Time

reducers

Synthetic Data, Moderate Skew

010203040

10 20 25 50

Synthetic Execution Time

reducers

Millennium Data

Standard MapReduce Fine PartitioningTime for Most Expensive Cluster

19

Page 45: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Summary

I scientific data analysis with MapReduceI partition-based load balancingI distributed monitoring for cost estimationI experimental evaluation

I better load balancingI reduced overall execution time ∑

| lc

20

Page 46: Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary: MapReduce for eScience Challenges I load imbalance due to data skew I complex analysis

Technische Universitat Munchen

Thank you!

21