Technische Universität München
Handling Data Skew in MapReduce
B. Gufler♠, N. Augsten♣, A. Reiser♠, A. Kemper♠
♠Technische Universität München ♣Free University of Bolzano-Bozen
May 8, 2011
Cloud Data Processing with MapReduce
MapReduce for Business
- usage statistics, index building, ...
- simple reducer tasks
MapReduce for eScience: new challenges
1. heavy data skew
2. complex reducer tasks
→ How to deal with these problems?
Data Skew
- values of an attribute are not evenly distributed
- very common in scientific data
- causes load imbalance if the attribute is used for partitioning
Example: Node Masses in Millennium Simulation
[Figure: histogram of the number of nodes (in millions, axis 0 to 400) per node mass range]
node mass    nodes
1–50         434M
51–500       295M
501–5k       29M
5k–50k       2.5M
50k–500k     143k
500k–5M      1.8k
Complex Reducer Tasks
- complex scientific analysis algorithms
- often polynomial or even exponential complexity
- amplifies load imbalance
[Figure: three mappers feeding two reducers; reducer costs 6 · 2² = 24 and 4² + 8² = 80]
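The two reducer costs in the figure can be reproduced with a few lines of Python. This is a sketch under two assumptions the slide does not spell out: the reducer algorithm is quadratic in the cluster size, and the first reducer receives six clusters of two tuples each while the second receives one cluster of four and one cluster of eight tuples. The name `reducer_cost` is made up for illustration.

```python
def reducer_cost(cluster_sizes, exponent=2):
    """Sum of per-cluster costs, assuming cost = size ** exponent."""
    return sum(size ** exponent for size in cluster_sizes)


# Assumed reading of the figure: both reducers receive 12 tuples in total.
reducer_a = [2, 2, 2, 2, 2, 2]  # six small clusters
reducer_b = [4, 8]              # two larger clusters

print("tuples:", sum(reducer_a), sum(reducer_b))
print("costs: ", reducer_cost(reducer_a), reducer_cost(reducer_b))
```

Under this reading, equal tuple counts still lead to costs of 24 and 80, which is the amplification the slide refers to.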
Example: Frequent Subtree Mining
- find common substructures in (large parts of) a forest
- exponential complexity for unordered subtrees
[Figure: three example trees T1, T2, and T3 with node labels a, b, c, d]
Summary: MapReduce for eScience
Challenges
- load imbalance due to data skew
- complex analysis amplifies execution time differences
- example: frequent subtree mining (PathJoin algorithm)
  - 16 GB data set (subset of the Millennium simulation)
  - cluster of 16 nodes
  - total execution time: 8 hours
  - execution time difference of reducers: up to 6 hours
Agenda
- Motivation
- Load Balancing in MapReduce: Big Picture
- Estimation of Partition Cost
- Assignment of Partitions to Reducers
- Number of Partitions
- Evaluation
Data Partitioning in MapReduce
[Figure: a mapper distributing its output into partitions]
- hash function for data partitioning
- cluster: all items with the same key
- partition: all keys with the same hash value
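The relationship between keys, clusters, and partitions can be sketched as follows. The code is illustrative only: the use of CRC32 as the hash function and the names `partition_of` and `partition_map_output` are assumptions, not something the slides prescribe.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    """Map a key to a partition using a deterministic hash (CRC32, an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_map_output(pairs, num_partitions):
    """Group (key, value) pairs into partitions, and into clusters inside each partition."""
    partitions = defaultdict(lambda: defaultdict(list))
    for key, value in pairs:
        # All values with the same key form one cluster; the key's hash picks the partition.
        partitions[partition_of(key, num_partitions)][key].append(value)
    return partitions


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("a", 5)]
partitions = partition_map_output(pairs, num_partitions=2)
for pid in sorted(partitions):
    print(pid, {key: len(values) for key, values in partitions[pid].items()})
```

Because the partition depends only on the key, a cluster is never split across partitions, so a single very large cluster makes its whole partition expensive.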
Load Balancing: Current MapReduce
[Figure: each mapper produces partitions P0 and P1; the controller assigns P0 and P1 to one reducer each]
- hash partitioning on each mapper
- static 1:1 assignment of partitions to reducers
- balances number of keys per reducer instead of workload
- no flexibility
Improved Load Balancing
- create more partitions than there are reducers
- cost-based assignment of partitions to reducers
[Figure: each mapper produces partitions P0, P1, and P2; the controller assigns P0+P2 to one reducer and P1 to the other]
- flexibility for load balancing
- challenges
  1. compute number of partitions
  2. estimate partition cost
  3. assign partitions to reducers
Estimate Partition Cost
partition cost depends on
1. number and size of the clusters within the partition
   - locally monitored on each mapper
   - centrally aggregated on one controller
2. complexity of the reducer algorithm
   - specified by the user
→ estimate cluster cost based on this information
→ sum up cluster costs to obtain partition cost
Estimate Partition Cost: Monitoring
- tuple count
  - monitor local count on every mapper
  - sum up on controller
- cluster count
  - create Bloom filter on every mapper
  - employ Linear Counting on controller
Estimate Partition Cost: Monitoring
Mapper 1, partition P0: tuples: 17, clusters: [10011]
Mapper 2, partition P0: tuples: 31, clusters: [01011]
Aggregate: tuples: 48 (sum, ∑), clusters: 8 (bitwise OR, |, followed by Linear Counting, lc)
K. Whang, B. Vander-Zanden, H. Taylor: A Linear-Time Probabilistic Counting Algorithm for Database Applications, ACM TODS 15(2), 1990
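The aggregation step can be sketched in Python using the numbers from the slide. The sketch assumes that each mapper sets one bit per key in a bit vector of length m, that the controller combines the vectors with a bitwise OR, and that it then applies the Linear Counting estimate n ≈ -m · ln(z / m), where z is the number of zero bits. The function names are hypothetical.

```python
import math


def merge_bit_vectors(vectors):
    """Bitwise OR of equally long bit vectors (lists of 0/1)."""
    merged = [0] * len(vectors[0])
    for vector in vectors:
        for i, bit in enumerate(vector):
            merged[i] |= bit
    return merged


def linear_counting(bits):
    """Linear Counting estimate of the number of distinct keys: -m * ln(zeros / m)."""
    m = len(bits)
    zeros = bits.count(0)
    if zeros == 0:
        raise ValueError("bit vector is saturated; the estimate is undefined")
    return -m * math.log(zeros / m)


# Per-mapper statistics for partition P0, taken from the slide.
mapper_1 = {"tuples": 17, "clusters": [1, 0, 0, 1, 1]}
mapper_2 = {"tuples": 31, "clusters": [0, 1, 0, 1, 1]}

total_tuples = mapper_1["tuples"] + mapper_2["tuples"]
merged = merge_bit_vectors([mapper_1["clusters"], mapper_2["clusters"]])
estimated_clusters = linear_counting(merged)
print(total_tuples, merged, round(estimated_clusters))
```

With one zero bit left out of five, the estimate is about 8.05, which rounds to the 8 clusters shown on the slide. A five-bit vector is only a toy size; realistic vectors would be far longer.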
Estimate Partition Cost: Calculation
Aggregate: tuples: 48, clusters: 8
reducer complexity: quadratic (e.g., pairwise comparison)
- average cluster size: 48/8 = 6
- cluster cost: 6² = 36
- partition cost: 8 · 36 = 288
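The calculation on this slide can be written as a small function. This is a sketch, not the authors' implementation: the name `estimate_partition_cost` and the idea of passing the reducer complexity as a Python callable are assumptions made for illustration.

```python
def estimate_partition_cost(tuples, clusters, complexity=lambda n: n ** 2):
    """Estimate partition cost from aggregate counts.

    Assumes every cluster has the average size; `complexity` maps a
    cluster size to a cost and defaults to quadratic.
    """
    average_cluster_size = tuples / clusters
    cluster_cost = complexity(average_cluster_size)
    return clusters * cluster_cost


cost = estimate_partition_cost(tuples=48, clusters=8)
print(cost)
```

For the slide's values this yields 8 · 6² = 288. Using the average cluster size keeps the monitoring data small, at the price of underestimating partitions whose cluster sizes vary strongly.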
Assign Partitions to Reducers
- assign partitions to reducers balancing the partition cost
- bin packing problem, but: NP-hard
- rely on a heuristic instead
partition costs: 50 40 20 15 3
step    reducer 1    reducer 2
0       0            0
1       50           0
2       50           40
3       50           60
4       65           60
5       65           63
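The slides do not name the heuristic, but the reducer loads they show are consistent with a greedy strategy: take the partitions in order of decreasing cost and give each one to the reducer with the smallest load so far. The following sketch implements that assumed strategy; `assign_partitions` is a hypothetical name.

```python
import heapq


def assign_partitions(costs, num_reducers):
    """Greedy assignment: most expensive partition first, to the least loaded reducer."""
    heap = [(0, reducer) for reducer in range(num_reducers)]  # (load, reducer index)
    assignment = [[] for _ in range(num_reducers)]
    for cost in sorted(costs, reverse=True):
        load, reducer = heapq.heappop(heap)  # reducer with the smallest current load
        assignment[reducer].append(cost)
        heapq.heappush(heap, (load + cost, reducer))
    return assignment


assignment = assign_partitions([50, 40, 20, 15, 3], num_reducers=2)
loads = [sum(partition_costs) for partition_costs in assignment]
print(assignment, loads)
```

For the partition costs on the slide, the sketch ends with loads of 65 and 63, matching the final state shown there.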
How many Partitions?
- static approach
  - number of partitions fixed beforehand by controller
  - matching partitions created on every mapper
- dynamic approach
  - only initial number of partitions fixed
  - mappers may split some partitions further
Evaluation
- load balancing quality
- influence on execution time
Synthetic Data Sets
- 520M tuples
- 2 000 clusters
- Zipf distributions
Millennium Data Set
- 760M tuples
- ≈ 32 000 clusters
Evaluation: Load Balancing
- measure standard deviation of partition cost per reducer
- small is good: small differences between reducers
[Figure: two bar charts, "Synthetic Data, Moderate Skew" and "Millennium Data"; x-axis: reducers (10, 20, 25, 50); y-axis: stdev [% of mean] (0 to 50); series: Standard MapReduce, Fine Partitioning]
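The metric on the y-axis, standard deviation as a percentage of the mean, can be computed as below. The slide does not say whether the population or the sample standard deviation was used; the sketch assumes the population variant. The inputs 65/63 and 24/80 are borrowed from earlier slides purely as sample values and are not evaluation results.

```python
import statistics


def stdev_percent_of_mean(reducer_costs):
    """Population standard deviation of the per-reducer cost, as a percentage of the mean."""
    mean = statistics.mean(reducer_costs)
    return 100 * statistics.pstdev(reducer_costs) / mean


balanced = stdev_percent_of_mean([65, 63])  # loads after cost-based assignment
skewed = stdev_percent_of_mean([24, 80])    # reducer costs from the skew example
print(f"{balanced:.2f}% vs {skewed:.2f}%")
```

A lower percentage means the reducers carry more similar loads, which is what "small is good" refers to.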
Evaluation: Execution Time
- measure execution time per reducer
- max time: overall time of reduce job
[Figure: two bar charts, "Synthetic Data, Moderate Skew" (y-axis 0 to 20) and "Millennium Data" (y-axis 0 to 40); x-axis: reducers (10, 20, 25, 50); y-axis: Synthetic Execution Time; series: Standard MapReduce, Fine Partitioning, Time for Most Expensive Cluster]
Summary
- scientific data analysis with MapReduce
- partition-based load balancing
- distributed monitoring for cost estimation
- experimental evaluation
  - better load balancing
  - reduced overall execution time
Thank you!