Lecture 14: Combating Outliers in MapReduce Clusters
Xiaowei Yang
• References:
– Reining in the Outliers in Map-Reduce Clusters Using Mantri, by Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris
– http://research.microsoft.com/en-us/UM/people/srikanth/data/Combating%20Outliers%20in%20Map-Reduce.web.pptx
[Figure: log(dataset size), GB (10^9) through EB (10^18), vs. log(cluster size), 1 through 10^5 machines; regions labeled "HPC, || databases" and "MapReduce".]
MapReduce
• Decouples customized data operations from the mechanisms used to scale
• Is widely used
– Cosmos (based on SVC's Dryad) + Scope @ Bing
– MapReduce @ Google
– Hadoop inside Yahoo! and on Amazon's cloud (AWS)
• Operates on very large datasets, e.g., the Internet, click logs, bio/genomic data
An Example
• Goal: find frequent search queries to Bing
• What the user says:

    SELECT Query, COUNT(*) AS Freq
    FROM QueryTable
    HAVING Freq > X

How it works:
[Figure: a job manager assigns work to tasks and tracks their progress; map tasks read file blocks, write intermediate output locally, and reduce tasks produce the output blocks.]
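To make the map/reduce split concrete, here is a toy, single-process sketch of how the query-frequency job above decomposes into a map function, a shuffle, and a reduce function. The names (map_fn, reduce_fn, run_job) are illustrative, not Scope or Dryad APIs.

```python
from collections import defaultdict

def map_fn(log_line):
    # Emit (query, 1) for each query in the input block.
    query = log_line.strip()
    return [(query, 1)]

def reduce_fn(query, counts):
    # Sum the counts for one query.
    return (query, sum(counts))

def run_job(log_lines, threshold):
    # "Shuffle": group map output by key, as the framework would.
    groups = defaultdict(list)
    for line in log_lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce each group, then apply HAVING Freq > threshold.
    results = [reduce_fn(q, c) for q, c in groups.items()]
    return {q: f for q, f in results if f > threshold}

print(run_job(["cats", "dogs", "cats", "cats"], 1))  # {'cats': 3}
```

In the real system the map tasks run on different machines against different file blocks, and the shuffle moves data over the network, which is where outliers arise.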
Outliers slow down map-reduce jobs
[Figure: a job's workflow through the file system — Map.Read, Map.Move, Map, Barrier, Reduce — annotated with phase counts (Map.Read 22K, Map.Move 15K, Map 13K, Reduce 51K).]
Goals
• Speeding up jobs improves productivity
• Predictability supports SLAs
• … while using resources efficiently
What is an outlier?
• A phase (map or reduce) has n tasks and s slots (available compute resources)
• If every task takes T seconds to run, a naïve scheduler's ideal run time is ceil(n/s) * T
• In practice, task i's run time t_i = f(datasize, code, machine, network)
• The goal is to finish close to the lower bound

    max( (1/s) * Σ_{i=1..n} t_i ,  max_i t_i )
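A small numeric sketch of these two quantities, assuming the simple model above (uniform tasks give the ceiling bound; one straggler dominates the lower bound):

```python
import math

def ideal_runtime(n, s, T):
    # All n tasks take T seconds; s slots run them in ceil(n/s) waves.
    return math.ceil(n / s) * T

def lower_bound(task_times, s):
    # No schedule can beat the work bound (total work / slots)
    # or the longest single task.
    return max(sum(task_times) / s, max(task_times))

# 8 uniform 10-second tasks on 4 slots: two waves, 20 seconds.
print(ideal_runtime(8, 4, 10))            # 20
# Same phase, but one 50-second straggler: the phase cannot finish
# before 50 seconds no matter how the other tasks are packed.
print(lower_bound([10] * 7 + [50], 4))    # 50.0
```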
From a phase to a job
• A job may have many phases
• An outlier in an early phase has a cumulative effect
• Data loss may cause multi-phase recomputes; the delay due to a recompute readily cascades
Why outliers?
[Figure: a map → sort → reduce pipeline in which the recompute of a map task delays the downstream sort and reduce phases.]
• Problem: due to unavailable input, tasks have to be recomputed
Previous work
• The original MapReduce paper observed the problem
– But didn't deal with it in depth
• Its solution was to duplicate the slow tasks
• Drawbacks
– Some duplicates may be unnecessary
– Duplicates use extra resources
– Placement may be the real problem, in which case duplication does not help
Quantifying the Outlier Problem
• Approach: understand the problem first before proposing solutions
– Understanding often leads to solutions
1. Prevalence of outliers
2. Causes of outliers
3. Impact of outliers

• Stragglers = tasks that take more than 1.5 times the median task in that phase
• Recomputes = tasks that are re-run because their output was lost

• 50% of phases have 10% stragglers and no recomputes
• 10% of the stragglers take >10x longer
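The straggler definition above is directly computable from a phase's task durations. A minimal sketch (the 1.5x-median threshold is from the paper; the data is made up):

```python
from statistics import median

def stragglers(durations, factor=1.5):
    # A task is a straggler if it takes more than `factor` times
    # the median task duration in its phase.
    m = median(durations)
    return [i for i, d in enumerate(durations) if d > factor * m]

# Six ordinary tasks and one task that takes 4x the median.
phase = [10, 11, 9, 10, 12, 40, 10]
print(stragglers(phase))  # [5]
```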
Why bother? Frequency of outliers
[Figure: per-phase task timelines; straggler tasks extend well past the rest, and a single outlier stretches the phase's completion time.]
Causes of outliers: data skew
• In 40% of the phases, all the tasks with high runtimes (>1.5x the median task) correspond to large amounts of data over the network
• Duplicating such tasks will not help: they are slow because of their input, not their machine
• Non-outliers can be improved as well: 20% of them take 55% longer than the median
[Figure: reduce tasks pulling map output across racks see uneven network load.]
• Uneven placement is typical in production: reduce tasks are placed at the first available slot
• Problem: tasks reading input over the network experience variable congestion
Causes of outliers: cross-rack traffic
• 70% of cross-rack traffic is reduce traffic
• Tasks in a spot with a slow network run slower
• Tasks compete for the network among themselves
– A reduce task reads from every map task
– A reduce task is put into any spare slot
• In 50% of phases, the reduce phase takes 62% longer to finish than under ideal placement
Causes of outliers: bad and busy machines
• 50% of recomputes happen on 5% of the machines
• Recomputes increase resource usage
• Outliers cluster by time
– Resource contention might be the cause
• Recomputes cluster by machine
– Data loss may cause multiple recomputes
Why bother? Cost of outliers
(What-if analysis: replay logs in a trace-driven simulator)
• At the median, jobs are slowed down by 35% due to outliers
Mantri Design
High-level idea
• Cause-aware and resource-aware
• Runtime = f(input, network, machine, dataToProcess, …)
• Fix each cause with a different strategy
Resource-aware restarts
• Duplicate or kill long-running outliers
When to restart
• Every ∆ seconds, tasks report progress
• Estimate t_rem (remaining time of the running copy) and t_new (predicted run time of a fresh copy)
• γ = 3 (threshold parameter used in the restart test)
• Schedule a duplicate only if the total running time is expected to shrink: with c copies running, duplicate when P(c · t_rem > (c+1) · t_new) > δ
• When there are available slots, restart if the expected reduction in time exceeds the restart cost: E(t_rem − t_new) > ρ · ∆
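The two restart tests can be sketched as follows. This is a hedged illustration, not Mantri's implementation: the distribution of t_new is approximated here by a sample of past task times from the phase, and δ and ρ are just policy knobs.

```python
def should_duplicate(t_rem, t_new_samples, copies, delta=0.25):
    # With c copies running, launching one more pays off only if
    # c * t_rem > (c + 1) * t_new with probability > delta.
    hits = sum(1 for t_new in t_new_samples
               if copies * t_rem > (copies + 1) * t_new)
    return hits / len(t_new_samples) > delta

def should_restart(t_rem, t_new_samples, rho, report_interval):
    # With spare slots, kill-and-restart if the expected time saved,
    # E(t_rem - t_new), exceeds rho reporting intervals.
    expected_t_new = sum(t_new_samples) / len(t_new_samples)
    return (t_rem - expected_t_new) > rho * report_interval

samples = [30.0, 35.0, 40.0]   # made-up run times of peer tasks
print(should_duplicate(t_rem=120.0, t_new_samples=samples, copies=1))  # True
print(should_restart(120.0, samples, rho=2, report_interval=10))       # True
```

A task far behind its peers (120 s remaining vs. ~35 s for a fresh copy) passes both tests; a near-finished task would fail them and be left alone.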
Network-Aware Placement
• Compute the rack location for each task
• Find the placement that minimizes the maximum data-transfer time
• If rack i has d_i map output and bandwidths u_i, v_i available on its uplink and downlink, place an a_i fraction of the reduce tasks on rack i so that the maximum transfer time across racks is minimized
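A sketch of that objective for two racks. The transfer-time model is an assumption consistent with the variables above: rack i must upload the map output other racks need, d_i · (1 − a_i), over uplink u_i, and download the rest of the data, (D − d_i) · a_i, over downlink v_i. The grid search is only for illustration; the real optimizer is not specified here.

```python
def transfer_time(a, d, u, v):
    # Worst per-rack transfer time for a placement a (fractions per rack).
    D = sum(d)
    times = []
    for i in range(len(d)):
        up = d[i] * (1 - a[i]) / u[i]        # output shipped off rack i
        down = (D - d[i]) * a[i] / v[i]      # input pulled onto rack i
        times.append(max(up, down))
    return max(times)

def best_split_two_racks(d, u, v, steps=1000):
    # Place fraction a of the reduces on rack 0 and 1 - a on rack 1;
    # pick the a minimizing the maximum per-rack transfer time.
    return min((k / steps for k in range(steps + 1)),
               key=lambda a: transfer_time([a, 1 - a], d, u, v))

# Rack 0 holds 80% of the map output; equal link bandwidths.
a0 = best_split_two_racks(d=[80, 20], u=[1, 1], v=[1, 1])
print(round(a0, 2))  # 0.8 — more reduces land on the rack holding more data
```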
Avoid recomputation
• Replicate task output
– Restart a task early if its input data are lost
– Replicate the most costly output, i.e., the output that is most expensive to recompute
Data-aware task ordering
• Outliers are often due to large inputs
• Schedule tasks in descending order of dataToProcess
• This is at most 33% worse than optimal scheduling
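The 33% figure is the classic longest-processing-time bound (4/3 of optimal). A sketch of the greedy rule, assuming task run time is proportional to dataToProcess and each task goes to the earliest-free slot:

```python
import heapq

def makespan(task_sizes, slots, descending=True):
    # Greedy list scheduling: assign each task, in the given order,
    # to the slot that frees up earliest; return the phase finish time.
    order = sorted(task_sizes, reverse=descending)
    finish = [0.0] * slots          # per-slot finish times (min-heap)
    heapq.heapify(finish)
    for t in order:
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + t)
    return max(finish)

sizes = [7, 7, 6, 6, 5, 5, 4, 4, 4]   # made-up dataToProcess values
print(makespan(sizes, slots=3, descending=True))   # 16.0 (longest-first)
print(makespan(sizes, slots=3, descending=False))  # 17.0 (shortest-first)
```

Scheduling the big tasks first leaves the small ones to fill in the gaps at the end, so no slot is stuck finishing a huge task while the others sit idle.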
Estimation of t_rem
• d: input data size
• d_read: the amount of data read so far
Estimation of t_new
• processRate: estimated from all tasks in the phase
• locationFactor: machine performance
• d: input size
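The slide lists the inputs but not the formulas. A simplified sketch of how such estimators could combine them, assuming progress is proportional to data read (the Mantri paper's estimators also include wrap-up and scheduling-lag terms, omitted here):

```python
def estimate_t_rem(t_elapsed, d, d_read):
    # If reading d_read of d bytes took t_elapsed seconds, the whole
    # task should take t_elapsed * d / d_read; the remainder is that
    # total minus the time already spent.
    assert 0 < d_read <= d
    return t_elapsed * d / d_read - t_elapsed

def estimate_t_new(process_rate, location_factor, d):
    # process_rate: seconds per byte, estimated from all tasks in the phase.
    # location_factor: relative slowdown of the candidate machine
    # (both values below are made up for illustration).
    return process_rate * location_factor * d

print(estimate_t_rem(t_elapsed=60, d=100, d_read=25))            # 180.0
print(estimate_t_new(process_rate=0.5, location_factor=1.2, d=100))
```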
Results
• Deployed in production Cosmos clusters
– Prototype baking on pre-production clusters since Jan '10; released May '10
• Trace-driven simulations
– Thousands of jobs
– Mimic workflow, task runtime, data skew, failure probability
– Compare with existing schemes and idealized oracles
Evaluation Methodology
• Mantri run on production clusters
• Baseline is results from Dryad
• Use trace-driven simulations to compare with other systems
Comparing jobs in the wild
• With and without Mantri, for one month of jobs in Bing's production cluster
• 340 jobs that each repeated at least five times during May 25–28 (release) vs. Apr 1–30 (pre-release)
• In production, restarts improve on native Cosmos by 25% while using fewer resources
In trace-replay simulations, restarts are much better dealt with in a cause-aware, resource-aware manner. Each job was repeated thrice.
[Figure: CDFs of the percentage of cluster resources used under each scheme.]
Network-aware Placement
• Equal: assumes all links have the same bandwidth
• Start: uses the bandwidths available at the start
• Ideal: uses the available bandwidth at run time
Protecting against recomputes
[Figure: CDF of the percentage of cluster resources used with and without output replication.]
Summary
a) Reduce recomputation: preferentially replicate the output of tasks that are costly to recompute
b) Poor network: each job locally avoids network hot-spots
c) Bad machines: quarantine persistently faulty machines
d) DataToProcess: schedule tasks in descending order of data size
e) Others: restart or duplicate tasks, cognizant of resource cost, and prune copies that are no longer needed
Conclusion
• Outliers in map-reduce clusters are a significant problem
• They happen due to many causes
– An interplay between storage, network, and map-reduce computation
• Cause-aware, resource-aware mitigation improves on prior art