Lecture 14: Combating Outliers in MapReduce Clusters
Xiaowei Yang
• References:
– Reining in the Outliers in Map-Reduce Clusters Using Mantri, by Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris
– http://research.microsoft.com/en-us/UM/people/srikanth/data/Combating%20Outliers%20in%20Map-Reduce.web.pptx
[Figure: log(dataset size), GB (10^9) through EB (10^18), vs. log(cluster size), 1 through 10^5 machines; regions labeled "HPC, || databases" and "MapReduce".]
MapReduce
• Decouples customized data operations from the mechanisms used to scale
• Is widely used
– Cosmos (based on SVC's Dryad) + Scope @ Bing
– MapReduce @ Google
– Hadoop inside Yahoo! and on Amazon's cloud (AWS)
• Operates on very large datasets, e.g., the Internet, click logs, bio/genomic data
An Example
• Goal: find frequent search queries to Bing
• What the user says:

    SELECT Query, COUNT(*) AS Freq
    FROM QueryTable
    HAVING Freq > X

How it works:
[Figure: a job manager assigns work to tasks and tracks their progress; map tasks read file blocks, write intermediate output locally, and reduce tasks produce the output blocks.]
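To make the map/reduce split concrete, here is a toy, single-process sketch of how the query-frequency job above decomposes into a map function, a shuffle, and a reduce function. The names (map_fn, reduce_fn, run_job) are illustrative, not Scope or Dryad APIs.

```python
from collections import defaultdict

def map_fn(log_line):
    # Emit (query, 1) for each query in the input block.
    query = log_line.strip()
    return [(query, 1)]

def reduce_fn(query, counts):
    # Sum the counts for one query.
    return (query, sum(counts))

def run_job(log_lines, threshold):
    # "Shuffle": group map output by key, as the framework would.
    groups = defaultdict(list)
    for line in log_lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce each group, then apply HAVING Freq > threshold.
    results = [reduce_fn(q, c) for q, c in groups.items()]
    return {q: f for q, f in results if f > threshold}

print(run_job(["cats", "dogs", "cats", "cats"], 1))  # {'cats': 3}
```

In the real system the map tasks run on different machines against different file blocks, and the shuffle moves data over the network, which is where outliers arise.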
Outliers slow down map-reduce jobs
[Figure: a job's workflow through the file system — Map.Read, Map.Move, Map, Barrier, Reduce — annotated with phase counts (Map.Read 22K, Map.Move 15K, Map 13K, Reduce 51K).]
Goals
• Speeding up jobs improves productivity
• Predictability supports SLAs
• … while using resources efficiently
What is an outlier?
• A phase (map or reduce) has n tasks and s slots (available compute resources)
• If every task takes T seconds to run, a naïve scheduler's ideal run time is ceil(n/s) * T
• In practice, task i's run time t_i = f(datasize, code, machine, network)
• The goal is to finish close to the lower bound

    max( (1/s) * Σ_{i=1..n} t_i ,  max_i t_i )
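A small numeric sketch of these two quantities, assuming the simple model above (uniform tasks give the ceiling bound; one straggler dominates the lower bound):

```python
import math

def ideal_runtime(n, s, T):
    # All n tasks take T seconds; s slots run them in ceil(n/s) waves.
    return math.ceil(n / s) * T

def lower_bound(task_times, s):
    # No schedule can beat the work bound (total work / slots)
    # or the longest single task.
    return max(sum(task_times) / s, max(task_times))

# 8 uniform 10-second tasks on 4 slots: two waves, 20 seconds.
print(ideal_runtime(8, 4, 10))            # 20
# Same phase, but one 50-second straggler: the phase cannot finish
# before 50 seconds no matter how the other tasks are packed.
print(lower_bound([10] * 7 + [50], 4))    # 50.0
```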
From a phase to a job
• A job may have many phases
• An outlier in an early phase has a cumulative effect
• Data loss may cause multi-phase recomputes; the delay due to a recompute readily cascades
Why outliers?
[Figure: a map → sort → reduce pipeline in which the recompute of a map task delays the downstream sort and reduce phases.]
• Problem: due to unavailable input, tasks have to be recomputed
Previous work
• The original MapReduce paper observed the problem
– But didn't deal with it in depth
• Its solution was to duplicate the slow tasks
• Drawbacks
– Some duplicates may be unnecessary
– Duplicates use extra resources
– Placement may be the real problem, in which case duplication does not help
Quantifying the Outlier Problem
• Approach: understand the problem first before proposing solutions
– Understanding often leads to solutions
1. Prevalence of outliers
2. Causes of outliers
3. Impact of outliers

• Stragglers = tasks that take more than 1.5 times the median task in that phase
• Recomputes = tasks that are re-run because their output was lost

• 50% of phases have 10% stragglers and no recomputes
• 10% of the stragglers take >10x longer
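The straggler definition above is directly computable from a phase's task durations. A minimal sketch (the 1.5x-median threshold is from the paper; the data is made up):

```python
from statistics import median

def stragglers(durations, factor=1.5):
    # A task is a straggler if it takes more than `factor` times
    # the median task duration in its phase.
    m = median(durations)
    return [i for i, d in enumerate(durations) if d > factor * m]

# Six ordinary tasks and one task that takes 4x the median.
phase = [10, 11, 9, 10, 12, 40, 10]
print(stragglers(phase))  # [5]
```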
Why bother? Frequency of outliers
[Figure: per-phase task timelines; straggler tasks extend well past the rest, and a single outlier stretches the phase's completion time.]
Causes of outliers: data skew
• In 40% of the phases, all the tasks with high runtimes (>1.5x the median task) correspond to large amounts of data over the network
• Duplicating such tasks will not help: they are slow because of their input, not their machine
• Non-outliers can be improved as well: 20% of them take 55% longer than the median
[Figure: reduce tasks pulling map output across racks see uneven network load.]
• Uneven placement is typical in production: reduce tasks are placed at the first available slot
• Problem: tasks reading input over the network experience variable congestion
Causes of outliers: cross-rack traffic
• 70% of cross-rack traffic is reduce traffic
• Tasks in a spot with a slow network run slower
• Tasks compete for the network among themselves
– A reduce task reads from every map task
– A reduce task is put into any spare slot
• In 50% of phases, the reduce phase takes 62% longer to finish than under ideal placement
Causes of outliers: bad and busy machines
• 50% of recomputes happen on 5% of the machines
• Recomputes increase resource usage
• Outliers cluster by time
– Resource contention might be the cause
• Recomputes cluster by machine
– Data loss may cause multiple recomputes
Why bother? Cost of outliers
(What-if analysis: replay logs in a trace-driven simulator)
• At the median, jobs are slowed down by 35% due to outliers
Mantri Design
High-level idea
• Cause-aware and resource-aware
• Runtime = f(input, network, machine, dataToProcess, …)
• Fix each cause with a different strategy
Resource-aware restarts
• Duplicate or kill long-running outliers
When to restart
• Every ∆ seconds, tasks report progress
• Estimate t_rem (remaining time of the running copy) and t_new (predicted run time of a fresh copy)
• γ = 3 (threshold parameter used in the restart test)
• Schedule a duplicate only if the total running time is expected to shrink: with c copies running, duplicate when P(c · t_rem > (c+1) · t_new) > δ
• When there are available slots, restart if the expected reduction in time exceeds the restart cost: E(t_rem − t_new) > ρ · ∆
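The two restart tests can be sketched as follows. This is a hedged illustration, not Mantri's implementation: the distribution of t_new is approximated here by a sample of past task times from the phase, and δ and ρ are just policy knobs.

```python
def should_duplicate(t_rem, t_new_samples, copies, delta=0.25):
    # With c copies running, launching one more pays off only if
    # c * t_rem > (c + 1) * t_new with probability > delta.
    hits = sum(1 for t_new in t_new_samples
               if copies * t_rem > (copies + 1) * t_new)
    return hits / len(t_new_samples) > delta

def should_restart(t_rem, t_new_samples, rho, report_interval):
    # With spare slots, kill-and-restart if the expected time saved,
    # E(t_rem - t_new), exceeds rho reporting intervals.
    expected_t_new = sum(t_new_samples) / len(t_new_samples)
    return (t_rem - expected_t_new) > rho * report_interval

samples = [30.0, 35.0, 40.0]   # made-up run times of peer tasks
print(should_duplicate(t_rem=120.0, t_new_samples=samples, copies=1))  # True
print(should_restart(120.0, samples, rho=2, report_interval=10))       # True
```

A task far behind its peers (120 s remaining vs. ~35 s for a fresh copy) passes both tests; a near-finished task would fail them and be left alone.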
Network-Aware Placement
• Compute the rack location for each task
• Find the placement that minimizes the maximum data-transfer time
• If rack i has d_i map output and bandwidths u_i, v_i available on its uplink and downlink, place an a_i fraction of the reduce tasks on rack i so that the maximum transfer time across racks is minimized
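A sketch of that objective for two racks. The transfer-time model is an assumption consistent with the variables above: rack i must upload the map output other racks need, d_i · (1 − a_i), over uplink u_i, and download the rest of the data, (D − d_i) · a_i, over downlink v_i. The grid search is only for illustration; the real optimizer is not specified here.

```python
def transfer_time(a, d, u, v):
    # Worst per-rack transfer time for a placement a (fractions per rack).
    D = sum(d)
    times = []
    for i in range(len(d)):
        up = d[i] * (1 - a[i]) / u[i]        # output shipped off rack i
        down = (D - d[i]) * a[i] / v[i]      # input pulled onto rack i
        times.append(max(up, down))
    return max(times)

def best_split_two_racks(d, u, v, steps=1000):
    # Place fraction a of the reduces on rack 0 and 1 - a on rack 1;
    # pick the a minimizing the maximum per-rack transfer time.
    return min((k / steps for k in range(steps + 1)),
               key=lambda a: transfer_time([a, 1 - a], d, u, v))

# Rack 0 holds 80% of the map output; equal link bandwidths.
a0 = best_split_two_racks(d=[80, 20], u=[1, 1], v=[1, 1])
print(round(a0, 2))  # 0.8 — more reduces land on the rack holding more data
```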
Avoid recomputation
• Replicate task output
– Restart a task early if its input data are lost
– Replicate the most costly output, i.e., the output that is most expensive to recompute
Data-aware task ordering
• Outliers are often due to large inputs
• Schedule tasks in descending order of dataToProcess
• This is at most 33% worse than optimal scheduling
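The 33% figure is the classic longest-processing-time bound (4/3 of optimal). A sketch of the greedy rule, assuming task run time is proportional to dataToProcess and each task goes to the earliest-free slot:

```python
import heapq

def makespan(task_sizes, slots, descending=True):
    # Greedy list scheduling: assign each task, in the given order,
    # to the slot that frees up earliest; return the phase finish time.
    order = sorted(task_sizes, reverse=descending)
    finish = [0.0] * slots          # per-slot finish times (min-heap)
    heapq.heapify(finish)
    for t in order:
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + t)
    return max(finish)

sizes = [7, 7, 6, 6, 5, 5, 4, 4, 4]   # made-up dataToProcess values
print(makespan(sizes, slots=3, descending=True))   # 16.0 (longest-first)
print(makespan(sizes, slots=3, descending=False))  # 17.0 (shortest-first)
```

Scheduling the big tasks first leaves the small ones to fill in the gaps at the end, so no slot is stuck finishing a huge task while the others sit idle.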
Estimation of t_rem
• d: input data size
• d_read: the amount of data read so far
Estimation of t_new
• processRate: estimated from all tasks in the phase
• locationFactor: machine performance
• d: input size
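The slide lists the inputs but not the formulas. A simplified sketch of how such estimators could combine them, assuming progress is proportional to data read (the Mantri paper's estimators also include wrap-up and scheduling-lag terms, omitted here):

```python
def estimate_t_rem(t_elapsed, d, d_read):
    # If reading d_read of d bytes took t_elapsed seconds, the whole
    # task should take t_elapsed * d / d_read; the remainder is that
    # total minus the time already spent.
    assert 0 < d_read <= d
    return t_elapsed * d / d_read - t_elapsed

def estimate_t_new(process_rate, location_factor, d):
    # process_rate: seconds per byte, estimated from all tasks in the phase.
    # location_factor: relative slowdown of the candidate machine
    # (both values below are made up for illustration).
    return process_rate * location_factor * d

print(estimate_t_rem(t_elapsed=60, d=100, d_read=25))            # 180.0
print(estimate_t_new(process_rate=0.5, location_factor=1.2, d=100))
```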
Results
• Deployed in production Cosmos clusters
– Prototype baking on pre-production clusters since Jan '10; released May '10
• Trace-driven simulations
– Thousands of jobs
– Mimic workflow, task runtime, data skew, failure probability
– Compare with existing schemes and idealized oracles
Evaluation Methodology
• Mantri run on production clusters
• Baseline is results from Dryad
• Use trace-driven simulations to compare with other systems
Comparing jobs in the wild
• With and without Mantri, for one month of jobs in Bing's production cluster
• 340 jobs that each repeated at least five times during May 25–28 (release) vs. Apr 1–30 (pre-release)
• In production, restarts improve on native Cosmos by 25% while using fewer resources
In trace-replay simulations, restarts are much better dealt with in a cause-aware, resource-aware manner. Each job was repeated thrice.
[Figure: CDFs of the percentage of cluster resources used under each scheme.]
Network-aware Placement
• Equal: assumes all links have the same bandwidth
• Start: uses the bandwidths available at the start
• Ideal: uses the available bandwidth at run time
Protecting against recomputes
[Figure: CDF of the percentage of cluster resources used with and without output replication.]
Summary
a) Reduce recomputation: preferentially replicate the output of tasks that are costly to recompute
b) Poor network: each job locally avoids network hot-spots
c) Bad machines: quarantine persistently faulty machines
d) DataToProcess: schedule tasks in descending order of data size
e) Others: restart or duplicate tasks, cognizant of resource cost, and prune copies that are no longer needed
Conclusion
• Outliers in map-reduce clusters are a significant problem
• They happen due to many causes
– An interplay between storage, network, and map-reduce computation
• Cause-aware, resource-aware mitigation improves on prior art