Coflow: Mending the Application-Network Gap in Big Data Analytics
Mosharaf Chowdhury, UC Berkeley


TRANSCRIPT

Page 1: Cof low Mending the Application-Network Gap in Big Data Analytics Mosharaf Chowdhury UC Berkeley

Coflow: Mending the Application-Network Gap in Big Data Analytics

Mosharaf Chowdhury

UC Berkeley

Page 2:

Big Data

The volume of data businesses want to make sense of is increasing

Increasing variety of sources: Web, mobile, wearables, vehicles, scientific, …

Cheaper disks, SSDs, and memory

Stalling processor speeds

Page 3:

Big Datacenters for Massive Parallelism

[Timeline, 2005–2015: MapReduce, Dryad, Hadoop, DryadLINQ, Hive, Pregel, Dremel, Spark, Storm, GraphLab, Spark Streaming, GraphX, BlinkDB]

Page 4:

Data-Parallel Applications

Multi-stage dataflow: computation interleaved with communication

Computation stage (e.g., Map, Reduce): distributed across many machines; tasks run in parallel

Communication stage (e.g., Shuffle): between successive computation stages

[Figure: Map Stage → Shuffle → Reduce Stage]

A communication stage cannot complete until all the data have been transferred

Page 5:

Communication is Crucial for Performance

Facebook jobs spend ~25% of runtime on average in intermediate communication.1

As SSD-based and in-memory systems proliferate, the network is likely to become the primary bottleneck.

1. Based on a month-long trace with 320,000 jobs and 150 million tasks, collected from a 3000-machine Facebook production MapReduce cluster.

Page 6:

Faster Communication Stages: Networking Approach

“Configuration should be handled at the system level”

Flow: transfers data from a source to a destination. The independent unit of allocation, sharing, load balancing, and/or prioritization.

Page 7:

Existing Solutions

[Timeline, 1980s–2015: GPS, WFQ, RED, CSFQ, ECN, RCP, XCP, DCTCP, DeTail, D3, D2TCP, PDQ, FCP, pFabric; the focus shifted from per-flow fairness to flow completion time]

Independent flows cannot capture the collective communication behavior common in data-parallel applications.

Page 8:

Why Do They Fall Short?

[Figure: datacenter network connecting senders s1–s3 to receivers r1–r2 over input links 1–3 and output links 1–3]

Page 9:

Why Do They Fall Short?

[Figure: the same network, now with the shuffle's flows from s1–s3 to r1 and r2 drawn across the links]

Page 10:

Why Do They Fall Short?

[Figure: per-flow fair sharing on the links to r1 and r2; on each link, flows finish at times 3, 3, and 5. Shuffle completion time = 5; avg. flow completion time = 3.66]

Solutions focusing on flow completion time cannot further decrease the shuffle completion time.
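The per-flow fair sharing numbers on this slide can be reproduced with a small progressive-filling model. A minimal sketch, assuming a single unit-capacity link carrying flows of sizes 1, 1, and 3 (a hypothetical breakdown, chosen because it is consistent with the completion times 3, 3, and 5 shown above):

```python
def fair_fcts(sizes, capacity=1.0):
    """Flow completion times under max-min fair sharing of one link.

    All active flows share the link equally; whenever the smallest
    remaining flow finishes, the freed bandwidth is split among survivors.
    """
    rem = sorted(sizes)
    t, fcts = 0.0, []
    while rem:
        n = len(rem)
        smallest = rem[0]
        t += smallest * n / capacity      # time until the smallest flow drains at rate capacity/n
        rem = [r - smallest for r in rem[1:]]
        fcts.append(t)
    return fcts

fcts = fair_fcts([1, 1, 3])
print(fcts)                               # [3.0, 3.0, 5.0]
print(max(fcts))                          # shuffle (last-flow) completion time: 5.0
print(sum(fcts) / len(fcts))              # avg. flow completion time: ~3.67
```

The point of the slide follows directly: the average flow completion time can be tuned, but the shuffle ends only when the last flow does.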

Page 11:

Improve Application-Level Performance1

Slow down faster flows to accelerate slower flows.

[Figure: per-flow fair sharing finishes the shuffle at time 5 (avg. flow completion time = 3.66); data-proportional allocation finishes every flow at time 4, so shuffle completion time = 4 (avg. flow completion time = 4)]

1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.

Page 12:

Faster Communication Stages: Systems Approach

“Configuration should be handled by the end users”

Applications know their performance goals, but they have no means to let the network know.

Page 13:

Networking approach: “Configuration should be handled at the system level”
Systems approach: “Configuration should be handled by the end users”

MIND THE GAP

Holistic Approach: Applications and the Network Working Together

Page 14:

Coflow: a communication abstraction for data-parallel applications to express their performance goals:
1. Minimize completion times,
2. Meet deadlines, or
3. Perform fair allocation.

Page 15:

[Figure: coflow patterns: Single Flow; Parallel Flows (Broadcast, Aggregation, Shuffle); All-to-All]

Page 16:

How to schedule coflows online …

1. … for faster completion of coflows?
2. … to meet more deadlines?
3. … for fair allocation of the network?

[Figure: N input ports and N output ports of a datacenter fabric]

Page 17:

Varys1 enables coflows in data-intensive clusters

1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users

1. Efficient Coflow Scheduling with Varys, SIGCOMM’2014.

Page 18:

Coflow: a communication abstraction for data-parallel applications to express their performance goals

1. The size of each flow,
2. The total number of flows, and
3. The endpoints of individual flows.

Page 19:

Benefits of Inter-Coflow Scheduling

[Setup: two links. Link 1 carries Coflow 1’s 3-unit flow and Coflow 2’s 2-unit flow; Link 2 carries Coflow 2’s 6-unit flow]

Fair Sharing: Coflow 1 comp. time = 5; Coflow 2 comp. time = 6
Smallest-Flow First1,2: Coflow 1 comp. time = 5; Coflow 2 comp. time = 6
The Optimal: Coflow 1 comp. time = 3; Coflow 2 comp. time = 6

1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.

Page 20:

Benefits of Inter-Coflow Scheduling

Fair Sharing: Coflow 1 comp. time = 6; Coflow 2 comp. time = 6
Flow-level Prioritization1,2: Coflow 1 comp. time = 6; Coflow 2 comp. time = 6
The Optimal: Coflow 1 comp. time = 3; Coflow 2 comp. time = 6

Inter-coflow scheduling generalizes Concurrent Open Shop Scheduling,3 which is NP-hard
• Examples include job scheduling and caching blocks
• Solutions use an ordering heuristic

1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
3. A Note on the Complexity of the Concurrent Open Shop Problem, Journal of Scheduling, 9(4):389–396, 2006.

Page 21:

Inter-Coflow Scheduling is NP-hard

[Figure: datacenter fabric with input links 1–3 and output links 1–3, carrying demands of 3, 6, and 2 units]

Concurrent Open Shop Scheduling with Coupled Resources
• Examples include job scheduling and caching blocks
• Solutions use an ordering heuristic
• Must also consider matching constraints

Page 22:

Varys employs a two-step algorithm to minimize coflow completion times:

1. Ordering heuristic: keep an ordered list of coflows to be scheduled, preempting if needed
2. Allocation algorithm: allocate the minimum required resources to each coflow to finish in minimum time

Page 23:

Ordering Heuristic

Coflow lengths: C1 = 3, C2 = 5, C3 = 6

Shortest-First: C1 ends at 3, C2 ends at 13, C3 ends at 19 (Total CCT = 35)

Page 24:

Ordering Heuristic

Coflow widths: C1 = 3, C2 = 2, C3 = 1

Shortest-First: Total CCT = 35
Narrowest-First: C3 ends at 6, C2 ends at 16, C1 ends at 19 (Total CCT = 41)

Page 25:

Ordering Heuristic

Coflow sizes: C1 = 9, C2 = 10, C3 = 6

Shortest-First: Total CCT = 35
Narrowest-First: Total CCT = 41
Smallest-First: C3 ends at 6, C1 ends at 9, C2 ends at 19 (Total CCT = 34)

Page 26:

Ordering Heuristic

Coflow bottlenecks: C1 = 3, C2 = 10, C3 = 6

Shortest-First: Total CCT = 35
Narrowest-First: Total CCT = 41
Smallest-First: Total CCT = 34
Smallest-Bottleneck: C1 ends at 3, C3 ends at 9, C2 ends at 19 (Total CCT = 31)
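The smallest-bottleneck ordering can be sketched in a few lines. This is a simplified model, not the Varys implementation: a coflow is a dict mapping a port to its demand (in time units at unit capacity), coflows get strict priority in the chosen order, and each port serves its queued demand FIFO in that order. The two-coflow setup below is an assumed example consistent with the earlier inter-coflow scheduling slides.

```python
def bottleneck(coflow):
    # a coflow's standalone completion time is set by its most-loaded port
    return max(coflow.values())

def completion_times(coflows, order):
    """Per-coflow completion times when coflows get strict priority in `order`."""
    port_clock = {}   # cumulative work queued on each port so far
    done = {}
    for name in order:
        finish = 0.0
        for port, demand in coflows[name].items():
            port_clock[port] = port_clock.get(port, 0.0) + demand
            finish = max(finish, port_clock[port])
        done[name] = finish
    return done

# C1: one 3-unit flow on link L1; C2: 2 units on L1 and 6 units on L2.
coflows = {"C1": {"L1": 3}, "C2": {"L1": 2, "L2": 6}}
order = sorted(coflows, key=lambda c: bottleneck(coflows[c]))  # smallest bottleneck first
print(completion_times(coflows, order))   # {'C1': 3.0, 'C2': 6.0}
```

Scheduling C1 first costs C2 nothing here (its 6-unit flow still pins it at time 6), which is exactly why bottleneck-aware ordering beats flow-level heuristics.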

Page 27:

Allocation Algorithm

A coflow cannot finish before its very last flow, so finishing flows faster than the bottleneck cannot decrease the coflow’s completion time.

Allocate minimum flow rates such that all flows of a coflow finish together, on time.
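The allocation step follows directly from this observation. A minimal sketch, assuming one flow per link and unit link capacity (hypothetical simplifications; the real scheduler also handles shared links and deadlines): target the bottleneck's completion time and slow every other flow down to it.

```python
def allocate_rates(flow_sizes, capacity=1.0):
    """Minimum per-flow rates so that all flows of a coflow finish together.

    The coflow cannot beat its bottleneck flow, so finish everything at the
    bottleneck's completion time; the slack bandwidth is freed for others.
    """
    t = max(flow_sizes) / capacity        # bottleneck completion time
    return [size / t for size in flow_sizes]

print(allocate_rates([2, 4, 8]))          # [0.25, 0.5, 1.0]; all flows finish at t = 8
```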

Page 28:

Varys enables coflows in data-intensive clusters

1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users

Page 29:

The Need for Coordination

Coflow bottlenecks: C1 = 4, C2 = 5

Scheduling with coordination: C1 ends at 4, C2 ends at 9 (Total CCT = 13)

Page 30:

The Need for Coordination

Scheduling with coordination: C1 ends at 4, C2 ends at 9 (Total CCT = 13)
Scheduling without coordination: C1 ends at 7, C2 ends at 12 (Total CCT = 19)

Uncoordinated local decisions interleave coflows, hurting performance.

Page 31:

Varys Architecture1

Centralized master-slave architecture; applications use a client library to communicate with the master. Actual timing and rates are determined by the coflow scheduler.

[Figure: computation tasks (driver, senders, receivers) call the Varys client library (reg/put/get); Varys daemons report to the Varys master, which hosts the coflow scheduler, a topology monitor, and a usage estimator on top of the network interface and the (distributed) file system]

1. Download from http://varys.net

Page 32:

Varys enables coflows in data-intensive clusters

1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users

Page 33:

The Coflow API: register, put, get, unregister

@driver: b ← register(BROADCAST); id ← b.put(content) …; s ← register(SHUFFLE)
@mapper: b.get(id) …; ids ← s.put(content) …
@reducer: s.get(ids) …
@driver: s.unregister(); b.unregister()

1. NO changes to user jobs
2. NO storage management

[Figure: driver (JobTracker) coordinating a broadcast to the mappers and a shuffle from mappers to reducers]
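The four calls can be mirrored in a toy client library. This is a hypothetical in-memory rendering, not the Varys client; all names and signatures here are illustrative only.

```python
import itertools

class Coflow:
    """Toy stand-in for one registered coflow (e.g., a broadcast or shuffle)."""
    _ids = itertools.count(1)

    def __init__(self, pattern):
        self.pattern = pattern            # e.g., "BROADCAST" or "SHUFFLE"
        self.store = {}                   # flow id -> content
        self.active = True

    def put(self, content):               # sender side: expose data, get back a flow id
        fid = next(Coflow._ids)
        self.store[fid] = content
        return fid

    def get(self, fid):                   # receiver side: fetch when the scheduler allows
        return self.store[fid]

    def unregister(self):                 # driver side: the coflow is complete
        self.active = False

def register(pattern):                    # driver side: announce a new coflow
    return Coflow(pattern)

# Mirrors the slide: the driver registers, senders put, receivers get.
b = register("BROADCAST")
fid = b.put("job binary")
assert b.get(fid) == "job binary"
b.unregister()
```

In the real system, `put`/`get` would also tell the scheduler the coflow's demands, so it can pace every flow network-wide; the application code stays this simple.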

Page 34:

Evaluation: a 3000-machine trace-driven simulation matched against a 100-machine EC2 deployment

1. Does it improve performance?
2. Can it beat non-preemptive solutions?
3. Do we really need coordination?

YES

Page 35:

Better than Per-Flow Fairness

       Comm. Improv.   Job Improv.
EC2    1.85×           1.25×
Sim.   3.21×           1.11×

Comm.-heavy jobs improve more: 2.50× / 3.16× and 3.39× / 4.86×

Page 36:

Preemption is Necessary [Sim.]

[Figure: overhead over Varys: Varys 1.00; per-flow fairness 3.21; FIFO1 5.65; per-flow prioritization 5.53; FIFO-LM 22.07; NC 1.10]

What about starvation? NO.

1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.

Page 37:

Lack of Coordination Hurts [Sim.]

Smallest-flow-first (per-flow prioritization)2,3 minimizes flow completion time.

FIFO-LM4 performs decentralized coflow scheduling: it works well for small, similar coflows but suffers due to local decisions.

[Figure: overhead over Varys: Varys 1.00; per-flow fairness1 3.21; FIFO 5.65; per-flow prioritization2,3 5.53; FIFO-LM4 22.07; NC 1.10]

1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.
2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM’2014.

Page 38: Cof low Mending the Application-Network Gap in Big Data Analytics Mosharaf Chowdhury UC Berkeley

Cof low1. The size of each flow, 2. The total number of flows,

and 3. The endpoints of individual

flows.

Pipelining between stages Speculative executions Task failures and restarts

Communication abstraction for data-parallel applications to express their performance goals

Page 39:

How to Perform Coflow Scheduling Without Complete Knowledge?

Page 40:

Implications

Minimize avg. comp. time     Flows in a single link         Coflows in an entire datacenter
With complete knowledge      Smallest-Flow-First            Ordering by bottleneck size + data-proportional rate allocation
Without complete knowledge   Least-Attained Service (LAS)   ?

Page 41:

Revisiting Ordering Heuristics

Shortest-First: C1 ends at 3, C2 at 13, C3 at 19 (Total CCT = 35)
Narrowest-First: C3 ends at 6, C2 at 16, C1 at 19 (Total CCT = 41)
Smallest-First: C3 ends at 6, C1 at 9, C2 at 19 (Total CCT = 34)
Smallest-Bottleneck: C1 ends at 3, C3 at 9, C2 at 19 (Total CCT = 31)

Page 42:

Coflow-Aware LAS (CLAS)

Set a priority that decreases with how much a coflow has already sent:
• The more a coflow has sent, the lower its priority
• Smaller coflows finish faster

Use the total size of coflows to set priorities, avoiding the drawbacks of full decentralization.

Page 43:

Coflow-Aware LAS (CLAS)

Continuous priority reduces to fair sharing when similar coflows coexist (priority oscillation): both coflows finish at time 6.

FIFO works well for similar coflows: Coflow 1 finishes at 3, Coflow 2 at 6.

Page 44:

Discretized Coflow-Aware LAS (D-CLAS)

Priority discretization: change a coflow’s priority when its total size exceeds predefined thresholds.

Scheduling policies:
• FIFO within the same queue
• Prioritization across queues

Weighted sharing across queues guarantees starvation avoidance.

[Figure: K queues, from Q1 (highest priority) to QK (lowest priority), each served FIFO internally]

Page 45:

How to Discretize Priorities?

Exponentially spaced thresholds 0, E^1 A, E^2 A, …, E^(K-1) A, where
• K: number of queues
• A: threshold constant
• E: threshold exponent

Loose coordination suffices to calculate global coflow sizes; slaves make independent decisions in between.

Small coflows (smaller than E^1 A) do not experience coordination overheads!

[Figure: K queues, from Q1 (highest priority) to QK (lowest priority), each served FIFO internally]
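Queue assignment under these thresholds takes only a few lines. A minimal sketch; K, A, and E are the slide's parameters, but the numeric defaults below (10 queues, A = 10 MiB, E = 10) are assumptions for illustration:

```python
def queue_of(total_sent, K=10, A=10 * 2**20, E=10):
    """D-CLAS queue index (1 = highest priority) for a coflow's bytes sent so far.

    Queue boundaries are exponentially spaced: 0, E*A, E^2*A, ..., E^(K-1)*A;
    a coflow demotes itself as its cumulative size crosses each boundary.
    """
    for q in range(1, K):
        if total_sent < A * E**q:
            return q
    return K                              # everything past the last threshold

MB = 2**20
print(queue_of(50 * MB))                  # 1: below E*A, so never needs coordination
print(queue_of(500 * MB))                 # 2
print(queue_of(2 * 10**17))               # 10: lowest-priority queue
```

Because demotion depends only on how much the coflow has already sent, no prior knowledge of flow sizes is needed, which is exactly the non-clairvoyant setting of the previous slides.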

Page 46:

Closely Approximates Varys [Sim. & EC2]

[Figure: overhead over Varys: Varys 1.00; Varys w/o complete knowledge 1.10; per-flow fairness1 3.21; FIFO 5.65; per-flow prioritization2,3 5.53; FIFO-LM4 22.07]

1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.
2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM’2014.

Page 47:

My Contributions

Datacenter resource allocation: FairCloud (SIGCOMM’12), HARP (SIGCOMM’12), ViNEYard (ToN’12)
Network-aware applications: Spark (NSDI’12), Sinbad (SIGCOMM’13)
Application-aware network scheduling (Coflow): Orchestra (SIGCOMM’11), Varys (SIGCOMM’14), Aalo (SIGCOMM’15)

Adoption: top-level Apache project (Spark); merged with Spark (Orchestra); merged at Facebook (Sinbad); @HP; @Microsoft Bing; open-source (Varys, Aalo)

Page 48:

Communication-First Big Data Systems

In-datacenter analytics: cheaper SSDs and DRAM, proliferation of optical networks, and resource disaggregation will make the network the primary bottleneck

Inter-datacenter analytics: bandwidth-constrained wide-area networks

End-user delivery: faster, more responsive delivery of analytics results over the Internet for a better end-user experience

Page 49:

MIND THE GAP (Networking ↔ Systems)

Better capture application-level performance goals using coflows.

Coflows improve application-level performance and usability, extending the networking and scheduling literature.

Coordination, even if not free, is worth paying for in many cases.

[email protected]
http://mosharaf.com

Page 50:

Page 51:

Improve Flow Completion Times

[Figure: per-flow fair sharing: shuffle completion time = 5, avg. flow completion time = 3.66. Smallest-Flow First1,2: flows finish at 1, 1, 2, 2, 4, and 6; shuffle completion time = 6, avg. flow completion time = 2.66]

1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.

Page 52:

Distributions of Coflow Characteristics

[Figure: CDFs (fraction of coflows) of coflow length (bytes), coflow width (number of flows), coflow size (bytes), and coflow bottleneck size (bytes)]

Page 53:

Traffic Sources (percentage of traffic by category at Facebook)

1. Ingest and replicate new data: 30%
2. Read input from remote machines, when needed: 14%
3. Transfer intermediate data: 46%
4. Write and replicate output: 10%

Page 54:

Distribution of Shuffle Durations

Performance: Facebook jobs spend ~25% of runtime on average in intermediate communication.

Month-long trace from a 3000-machine MapReduce production cluster at Facebook: 320,000 jobs, 150 million tasks.

[Figure: CDF of the fraction of job runtime spent in shuffle]

Page 55:

Theoretical Results

Structure of optimal schedules: permutation schedules might not always lead to the optimal solution.

Approximation ratio of COSS-CR: polynomial-time algorithm with a constant approximation ratio (64/3).1

The need for coordination: fully decentralized schedulers can perform arbitrarily worse than the optimal.

1. Due to Zhen Qiu, Cliff Stein, and Yuan Zhong from the Department of Industrial Engineering and Operations Research, Columbia University, 2014.

Page 56:

The Coflow API: register, put, get, unregister

@driver: b ← register(BROADCAST, numFlows); id ← b.put(content, size) …; s ← register(SHUFFLE, numFlows, {b})
@mapper: b.get(id) …; ids ← s.put(content, size) …
@reducer: s.get(ids) …
@driver: s.unregister(); b.unregister()

1. NO changes to user jobs
2. NO storage management

Page 57:

Varys employs a two-step algorithm to support coflow deadlines:

1. Admission control: do not admit any coflow that cannot be completed within its deadline without violating existing deadlines
2. Allocation algorithm: allocate the minimum required resources to each coflow to finish it exactly at its deadline
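The admission test reduces to a per-link rate check. A minimal sketch, assuming unit-capacity links and rigid rate reservations for already-admitted deadline coflows (simplifications relative to the real scheduler):

```python
def admit(demand, deadline, reserved, capacity=1.0):
    """Admit a coflow only if its minimum required rate fits on every link.

    demand:   per-link bytes the new coflow must move
    reserved: per-link rate already promised to admitted deadline coflows
    """
    return all(
        bytes_on_link / deadline <= capacity - reserved.get(link, 0.0)
        for link, bytes_on_link in demand.items()
    )

reserved = {"L1": 0.5}
print(admit({"L1": 2, "L2": 3}, deadline=10, reserved=reserved))  # True: rates 0.2 and 0.3 fit
print(admit({"L1": 6}, deadline=10, reserved=reserved))           # False: needs 0.6, only 0.5 free
```

Rejecting up front, rather than letting a doomed coflow steal bandwidth, is what protects the deadlines of coflows that were already admitted.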

Page 58:

More Predictable

[Figure: percentage of coflows that met their deadline, were not admitted, or missed their deadline, for Varys vs. per-flow fairness vs. EDF (Earliest-Deadline First)1, in an EC2 deployment and a Facebook trace simulation]

1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.

Page 59:

Optimizing Communication Performance: Systems Approach

“Let users figure it out”

Number of communication parameters*: Spark-v1.1.1: 6; Hadoop-v1.2.1: 10; YARN-v2.6.0: 20

*Lower bound. Does not include many parameters that can indirectly impact communication (e.g., number of reducers). Also excludes control-plane communication/RPC parameters.

Page 60:

Experimental Methodology

Varys deployment in EC2:
• 100 m2.4xlarge machines
• Each machine has 8 CPU cores, 68.4 GB memory, and a 1 Gbps NIC
• ~900 Mbps/machine during all-to-all communication

Trace-driven simulation:
• Detailed replay of a day-long Facebook trace (circa October 2010)
• 3000-machine, 150-rack cluster with 10:1 oversubscription