Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can Virajith Jalaparti Peter Bodik, Ishai Menache, Sriram Rao, Konstantin Makarychev, Matthew Caesar



Page 1:

Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can

Virajith Jalaparti

Peter Bodik, Ishai Menache, Sriram Rao, Konstantin Makarychev, Matthew Caesar

Page 2:

Network scheduling important for data-parallel jobs

• Network-intensive stages (e.g., shuffle, join)
  – More than 50% of time spent in network transfers*
• Oversubscribed network from rack to core
  – Ratios between 3:1 and 10:1
• Cross-rack bandwidth shared across apps
  – Nearly 50% used for background transfers**

*Efficient Coflow Scheduling with Varys, SIGCOMM 2014
**Leveraging Endpoint Flexibility in Data-Intensive Clusters, SIGCOMM 2013

Page 3:

Several techniques proposed

• Focus on placing tasks (e.g., Tetris, Quincy)
• Focus on scheduling network flows (e.g., Varys, Baraat, D3)

[Figure: a MapReduce job with map tasks (M) reading input and feeding reducers (R)]

Limitation: Assume fixed input data placement

Page 4:

Limitation of existing techniques

• Map input data spread randomly (HDFS)
• Hence, reduce input spread randomly
• Problems
  – Use congested cross-rack links
  – Contention with other jobs

[Figure: a MapReduce job across Racks 1-4, with map inputs and map outputs/reduce inputs scattered over all racks]

Page 5:

Our proposal: Place input in a few racks

Page 6:

Our proposal: Place input in a few racks

All transfers stay within the rack: Rack-level locality

• Map input data placed in one rack
• Reduce input in the same rack

[Figure: a MapReduce job whose map inputs and map outputs/reduce inputs all reside in Rack 1 of Racks 1-4]

Page 7:

Our proposal: Place input in a few racks

• Benefits?
  – High bandwidth between tasks
  – Reduced contention across jobs
• Is placing data feasible?
  – Recurring jobs (~40%) known ahead of time
  – Separate storage and compute clusters

Page 8:

Challenges

• How many racks to assign a job?
  – One rack may be sufficient for 75-95% of jobs; 5-25% of jobs need more than one rack
• How to avoid hotspots?
  – Offline planning using job properties
• How to determine job characteristics?
  – Use history: characteristics can be predicted with low error for recurring jobs (~6.5% on average)
• What about jobs without history (ad hoc jobs)?
  – They benefit from freed-up resources

[Figure: input data size (log10 scale) of a recurring job over days 1-10]
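The history-based prediction above can be sketched as follows. This is a toy illustration (a simple mean over recent runs), not Corral's actual estimator; the function names and window size are my own assumptions:

```python
# Toy sketch: predict a recurring job's next input size from its
# recent history, and measure the relative prediction error.

def predict_next(history, window=5):
    """Predict the next value as the mean of the last `window` runs."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def relative_error(predicted, actual):
    """Relative error of a prediction against the observed value."""
    return abs(predicted - actual) / actual

sizes_gb = [98, 102, 100, 97, 103]   # daily input sizes of a recurring job
pred = predict_next(sizes_gb)        # -> 100.0
err = relative_error(pred, 101)      # next day's actual size: ~1% error
```

A job whose history shows this kind of stability is a good candidate for offline planning; jobs with high prediction error would fall back to the ad hoc path.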

Page 9:

Our system: Corral

Coordinated placement of data and compute

Exploits predictable characteristics of jobs

Page 10:

Corral architecture

[Figure: Corral architecture. Offline: future job estimates feed the offline planner, which produces placement hints. Online: the cluster scheduler applies the data placement policy as data is uploaded and the task placement policy as jobs are submitted.]

• Solves an offline planning problem
• Data placement: one data replica constrained to the job's assigned rack(s)
  – Other replicas spread randomly for fault tolerance
• Task placement: tasks assigned slots only in the job's assigned rack(s); ties broken using hints from the offline planner
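A minimal sketch of the data placement policy described above, assuming the offline planner supplies a list of assigned racks for the job (function and parameter names are hypothetical, not Corral's actual API):

```python
import random

def place_block(assigned_racks, all_racks, replication=3, rng=random):
    """Place one block's replicas: the first replica is constrained to
    the job's planner-assigned rack(s); the remaining replicas are
    spread over other racks for fault tolerance."""
    constrained = rng.choice(assigned_racks)          # replica 1: follow hint
    others = [r for r in all_racks if r != constrained]
    rest = rng.sample(others, replication - 1)        # replicas 2..k: random
    return [constrained] + rest

racks = list(range(7))                                # e.g., a 7-rack cluster
placement = place_block(assigned_racks=[2], all_racks=racks)
# placement[0] always lands in the assigned rack(s)
```

Keeping only one replica constrained means the job's tasks find local data in their assigned rack, while failures of that rack do not lose the block.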

Page 11:

Outline
• Motivation
• Architecture
• Planning problem
  – Formulation
  – Solution
• Evaluation
• Conclusion

Page 12:

Planning problem: Formulation

Given a set of jobs and their arrival times, find a schedule which meets their goals.

• Scenarios
  – Batch (minimize makespan)
  – Online (minimize average job time)

Focus on batch scenario in this talk.

Page 13:

Planning problem: Solution

• Provisioning phase: How many racks to allocate to each job?
• Prioritization phase: Which rack(s) to allocate to each job?
• Iterative search
  – Initialize: all jobs assigned one rack
  – Schedule on cluster
  – Iterate: increase #racks of the longest job
  – Select the schedule with minimum makespan

Page 14:

Planning problem: Solution

• Job latency for a given number of racks determined using latency-response curves (latency vs. #racks)
• Each iteration widens the longest job; select the schedule with minimum makespan:

  Iteration   Racks for A, B, C   Makespan
  Iter=1      1, 1, 1             300s
  Iter=2      2, 1, 1             250s
  Iter=3      3, 1, 1             225s

[Figure: cluster schedule (racks vs. time) for each iteration]

• Performs at most 3%-15% worse than optimal
• Schedule widest-job first; ties broken using longest-job first
  – Longest-job first alone can lead to wasted resources
• Planning assumptions: jobs are "rectangular", with exclusive use of racks
  – Work-conserving at runtime
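The planning loop from the last two slides can be sketched as below. This is a simplified toy reconstruction under stated assumptions: the latency-response curves are illustrative numbers, and the widest-first list scheduler stands in for the actual cluster schedule; it is not Corral's exact implementation.

```python
# Sketch of the offline planning loop: start every job at one rack,
# repeatedly widen the longest job, and keep the schedule with the
# minimum makespan. latency[j][r] is job j's latency on r racks.

def makespan(racks_per_job, latency, num_racks):
    """Greedy list schedule: widest job first, longest-job tiebreak;
    each job uses its racks exclusively ('rectangular' jobs)."""
    free_at = [0.0] * num_racks                     # per-rack free time
    order = sorted(racks_per_job,
                   key=lambda j: (-racks_per_job[j],
                                  -latency[j][racks_per_job[j]]))
    for j in order:
        r = racks_per_job[j]
        chosen = sorted(range(num_racks), key=lambda i: free_at[i])[:r]
        start = max(free_at[i] for i in chosen)     # wait for all r racks
        for i in chosen:
            free_at[i] = start + latency[j][r]
    return max(free_at)

def plan(jobs, latency, num_racks):
    racks = {j: 1 for j in jobs}                    # initialize: 1 rack each
    best_mk = makespan(racks, latency, num_racks)
    best = dict(racks)
    while True:
        # widen the job with the longest current latency
        longest = max(jobs, key=lambda j: latency[j][racks[j]])
        if racks[longest] >= num_racks:
            break
        racks[longest] += 1
        mk = makespan(racks, latency, num_racks)
        if mk < best_mk:
            best_mk, best = mk, dict(racks)
    return best, best_mk

curves = {'A': {1: 300, 2: 160, 3: 120},           # toy latency-response
          'B': {1: 100, 2: 60, 3: 50},             # curves, not real data
          'C': {1: 50, 2: 30, 3: 25}}
best_racks, best_mk = plan(['A', 'B', 'C'], curves, num_racks=3)
```

With these toy curves, widening the longest job (A) to two racks lowers the makespan, while widening it further forces B and C to wait for all three racks and makes things worse, so the loop keeps the two-rack schedule.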

Page 15:

Outline
• Motivation
• Architecture
• Planning problem
  – Formulation
  – Solution
• Evaluation
• Conclusion

Page 16:

Evaluation
• (Recurring) MapReduce workloads

• Mix of recurring and ad hoc jobs

• DAG-style workloads

• Sensitivity analysis

• Benefits with flow-level schedulers

Page 17:

Evaluation: Setup
• Implemented Corral in Yarn
  – Modified HDFS and Resource Manager
• Cluster
  – 210 machines, 7 racks
  – 10Gbps/machine, 5:1 oversubscription
  – Background traffic (~50% of cross-rack bandwidth)
• Baselines
  – Yarn-CS: Capacity scheduler in Yarn
  – ShuffleWatcher [ATC'14]: Schedules to minimize cross-rack data transferred

Page 18:

Evaluation: MapReduce workloads

Corral reduces makespan by 10-33%

  Workload                # of jobs
  Quantcast (W1)          200
  Yahoo (W2)              400
  Microsoft Cosmos (W3)   200

Reasons
• Improved network locality
• Reduced contention

[Figure: avg. reducer time (sec); reductions of ~40% and ~42%]

Page 19:

Evaluation: Mix of jobs
• 100 recurring jobs and 50 ad-hoc jobs from W1

Recurring jobs finish faster and free up resources

[Figure: completion times of recurring and ad-hoc jobs; improvements of ~27%, ~42%, ~10%, ~37%]

Page 20:

Corral summary
• Exploit predictable characteristics of data-parallel jobs
• Place data and compute together in a few racks
• Up to 33% (56%) reduction in makespan (avg. job time)
  – Provides orthogonal benefits to flow-level techniques