TRANSCRIPT
Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can
Virajith Jalaparti
Peter Bodik, Ishai Menache, Sriram Rao, Konstantin Makarychev, Matthew Caesar
2
Network scheduling important for data-parallel jobs
• Network-intensive stages (e.g., shuffle, join)
  – More than 50% of time spent in network transfers*
• Oversubscribed network from rack to core
  – Ratios between 3:1 and 10:1
• Cross-rack bandwidth shared across apps
  – Nearly 50% used for background transfers**
* Efficient Coflow Scheduling with Varys, SIGCOMM 2014.
** Leveraging Endpoint Flexibility in Data-Intensive Clusters, SIGCOMM 2013.
3
Several techniques proposed
• Focus on placing tasks (e.g., Tetris, Quincy)
• Focus on scheduling network flows (e.g., Varys, Baraat, D3)
[Figure: a MapReduce job, with map tasks (M) reading input and feeding reducers (R)]
Limitation: Assume fixed input data placement
4
Limitation of existing techniques
• Map input data spread randomly (HDFS)
• Hence, reduce input spread randomly
• Problems:
  – Use congested cross-rack links
  – Contention with other jobs
[Figure: map (M) and reduce (R) tasks of a MapReduce job spread across Racks 1-4; map outputs/reduce inputs cross rack boundaries]
5
Our proposal: Place input in a few racks
6
Our proposal: Place input in a few racks
All transfers stay within the rack: Rack-level locality
• Map input data placed in one rack
• Reduce input in the same rack
[Figure: all map (M) and reduce (R) tasks of the job placed in a single rack out of Racks 1-4]
7
Our proposal: Place input in a few racks
• High bandwidth between tasks
• Reduced contention across jobs
• Scenarios:
  – Recurring jobs (~40%) known ahead of time
  – Separate storage and compute clusters
Is placing data feasible? What are the benefits?
8
Challenges
• How many racks to assign a job?
  – One rack may be sufficient for 75-95% of jobs; 5-25% of jobs need more than one rack
  – Answer: offline planning using job properties
• How to avoid hotspots?
  – Answer: offline planning using job properties
• How to determine job characteristics?
  – Answer: use history; for recurring jobs, characteristics can be predicted with low error (~6.5% on average)
• What about jobs without history (ad hoc jobs)?
  – Answer: they benefit from the resources freed up by planned jobs
[Figure: input data size (log10 scale) of a recurring job over days 1-10]
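As a concrete (hypothetical) illustration of "use history": a minimal sketch that predicts a recurring job's input size from its past runs. The function name and the median estimator are assumptions for illustration, not Corral's actual predictor.

    import statistics

    def estimate_input_size(history_bytes):
        # history_bytes: input sizes (bytes) observed in past runs of the
        # same recurring job. When inputs are stable day to day (as in the
        # figure above), a robust summary such as the median predicts the
        # next run's input size with low error.
        return statistics.median(history_bytes)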
9
Our system: Corral
Coordinated placement of data and compute
Exploits predictable characteristics of jobs
10
Corral architecture
[Figure: Corral architecture. Offline: estimates of future jobs feed an offline planner, which produces placement hints. Online: when data is uploaded for submitted jobs, the cluster scheduler applies data and task placement policies using those hints.]
• Solves an offline planning problem
• Data placement: one data replica constrained to the racks chosen by the planner
  – Other replicas spread randomly for fault tolerance
• Task placement: tasks assigned slots only in the chosen racks; ties broken using the planner's job priorities
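A minimal sketch of the data placement policy just described, assuming the offline planner hands back a list of preferred racks; the names (place_replicas, planner_racks) are illustrative, not Corral's actual HDFS modification.

    import random

    def place_replicas(planner_racks, all_racks, num_replicas=3):
        # One replica is constrained to a rack chosen by the offline planner.
        primary = random.choice(planner_racks)
        # The remaining replicas are spread over random other racks,
        # preserving cross-rack fault tolerance.
        others = [r for r in all_racks if r != primary]
        backups = random.sample(others, num_replicas - 1)
        return [primary] + backups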
11
Outline
• Motivation
• Architecture
• Planning problem
  – Formulation
  – Solution
• Evaluation
• Conclusion
12
Planning problem: Formulation
• Scenarios:
  – Batch (minimize makespan)
  – Online (minimize average job time)
Given a set of jobs and their arrival times, find a schedule which meets their goals.
This talk focuses on the batch scenario.
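One standard way to write the two objectives (notation assumed here, not taken verbatim from the slides), with C_j the completion time and a_j the arrival time of job j, over n jobs:

    \text{Batch (makespan):} \quad \min \max_j C_j
    \text{Online (average job time):} \quad \min \frac{1}{n} \sum_{j=1}^{n} (C_j - a_j)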
13
Planning problem: Solution
• Provisioning phase: how many racks to allocate to each job?
  – Initialize: all jobs assigned one rack
  – Schedule on cluster
  – Increase #racks of the longest job; iterate
  – Select the schedule with minimum makespan
• Prioritization phase: which rack(s) to allocate to each job?
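As a reading aid, here is a minimal sketch of the iterative provisioning loop just described. The helpers latency(job, racks) (the latency-response curve on the next slide) and schedule(allocation) (a cluster-level scheduler returning a plan with a makespan) are assumed placeholders, not Corral's real interfaces.

    def provision(jobs, total_racks, latency, schedule):
        # Initialize: every job gets one rack.
        racks = {job: 1 for job in jobs}
        best_plan, best_makespan = None, float("inf")
        while True:
            plan = schedule(racks)  # place jobs on racks, compute makespan
            if plan.makespan < best_makespan:
                best_plan, best_makespan = plan, plan.makespan
            # Widen the job that currently finishes last.
            longest = max(jobs, key=lambda j: latency(j, racks[j]))
            if racks[longest] >= total_racks:
                break  # cannot widen further
            racks[longest] += 1
        # Return the schedule with minimum makespan across iterations.
        return best_plan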
14
Planning problem: Solution
Select schedule with minimum makespan
[Figure: example cluster schedules (racks × time) for three iterations of the provisioning loop, with per-job latencies and the resulting makespans]
Job latency determined using latency-response curves
[Figure: latency-response curve — job latency vs. # racks]
Performs at most 3%-15% worse than optimal
#racks per job (A, B, C) and makespan per iteration:
Iter=1: A=1, B=1, C=1 → 300s
Iter=2: A=2, B=1, C=1 → 250s
Iter=3: A=3, B=1, C=1 → 225s
• Schedule widest-job first; ties broken using longest-job first
  – Longest-job first alone can lead to wasted resources
• Planning assumptions: jobs are "rectangular" and get exclusive use of racks
• Work-conserving at runtime
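A small sketch of the prioritization rule above (widest-job first, ties broken by longest-job first). The Job fields and example values are illustrative:

    from dataclasses import dataclass

    @dataclass
    class Job:
        name: str
        racks: int      # racks assigned by the provisioning phase
        length: float   # estimated runtime on that allocation (seconds)

    jobs = [Job("A", 3, 225.0), Job("B", 1, 50.0), Job("C", 1, 100.0)]
    # Widest first; among equally wide jobs, longest first.
    order = sorted(jobs, key=lambda j: (-j.racks, -j.length))
    # -> A (widest), then C, then B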
15
Outline
• Motivation
• Architecture
• Planning problem
  – Formulation
  – Solution
• Evaluation
• Conclusion
16
Evaluation
• (Recurring) MapReduce workloads
• Mix of recurring and ad hoc jobs
• DAG-style workloads
• Sensitivity analysis
• Benefits with flow-level schedulers
17
Evaluation: Setup
• Implemented Corral in YARN
  – Modified HDFS and the Resource Manager
• Cluster
  – 210 machines, 7 racks
  – 10 Gbps/machine, 5:1 oversubscription
  – Background traffic (~50% of cross-rack bandwidth)
• Baselines
  – Yarn-CS: capacity scheduler in YARN
  – ShuffleWatcher [ATC'14]: schedules to minimize cross-rack data transferred
18
Evaluation: MapReduce workloads
Corral reduces makespan by 10-33%
Workload               # of jobs
Quantcast (W1)         200
Yahoo (W2)             400
Microsoft Cosmos (W3)  200
Reasons: improved network locality and reduced contention
[Figure: avg. reducer time (sec) across workloads, with callouts of ~40% and ~42% improvement]
19
Evaluation: Mix of jobs
• 100 recurring jobs and 50 ad hoc jobs from W1
• Recurring jobs finish faster and free up resources for ad hoc jobs
[Figure: improvements for recurring and ad hoc jobs; callouts of ~27%, ~42%, ~10%, and ~37%]
20
Corral summary
• Exploits predictable characteristics of data-parallel jobs
• Places data and compute together in a few racks
• Up to 33% (56%) reduction in makespan (avg. job time)
  – Provides benefits orthogonal to flow-level techniques