NATJAM: SUPPORTING DEADLINES AND PRIORITIES IN A MAPREDUCE CLUSTER
Brian Cho (Samsung/Illinois), Muntasir Rahman, Tej Chajed, Indranil Gupta,
Cristina Abad, Nathan Roberts (Yahoo! Inc.), Philbert Lin
University of Illinois (Urbana-Champaign)
Distributed Protocols Research Group (DPRG): http://dprg.cs.uiuc.edu
Hadoop Jobs have Priorities
• Dual Priority Case
  – Production jobs (high priority)
    • Time sensitive
    • Directly affect criticality or revenue
  – Research jobs (low priority)
    • e.g., long-term analysis
• Example: Ad provider
  – Production pipeline: ad click-through logs → count clicks → update ads
    (slow counts → show old ads → don't get paid $$$)
  – Research pipeline: run machine learning analysis on daily and historical logs
    ("Is there a better way to place ads?")
⇒ Prioritize production jobs
State-of-the-art: Separate Clusters
• Production cluster receives production jobs (high priority)
• Research cluster receives research jobs (low priority)
• Traces reveal long periods of under-utilization in each cluster
  – Long job completion times
  – Human involvement in job management
• Goal: a single consolidated cluster for all priorities and deadlines
  – Prioritize production jobs, yet affect research jobs the least
• Today's options:
  – Wait for research tasks to finish (e.g., Capacity Scheduler) → prolongs production jobs
  – Kill research tasks (e.g., Fair Scheduler) → repeated work → prolongs research jobs
Natjam's Techniques
1. Scale down research jobs by
   – Preempting some Reduce tasks
   – Fast, on-demand, automated checkpointing of task state
   – Later, Reduces can resume where they left off
• Focus on Reduces: Reduce tasks take longer, so there is more work to lose (median Map 19 seconds vs. Reduce 231 seconds [Facebook])
2. Job eviction policies
3. Task eviction policies
Natjam Built into the Hadoop YARN Architecture
• Preemptor (at the Resource Manager's Capacity Scheduler)
  – Chooses victim job
  – Reclaims queue resources
• Releaser (at each Application Master)
  – Chooses victim task
• Local Suspender (at each task)
  – Saves state of victim task
[Architecture diagram: the Preemptor at the Resource Manager sends preempt() with the number of containers to release to Application Master 2; its Releaser calls release() to suspend a victim App2 task; the Local Suspender saves the task's state and frees the container; Application Master 1 asks for the freed container for an App1 task, and the suspended task later resume()s. The flow is sketched below.]
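To make the message flow concrete, here is a minimal sketch in Java. All class and method names (Preemptor, Releaser, preempt, release, suspendAndCheckpoint) are hypothetical stand-ins for YARN/Natjam internals; it mirrors the diagram's flow, not the actual implementation:

    import java.util.List;

    // Hypothetical interfaces standing in for YARN/Natjam internals.
    interface Task { void suspendAndCheckpoint(); }        // Local Suspender lives here
    interface ApplicationMaster { void release(int n); }   // hosts the Releaser
    interface Job { ApplicationMaster am(); int containersHeld(); }

    final class Preemptor {  // runs at the Resource Manager's Capacity Scheduler
        // Invoked when a production container request finds the cluster full.
        void preempt(List<Job> researchJobs, int containersNeeded) {
            Job victim = chooseVictimJob(researchJobs); // job eviction policy (later slide)
            victim.am().release(containersNeeded);      // "# containers to release"
        }
        Job chooseVictimJob(List<Job> jobs) { return jobs.get(0); } // placeholder policy
    }

    final class Releaser {  // runs inside the victim job's Application Master
        void release(List<Task> runningReduces, int n) {
            for (int i = 0; i < Math.min(n, runningReduces.size()); i++) {
                Task victim = runningReduces.get(i);    // task eviction policy (later slide)
                victim.suspendAndCheckpoint();          // save state, free the container
            }
        }
    }

The freed container then goes to the waiting production Application Master; the suspended reduce resumes later via resume().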
Suspending and Resuming Tasks
• Existing intermediate data is reused
  – Reduce inputs, stored at the local host
  – Reduce outputs, stored on HDFS
• Suspended task state is saved locally, so resume can avoid network overhead
• Checkpoint state saved (sketched below):
  – Key counter
  – Reduce input path
  – Hostname
  – List of suspended task attempt IDs
[Diagram: the suspended Task Attempt 1 reads local Inputs, tracks a KeyCounter, and writes to tmp/task_att_1 on HDFS; its container is freed and suspend state saved. The resumed Task Attempt 2 re-reads the Inputs, skips keys up to the KeyCounter, writes to tmp/task_att_2, and the final output lands in outdir/.]
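The checkpoint can be tiny because the heavy intermediate data already sits on the local disk and HDFS. A minimal sketch of the saved record, with illustrative field names mirroring the list above (not Natjam's actual class):

    import java.util.List;

    // Illustrative checkpoint record; fields mirror the slide's list.
    final class SuspendCheckpoint {
        final long keyCounter;                  // reduce input keys already processed
        final String reduceInputPath;           // local path to the materialized reduce inputs
        final String hostname;                  // host holding the inputs, so resume lands there
        final List<String> suspendedAttemptIds; // prior attempts whose HDFS output is reused

        SuspendCheckpoint(long keyCounter, String reduceInputPath,
                          String hostname, List<String> suspendedAttemptIds) {
            this.keyCounter = keyCounter;
            this.reduceInputPath = reduceInputPath;
            this.hostname = hostname;
            this.suspendedAttemptIds = suspendedAttemptIds;
        }
    }

On resume, the new attempt is scheduled on hostname, re-opens reduceInputPath, skips the first keyCounter keys (their output already exists under the earlier attempt's tmp directory on HDFS), and continues from there.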
Two-level Eviction Policies
• On a container request in a full cluster:
  1. Job eviction, at the Preemptor (Resource Manager)
  2. Task eviction, at the Releaser (Application Master)
[Architecture diagram as on the previous slide: preempt() from the Preemptor carries the number of containers to release; the Releaser's release() suspends victim App2 tasks via the Local Suspender.]
Job Eviction Policies
• Based on the total amount of resources (e.g., containers) held by the victim job (known at the Resource Manager); a sketch of all three follows
1. Least Resources (LR)
   + Large research jobs unaffected
   – Starvation for small research jobs (e.g., under repeated production arrivals)
2. Most Resources (MR)
   + Small research jobs unaffected
   – Starvation for the largest research job
3. Probabilistically-weighted on Resources (PR)
   + Weighs jobs by number of containers: treats all tasks the same, across jobs
   – Affects multiple research jobs
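As a concrete reading of the three policies, here is a minimal Java sketch. Names are illustrative; Job is the hypothetical interface from the earlier sketch, exposing containersHeld():

    import java.util.Comparator;
    import java.util.List;
    import java.util.Random;

    final class JobEvictionPolicies {
        // LR: evict the research job holding the fewest containers.
        static Job leastResources(List<Job> jobs) {
            return jobs.stream().min(Comparator.comparingInt(Job::containersHeld)).orElseThrow();
        }
        // MR: evict the research job holding the most containers.
        static Job mostResources(List<Job> jobs) {
            return jobs.stream().max(Comparator.comparingInt(Job::containersHeld)).orElseThrow();
        }
        // PR: pick a job with probability proportional to containers held,
        // i.e., every running container is an equally likely eviction target.
        static Job probabilisticallyWeighted(List<Job> jobs, Random rng) {
            int total = jobs.stream().mapToInt(Job::containersHeld).sum();
            int pick = rng.nextInt(total);
            for (Job j : jobs) {
                pick -= j.containersHeld();
                if (pick < 0) return j;
            }
            return jobs.get(jobs.size() - 1); // unreachable if counts are consistent
        }
    }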
Task Eviction Policies
• Based on time remaining (known at the Application Master); see the sketch after this list
1. Shortest Remaining Time (SRT)
   + Leaves the tail of the research job alone
   – Holds on to containers that would have been released soon anyway
2. Longest Remaining Time (LRT)
   – May lengthen the tail
   + Releases more containers earlier
• However: SRT is provably optimal under some conditions
  – Counter-intuitive: SRT = longest-job-first scheduling
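A sketch of the two policies at the Application Master, assuming a hypothetical remainingTimeMs() estimate derived from task progress:

    import java.util.Comparator;
    import java.util.List;

    final class TaskEvictionPolicies {
        interface ReduceTask { long remainingTimeMs(); } // illustrative progress estimate

        // SRT: suspend the reduce closest to finishing; the long tail keeps
        // running, which is why this amounts to longest-job-first scheduling.
        static ReduceTask shortestRemainingTime(List<ReduceTask> reduces) {
            return reduces.stream()
                          .min(Comparator.comparingLong(ReduceTask::remainingTimeMs))
                          .orElseThrow();
        }
        // LRT: suspend the reduce furthest from finishing, releasing the
        // most outstanding work (and its container) as early as possible.
        static ReduceTask longestRemainingTime(List<ReduceTask> reduces) {
            return reduces.stream()
                          .max(Comparator.comparingLong(ReduceTask::remainingTimeMs))
                          .orElseThrow();
        }
    }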
Eviction Policies in Practice
• Task eviction
  – SRT 20% faster than LRT for research jobs
  – Production job completion similar under SRT vs. LRT
  – Theorem: when research tasks resume simultaneously, SRT results in the shortest job completion time
• Job eviction
  – MR best; PR very close behind
  – LR 14%-23% worse than MR
• MR + SRT is the best combination
Natjam-R: Multiple Priorities
• Special case of priorities: jobs with real-time deadlines
• Best-effort only (no admission control)
• The Resource Manager keeps a single queue of jobs sorted by increasing priority (derived from deadline)
  – Periodically scans the queue: evicts a later job to give resources to an earlier waiting job
• Job eviction policies (sketched below):
  1. Maximum Deadline First (MDF): priority = deadline
     + Prefers short-deadline jobs
     – May miss deadlines, e.g., schedules a large job instead of a small job with a slightly larger deadline
  2. Maximum Laxity First (MLF): priority = laxity = deadline minus the job's projected completion time
     + Pays attention to the job's resource requirements
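The two priorities differ only in whether job size enters the formula. A minimal sketch, with illustrative names and deadlines/projected completions as absolute timestamps:

    import java.util.Comparator;
    import java.util.List;

    final class NatjamR {
        interface DeadlineJob { long deadlineMs(); long projectedCompletionMs(); } // illustrative

        // MDF: evict the queued job with the latest deadline.
        static DeadlineJob mdfVictim(List<DeadlineJob> jobs) {
            return jobs.stream()
                       .max(Comparator.comparingLong(DeadlineJob::deadlineMs))
                       .orElseThrow();
        }
        // MLF: evict the job with the most laxity (deadline minus projected
        // completion), so a large job with a slightly later deadline is protected.
        static DeadlineJob mlfVictim(List<DeadlineJob> jobs) {
            return jobs.stream()
                       .max(Comparator.comparingLong(j -> j.deadlineMs() - j.projectedCompletionMs()))
                       .orElseThrow();
        }
    }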
MDF vs. MLF in Practice
[Two progress-vs-time plots (Map and Reduce progress of Jobs 1-3 over 0-300 s, against their deadlines): MLF moves all jobs forward in lockstep and misses all deadlines; MDF prefers short-deadline jobs.]
• 8-node cluster
• Yahoo! trace experiments in the paper
Natjam vs. Alternatives
Microbenchmark (7-node cluster, initially empty):
• t=0 s: Research-XL job submitted (100% of cluster)
• t=50 s: Production-S job submitted (25% of cluster)
[Bar chart of average execution time (seconds) for the Research-XL and Production-S jobs under Ideal, Capacity Scheduler (hard cap), Capacity Scheduler (soft cap), Killing, and Natjam. Annotations: the alternatives run 50%, 90%, and 20% worse than ideal; Natjam is 2% worse than ideal and 15% better than Killing, and 7% worse than ideal and 40% better than soft cap.]
Large Experiments
• 250 nodes @ Yahoo!, driven by Yahoo! traces
• Natjam vs. waiting for research tasks (Hadoop Capacity Scheduler, soft cap)
  – Production jobs: 53% benefit, 97% delayed < 5 s
  – Research jobs: 63% benefit, very few outliers (low starvation)
• Natjam vs. killing research tasks
  – Production jobs: largely unaffected
  – Research jobs:
    • 38% finish more than 100 s faster
    • 5th percentile more than 750 s faster
    • Biggest improvement: 1880 s
    • Negligible starvation
Related Work
• Single-cluster job scheduling has focused on:
  – Locality of Map tasks [Quincy, Delay Scheduling]
  – Speculative execution [LATE Scheduler]
  – Average fairness between queues [Capacity Scheduler, Fair Scheduler]
  – Recent work: elastic queues, but built on Sailfish, which needs a special intermediate file system and does not work with Hadoop [Amoeba]
  – MAPREDUCE-5269 JIRA: preemption in Hadoop
Takeaways
• Natjam supports dual priority and arbitrary priorities (derived from deadlines)
• SRT (Shortest Remaining Time): best policy for task eviction
• MR (Most Resources): best policy for job eviction
• MDF (Maximum Deadline First): best policy for job eviction in Natjam-R
• 2-7% overhead for the dual-priority case
• Please see our poster + demo video later today!
Backup slides
Contributions
• Our system, Natjam, allows us to:
  – Maintain one cluster
  – With a production queue and a research queue
  – Prioritize production jobs and complete them quickly
  – While affecting research jobs the least
  – (Later: extend to multiple priorities)
Hadoop 0.23's Capacity Scheduler
• Limitation: research jobs cannot scale down
• Hadoop capacity is shared using queues
  – Guaranteed capacity (G)
  – Maximum capacity (M)
• Example (the arithmetic is worked through in the sketch below)
  – Production (P) queue: G 80% / M 80%
  – Research (R) queue: G 20% / M 40%
1. Production job submitted first: P takes 80% (under-utilization); R can only grow to 40%
2. Research job submitted first: R takes 40% (under-utilization); P cannot grow beyond 60%
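A worked reading of the slide's numbers, as a small Java sketch. The helper grantablePercent is illustrative; the real limits are configured per queue in capacity-scheduler.xml. A queue may grow past its guarantee G up to its maximum M, but only into capacity nobody else holds, and running jobs are never scaled down:

    final class CapacityExample {
        // Capacity a queue can be granted: its maximum, limited to what others
        // are not already holding (no scale-down of running jobs).
        static int grantablePercent(int maxPercent, int heldByOthersPercent) {
            return Math.min(maxPercent, 100 - heldByOthersPercent);
        }
        public static void main(String[] args) {
            // Production first: P gets min(80, 100-0) = 80%. Even after P
            // finishes, R is capped at min(40, 100-0) = 40%, so 60% idles.
            System.out.println(grantablePercent(80, 0));  // 80
            System.out.println(grantablePercent(40, 0));  // 40
            // Research first: R grows to its max, 40%. P then gets
            // min(80, 100-40) = 60%, short of its 80% guarantee.
            System.out.println(grantablePercent(80, 40)); // 60
        }
    }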
Natjam Scheduler
• Does not require a maximum capacity
• Scales down research jobs by
  – Preempting Reduce tasks
  – Fast, on-demand, automated checkpointing of task state
  – Resuming where they left off
• Focus on Reduces: Reduce tasks take longer, so there is more work to lose (median Map 19 seconds vs. Reduce 231 seconds [Facebook])
[Timelines:
1. P/R guaranteed 80%/20%: R alone takes 100%; when a production job arrives, Natjam scales R down and P takes 80%.
2. P/R guaranteed 100%/0%: R alone still takes 100%; on arrival, P takes the full 100%, fully prioritizing production jobs.]
Yahoo! Hadoop Traces: CDF of Differences (negative is good)
[Four CDF plots of the difference in job completion time (seconds) for production and research jobs: Natjam minus Killing and Natjam minus Soft Cap, on a 7-node cluster and on a 250-node Yahoo! cluster. Only two starved jobs (260 s and 390 s); largest benefit 1880 s.]