TRANSCRIPT
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Ion Stoica, UC Berkeley
Joint work with: D. Borthakur, K. Elmeleegy, A. Ghodsi, B. Hindman, A. Joseph, R. Katz, A. Konwinski, J. Sen Sarma,
S. Shenker, M. Zaharia
Challenge
Rapid innovation in cloud computing
No single framework is optimal for all applications (e.g., Dryad, Pregel, Cassandra, Hypertable)
Running each framework on its own dedicated cluster is expensive and makes it hard to share data
Need to run multiple frameworks on the same cluster
Computation Model
Each framework (e.g., Hadoop) manages and runs one or more jobs
A job consists of one or more tasks
A task (e.g., map, reduce) consists of one or more processes running on the same machine
Today, each framework assumes it controls the cluster and implements its own sophisticated scheduler, e.g., Quincy, LATE, Fair Sharing, Maui, …
Our Solution: Mesos
"Operating system" for very large clusters (tens of thousands of nodes)
Provides fine-grained sharing of resources across frameworks, an execution environment for tasks, and isolation
Focus of this talk: resource management and scheduling
Approach: Global Scheduler
Advantages: can achieve an optimal schedule
Disadvantages: complexity makes it hard to scale and to ensure resilience; hard to anticipate future frameworks' requirements; need to refactor existing frameworks
[Diagram: a global scheduler takes organization policies, resource availability, estimated job/task running times, and framework/job requirements as inputs, and produces a task schedule.]
Our Approach: Distributed Scheduler
Advantages: simple, so easier to scale and make resilient; easy to port existing frameworks and to support new ones
Disadvantages: distributed scheduling decisions are not optimal
[Diagram: the master (global scheduler) takes organization policies and resource availability as inputs and sends framework schedules to per-framework schedulers, which produce the task schedules.]
Resource Offers
Resource offer: a set of available resources on a node, e.g., m1: <1 CPU, 1 GB>, m2: <4 CPUs, 16 GB>
Master sends resource offers to frameworks
Frameworks select which offers to accept and which tasks to run
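To make the offer/accept split concrete, here is a minimal Python sketch of a framework packing its pending tasks into a single offer. The Offer class, accept_offer function, and task sizes are illustrative assumptions, not the actual Mesos API.

# Illustrative sketch of the resource-offer model (not the real Mesos API).
# An offer lists the free resources on one node; the framework decides which
# of its pending tasks fit and launches only those, declining the rest.

from dataclasses import dataclass

@dataclass
class Offer:
    node: str
    cpus: float
    mem_gb: float

def accept_offer(offer, pending_tasks):
    """Greedily pack pending tasks (each a <cpus, mem_gb> demand) into one offer."""
    launched, free_cpus, free_mem = [], offer.cpus, offer.mem_gb
    for task in pending_tasks:
        if task["cpus"] <= free_cpus and task["mem_gb"] <= free_mem:
            launched.append(task)
            free_cpus -= task["cpus"]
            free_mem -= task["mem_gb"]
    return launched

# The offers from the slide: m1: <1 CPU, 1 GB>, m2: <4 CPUs, 16 GB>.
offers = [Offer("m1", cpus=1, mem_gb=1), Offer("m2", cpus=4, mem_gb=16)]
tasks = [{"name": f"map-{i}", "cpus": 1, "mem_gb": 2} for i in range(6)]
for offer in offers:
    launched = accept_offer(offer, tasks)
    tasks = [t for t in tasks if t not in launched]
    print(offer.node, "->", [t["name"] for t in launched])

Here m1 is effectively declined (its 1 GB is too small for any task) and four tasks are packed into m2; the remaining tasks wait for later offers.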
Mesos Architecture
[Diagram: the Mesos master contains a pluggable allocation module; framework schedulers (e.g., the Hadoop JobTracker and an MPI scheduler) register with the master; slaves run framework executors (Hadoop and MPI executors) that launch tasks; resource offers and status updates flow between master, frameworks, and slaves.]
Slaves continuously send status updates about their resources
A pluggable scheduler (allocation module) picks which framework to send an offer to
The framework scheduler selects resources and provides tasks
Framework executors launch tasks and may persist across tasks
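The callouts above describe a simple control loop: slaves report resources, the allocation module picks a framework, and the framework turns the offer into tasks. Below is a hypothetical toy simulation of that loop; the class and method names (Framework, Master, offer_cycle, ...) are invented for illustration and are not the actual Mesos implementation.

# Toy simulation of the Mesos control loop described above (invented names;
# not the actual Mesos master/allocator code).

class Framework:
    """Stand-in for a framework scheduler (e.g., the Hadoop JobTracker)."""
    def __init__(self, name, pending_tasks):
        self.name = name
        self.pending = pending_tasks          # each task needs 1 CPU in this toy model

    def resource_offer(self, node, cpus):
        # Select as many pending tasks as fit into the offered resources.
        taken, self.pending = self.pending[:cpus], self.pending[cpus:]
        return taken

class Master:
    """Master with a pluggable allocation policy (here: simple round-robin)."""
    def __init__(self, frameworks):
        self.frameworks = list(frameworks)
        self.free_cpus = {}                   # node -> free CPUs, from slave status updates

    def status_update(self, node, cpus):
        self.free_cpus[node] = cpus           # slaves continuously report free resources

    def offer_cycle(self):
        for node, cpus in self.free_cpus.items():
            framework = self.frameworks.pop(0)        # allocation module picks a framework
            self.frameworks.append(framework)
            for task in framework.resource_offer(node, cpus):
                print(f"launch {task} on {node} for {framework.name}")

# Two slaves report their resources, then the master runs one offer cycle.
master = Master([Framework("hadoop", ["map-1", "map-2", "reduce-1"]),
                 Framework("mpi", ["rank-0", "rank-1"])])
master.status_update("slave-1", cpus=2)
master.status_update("slave-2", cpus=2)
master.offer_cycle()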
Outline
Dominant Resource Fairness
Delay scheduling – dealing with constraints
Problem
Different types of resources
Frameworks (users) prioritize resources differently, e.g., I/O-bound vs. CPU-bound jobs
How do we allocate resources fairly?
Model
Demand vectors express users' requests, e.g., D1 = <2, 3, 1>: U1's tasks each need 2 units of R1, 3 units of R2, and 1 unit of R3
Users get resources in multiples of their demand vector: resource offer = k × demand vector, k > 0
Note: demand vectors are not needed in practice; the actual allocations are used for scheduling decisions
Problem Example
30 CPUs and 30 GB RAM
U1 needs <1 CPU, 1 GB RAM> per task; U2 needs <1 CPU, 3 GB RAM> per task
How many tasks should each user run?
Strawman: Asset Fairness
Equalize the sum of resources allocated to each user
30 CPUs and 30 GB RAM; U1 needs <1 CPU, 1 GB RAM> per task; U2 needs <1 CPU, 3 GB RAM> per task
Allocation: U1 gets 12 tasks (12 CPUs, 12 GB, ∑ = 24); U2 gets 6 tasks (6 CPUs, 18 GB, ∑ = 24)
[Chart: per-user shares of CPU and RAM under asset fairness.]
Problem: User 1 gets only 40% of each resource, i.e., less than 50% of either!
Share Guarantee Property
No incentive to share: User 1 would be better off just taking half of the cluster!
Share guarantee property: every user should get 1/n of at least one resource
"You shouldn't be worse off than if you ran your own cluster with 1/n of the resources"
Dominant Resource Fairness
Dominant resource: the resource the user has the largest share of
E.g., with total capacity <10, 4> and demand vector <2, 1>, the shares are 2/10 and 1/4, so the dominant resource is R2
Max-min fairness across dominant resources: with infinite demands, equalize the users' dominant resource shares
Satisfies the share guarantee property
DRF Example
30 CPUs and 30 GB RAM; U1 needs <2 CPUs, 1 GB RAM> per task; U2 needs <1 CPU, 4 GB RAM> per task
Allocation: U1 gets 10 tasks (20 CPUs = 67%, 10 GB); U2 gets 5 tasks (5 CPUs, 20 GB = 67%)
[Chart: per-user shares of CPU and RAM under DRF; both users reach a 67% dominant share.]
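The DRF policy described above can be sketched as progressive filling: repeatedly give one task's worth of resources to the user with the lowest dominant share. The drf_allocate function below is a minimal sketch of that rule, not the Mesos allocation module, and it reproduces the allocation in this example.

# Minimal sketch of DRF as progressive filling: keep launching a task for the
# user with the lowest dominant share until no user's next task fits.

def drf_allocate(capacity, demands):
    """capacity: [CPUs, GB]; demands: {user: per-task [CPUs, GB]} -> tasks per user."""
    consumed = [0.0] * len(capacity)
    alloc = {u: [0.0] * len(capacity) for u in demands}
    tasks = {u: 0 for u in demands}

    def dominant_share(user):
        return max(a / c for a, c in zip(alloc[user], capacity))

    while True:
        # Consider users from lowest to highest dominant share and launch a
        # task for the first one whose demand still fits in the free capacity.
        for user in sorted(demands, key=dominant_share):
            demand = demands[user]
            if all(consumed[i] + demand[i] <= capacity[i] for i in range(len(capacity))):
                for i in range(len(capacity)):
                    consumed[i] += demand[i]
                    alloc[user][i] += demand[i]
                tasks[user] += 1
                break
        else:
            return tasks          # no user's next task fits: allocation is done

# The example above: 30 CPUs, 30 GB; U1 needs <2 CPUs, 1 GB>, U2 needs <1 CPU, 4 GB>.
print(drf_allocate([30, 30], {"U1": [2, 1], "U2": [1, 4]}))
# -> {'U1': 10, 'U2': 5}: both users reach a 67% dominant share, as in the slide.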
Other Desirable Properties
Strategy-proofness, Pareto efficiency, envy-freeness, single-resource fairness, bottleneck fairness, population monotonicity, resource monotonicity
Related Work
Resource allocation is a very popular topic in micro-economic theory
Competitive Equilibrium from Equal Incomes (CEEI) [Varian74]: give each user 1/n of every resource, then let users trade in a perfectly competitive market until the allocation is Pareto efficient
CEEI is economists' main method of fair allocation [Young94]
DRF vs. CEEI
U1 = <16, 1>, U2 = <1, 2>
[Charts: per-user shares of CPU and RAM under DRF and under CEEI.]
DRF is more fair; CEEI gives better utilization
Strategy Proofness
A framework cannot increase its share by lying
DRF is strategy-proof: the allocation is determined by the dominant share, so asking for more of a non-dominant resource can only hurt
CEEI is not strategy-proof
In general, schedulers that maximize utilization can be gamed: a user whose claimed demand matches the available resources gets picked
Strategy Proof: DRF vs. CEEI (1/2)
U1 = <16, 1>, U2 = <1, 2>
[Charts: per-user shares of CPU and RAM under DRF and under CEEI when U1 reports its true demand.]
Strategy Proof: DRF vs. CEEI (2/2)
U1 = <16, 4>, U2 = <1, 2>
With CEEI, a user can increase its dominant share by lying about its demand
[Charts: per-user shares of CPU and RAM under DRF and under CEEI when U1 inflates its demand to <16, 4>.]
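The DRF half of this comparison can be checked with the drf_allocate sketch from the DRF example above. The 100 CPU / 100 GB cluster size below is an illustrative assumption; the slides only show shares, not totals.

# Reusing the drf_allocate sketch from the DRF example slide.
capacity = [100, 100]      # 100 CPUs, 100 GB RAM: assumed for illustration

truthful = drf_allocate(capacity, {"U1": [16, 1], "U2": [1, 2]})
lying    = drf_allocate(capacity, {"U1": [16, 4], "U2": [1, 2]})   # U1 inflates its RAM demand
print(truthful, lying)

U1 receives the same number of tasks in both runs: its DRF allocation is pinned by its dominant (CPU) share, so inflating the non-dominant RAM demand cannot help, whereas under CEEI the same lie increases U1's share.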
Property Summary
Several other schedulers were evaluated as well…

Property                   Asset   CEEI   DRF
Share guarantee                     ✔      ✔
Strategy proofness                         ✔
Pareto efficiency            ✔      ✔      ✔
Envy-freeness                ✔      ✔      ✔
Single resource fairness     ✔      ✔      ✔
Bottleneck res. fairness            ✔      ✔
Population monotonicity      ✔             ✔
Resource monotonicity
Dynamic Resource Sharing with DRF
[Plot: share of the cluster (0–100%) over time (roughly 0–330 s) for three frameworks, MPI, Hadoop, and Spark, dynamically sharing a cluster under DRF.]
Outline
Dominant Resource Fairness
Delay scheduling – dealing with constraints
Fair Allocation Hurts Locality (1/2)
[Diagram: blocks of File 1 and File 2 spread across six slaves; the master offers slots to Fwk 1 and Fwk 2 in fair-sharing order, and some tasks end up on slaves that do not hold their input blocks.]
Fair Allocation Hurts Locality (2/2)
[Diagram: a second scheduling-order example on the same cluster; again, tasks from both frameworks are forced onto slaves that do not hold their input blocks.]
Delay Scheduling
Relax scheduling: frameworks wait for a limited time if they cannot launch local tasks
Result: Waiting a few seconds is enough to get nearly 100% locality
Delay Scheduling Example
[Diagram: the same cluster, now scheduling Job 1 and Job 2 with delay scheduling; a job that cannot launch a data-local task waits ("Wait!") instead of taking a non-local slot, and tasks end up on slaves that hold their input blocks.]
Idea: wait a short time to get data-local scheduling opportunities
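A minimal sketch of this waiting rule, assuming a scheduler that offers slots to the head-of-queue job: the job declines non-local offers while it has waited less than a threshold, then falls back to a non-local slot. Names such as wait_threshold and the job/task dictionaries are illustrative, not the actual Hadoop fair-scheduler code.

import time

def delay_schedule(job, node, wait_threshold=5.0):
    """Return the task to launch on `node`, or None to decline this offer."""
    local = [t for t in job["pending"] if node in t["preferred_nodes"]]
    if local:
        job["waiting_since"] = None              # found a data-local slot: reset the clock
        task = local[0]
    elif job.get("waiting_since") is None:
        job["waiting_since"] = time.time()       # start waiting for a local slot
        return None
    elif time.time() - job["waiting_since"] < wait_threshold:
        return None                              # keep waiting a little longer
    else:
        task = job["pending"][0]                 # waited long enough: run non-locally
    job["pending"].remove(task)
    return task

# Example: a job whose only pending task prefers slave-3 declines an offer on slave-1.
job = {"pending": [{"name": "map-1", "preferred_nodes": {"slave-3"}}]}
print(delay_schedule(job, "slave-1"))    # None: start waiting for a local offer
print(delay_schedule(job, "slave-3"))    # launches map-1 data-locally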
How Long to Wait?
Model: Poisson arrival of data-local offers
Expected response time gain when waiting up to time w:
G(w) = (1 – e^(–w/t)) × (d – t)
where t is the expected time to get a local offer and d is the delay from running non-locally
Typical values: t ≈ (avg. task length) / ((tasks per node) × (file replication)); d ≥ (avg. length of a local task)
Since d > t, G(w) increases with w, so in this model the optimal wait time is infinite
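Plugging illustrative numbers into G(w) shows why a few seconds of waiting already captures most of the gain. The values of t and d below are assumptions for illustration, not measurements from the talk.

import math

t, d = 2.0, 20.0     # assumed: t = expected time to a local offer, d = non-local penalty (seconds)

for w in [0, 1, 2, 5, 10, float("inf")]:
    gain = (1 - math.exp(-w / t)) * (d - t)     # G(w) from the slide
    print(f"wait {w:>4} s -> expected gain {gain:5.1f} s")

With these numbers the gain is about 92% of its limit at w = 5 s and over 99% at w = 10 s, even though G(w) keeps increasing forever; that is why waiting just a few seconds suffices in practice.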
Locality and Execution Time
Running 16 Hadoop instances
[Charts: data locality and job execution time for four configurations: static partitioning, Mesos with no delay scheduling, Mesos with delay scheduling and a 1-second wait, and Mesos with delay scheduling and a 5-second wait.]
Summary
Challenge: allow frameworks to share a cluster
Mesos: an "OS" for sharing clusters
Supports existing frameworks
Fault-tolerant and scales to 10,000s of machines
Delay scheduling is part of the Hadoop distribution
Ongoing deployments at Twitter and Facebook
Open source: github.com/mesos
Research challenges: fairness for constrained and multiple resource types; distributed vs. centralized schedulers