TRANSCRIPT
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Ion Stoica, UC Berkeley
Joint work with: D. Borthakur, K. Elmeleegy, A. Ghodsi, B. Hindman, A. Joseph, R. Katz, A. Konwinski, J. Sen Sarma,
S. Shenker, M. Zaharia
Challenge
Rapid innovation in cloud computing
No single framework is optimal for all applications (e.g., Dryad, Pregel, Cassandra, Hypertable)
Running each framework on its own dedicated cluster is expensive and makes it hard to share data
Need to run multiple frameworks on the same cluster
Computation Model
Each framework (e.g., Hadoop) manages and runs one or more jobs
A job consists of one or more tasks
A task (e.g., map, reduce) consists of one or more processes running on the same machine
Today, each framework assumes it controls the cluster and implements its own sophisticated scheduler, e.g., Quincy, LATE, Fair Sharing, Maui, …
Our Solution: Mesos
"Operating system" for very large clusters (tens of thousands of nodes)
Provides fine-grained sharing of resources across frameworks, an execution environment for tasks, and isolation
Focus of this talk: resource management and scheduling
Approach: Global Scheduler
Advantages: can achieve an optimal schedule
Disadvantages: complexity makes it hard to scale and to ensure resilience; hard to anticipate future frameworks' requirements; need to refactor existing frameworks
[Diagram: a global scheduler takes organization policies, resource availability, estimated job/task running times, and framework/job requirements as inputs, and produces a task schedule.]
Our Approach: Distributed Scheduler
Advantages: simple, so easier to scale and make resilient; easy to port existing frameworks and to support new ones
Disadvantages: distributed scheduling decisions are not optimal
[Diagram: the master (global scheduler) takes organization policies and resource availability as inputs and sends framework schedules to per-framework schedulers, which produce the task schedules.]
Resource Offers
Resource offer: a set of available resources on a node, e.g., m1: <1 CPU, 1 GB>, m2: <4 CPUs, 16 GB>
Master sends resource offers to frameworks
Frameworks select which offers to accept and which tasks to run
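To make the offer/accept split concrete, here is a minimal Python sketch of a framework packing its pending tasks into a single offer. The Offer class, accept_offer function, and task sizes are illustrative assumptions, not the actual Mesos API.

# Illustrative sketch of the resource-offer model (not the real Mesos API).
# An offer lists the free resources on one node; the framework decides which
# of its pending tasks fit and launches only those, declining the rest.

from dataclasses import dataclass

@dataclass
class Offer:
    node: str
    cpus: float
    mem_gb: float

def accept_offer(offer, pending_tasks):
    """Greedily pack pending tasks (each a <cpus, mem_gb> demand) into one offer."""
    launched, free_cpus, free_mem = [], offer.cpus, offer.mem_gb
    for task in pending_tasks:
        if task["cpus"] <= free_cpus and task["mem_gb"] <= free_mem:
            launched.append(task)
            free_cpus -= task["cpus"]
            free_mem -= task["mem_gb"]
    return launched

# The offers from the slide: m1: <1 CPU, 1 GB>, m2: <4 CPUs, 16 GB>.
offers = [Offer("m1", cpus=1, mem_gb=1), Offer("m2", cpus=4, mem_gb=16)]
tasks = [{"name": f"map-{i}", "cpus": 1, "mem_gb": 2} for i in range(6)]
for offer in offers:
    launched = accept_offer(offer, tasks)
    tasks = [t for t in tasks if t not in launched]
    print(offer.node, "->", [t["name"] for t in launched])

Here m1 is effectively declined (its 1 GB is too small for any task) and four tasks are packed into m2; the remaining tasks wait for later offers.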
Mesos Architecture
[Diagram: the Mesos master contains a pluggable allocation module; framework schedulers (e.g., the Hadoop JobTracker and an MPI scheduler) register with the master; slaves run framework executors (Hadoop and MPI executors) that launch tasks; resource offers and status updates flow between master, frameworks, and slaves.]
Slaves continuously send status updates about their resources
A pluggable scheduler (allocation module) picks which framework to send an offer to
The framework scheduler selects resources and provides tasks
Framework executors launch tasks and may persist across tasks
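The callouts above describe a simple control loop: slaves report resources, the allocation module picks a framework, and the framework turns the offer into tasks. Below is a hypothetical toy simulation of that loop; the class and method names (Framework, Master, offer_cycle, ...) are invented for illustration and are not the actual Mesos implementation.

# Toy simulation of the Mesos control loop described above (invented names;
# not the actual Mesos master/allocator code).

class Framework:
    """Stand-in for a framework scheduler (e.g., the Hadoop JobTracker)."""
    def __init__(self, name, pending_tasks):
        self.name = name
        self.pending = pending_tasks          # each task needs 1 CPU in this toy model

    def resource_offer(self, node, cpus):
        # Select as many pending tasks as fit into the offered resources.
        taken, self.pending = self.pending[:cpus], self.pending[cpus:]
        return taken

class Master:
    """Master with a pluggable allocation policy (here: simple round-robin)."""
    def __init__(self, frameworks):
        self.frameworks = list(frameworks)
        self.free_cpus = {}                   # node -> free CPUs, from slave status updates

    def status_update(self, node, cpus):
        self.free_cpus[node] = cpus           # slaves continuously report free resources

    def offer_cycle(self):
        for node, cpus in self.free_cpus.items():
            framework = self.frameworks.pop(0)        # allocation module picks a framework
            self.frameworks.append(framework)
            for task in framework.resource_offer(node, cpus):
                print(f"launch {task} on {node} for {framework.name}")

# Two slaves report their resources, then the master runs one offer cycle.
master = Master([Framework("hadoop", ["map-1", "map-2", "reduce-1"]),
                 Framework("mpi", ["rank-0", "rank-1"])])
master.status_update("slave-1", cpus=2)
master.status_update("slave-2", cpus=2)
master.offer_cycle()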
Outline
Dominant Resource Fairness
Delay scheduling – dealing with constraints
Problem
Different types of resources
Frameworks (users) prioritize resources differently, e.g., I/O-bound vs. CPU-bound jobs
How do we allocate resources fairly?
Model
Demand vectors express users' requests, e.g., D1 = <2, 3, 1>: U1's tasks each need 2 units of R1, 3 units of R2, and 1 unit of R3
Users get resources in multiples of their demand vector: resource offer = k × demand vector, k > 0
Note: demand vectors are not needed in practice; the actual allocations are used for scheduling decisions
Problem Example
30 CPUs and 30 GB RAM
U1 needs <1 CPU, 1 GB RAM> per task; U2 needs <1 CPU, 3 GB RAM> per task
How many tasks should each user run?
Strawman: Asset Fairness
Equalize the sum of resources allocated to each user
30 CPUs and 30 GB RAM; U1 needs <1 CPU, 1 GB RAM> per task; U2 needs <1 CPU, 3 GB RAM> per task
Allocation: U1 gets 12 tasks (12 CPUs, 12 GB, ∑ = 24); U2 gets 6 tasks (6 CPUs, 18 GB, ∑ = 24)
[Chart: per-user shares of CPU and RAM under asset fairness.]
Problem: User 1 gets only 40% of each resource, i.e., less than 50% of either!
Share Guarantee Property
No incentive to share: User 1 would be better off just taking half of the cluster!
Share guarantee property: every user should get 1/n of at least one resource
"You shouldn't be worse off than if you ran your own cluster with 1/n of the resources"
Dominant Resource Fairness
Dominant resource: the resource the user has the largest share of
E.g., with total capacity <10, 4> and demand vector <2, 1>, the shares are 2/10 and 1/4, so the dominant resource is R2
Max-min fairness across dominant resources: with infinite demands, equalize the users' dominant resource shares
Satisfies the share guarantee property
DRF Example
30 CPUs and 30 GB RAM; U1 needs <2 CPUs, 1 GB RAM> per task; U2 needs <1 CPU, 4 GB RAM> per task
Allocation: U1 gets 10 tasks (20 CPUs = 67%, 10 GB); U2 gets 5 tasks (5 CPUs, 20 GB = 67%)
[Chart: per-user shares of CPU and RAM under DRF; both users reach a 67% dominant share.]
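The DRF policy described above can be sketched as progressive filling: repeatedly give one task's worth of resources to the user with the lowest dominant share. The drf_allocate function below is a minimal sketch of that rule, not the Mesos allocation module, and it reproduces the allocation in this example.

# Minimal sketch of DRF as progressive filling: keep launching a task for the
# user with the lowest dominant share until no user's next task fits.

def drf_allocate(capacity, demands):
    """capacity: [CPUs, GB]; demands: {user: per-task [CPUs, GB]} -> tasks per user."""
    consumed = [0.0] * len(capacity)
    alloc = {u: [0.0] * len(capacity) for u in demands}
    tasks = {u: 0 for u in demands}

    def dominant_share(user):
        return max(a / c for a, c in zip(alloc[user], capacity))

    while True:
        # Consider users from lowest to highest dominant share and launch a
        # task for the first one whose demand still fits in the free capacity.
        for user in sorted(demands, key=dominant_share):
            demand = demands[user]
            if all(consumed[i] + demand[i] <= capacity[i] for i in range(len(capacity))):
                for i in range(len(capacity)):
                    consumed[i] += demand[i]
                    alloc[user][i] += demand[i]
                tasks[user] += 1
                break
        else:
            return tasks          # no user's next task fits: allocation is done

# The example above: 30 CPUs, 30 GB; U1 needs <2 CPUs, 1 GB>, U2 needs <1 CPU, 4 GB>.
print(drf_allocate([30, 30], {"U1": [2, 1], "U2": [1, 4]}))
# -> {'U1': 10, 'U2': 5}: both users reach a 67% dominant share, as in the slide.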
Other Desirable Properties
Strategy-proofness, Pareto efficiency, envy-freeness, single-resource fairness, bottleneck fairness, population monotonicity, resource monotonicity
Related Work
Resource allocation is a very popular topic in micro-economic theory
Competitive Equilibrium from Equal Incomes (CEEI) [Varian74]: give each user 1/n of every resource, then let users trade in a perfectly competitive market until the allocation is Pareto efficient
CEEI is economists' main method of fair allocation [Young94]
DRF vs. CEEI
U1 = <16, 1>, U2 = <1, 2>
[Charts: per-user shares of CPU and RAM under DRF and under CEEI.]
DRF is more fair; CEEI gives better utilization
Strategy Proofness
A framework cannot increase its share by lying
DRF is strategy-proof: the allocation is determined by the dominant share, so asking for more of a non-dominant resource can only hurt
CEEI is not strategy-proof
In general, schedulers that maximize utilization can be gamed: a user whose claimed demand matches the available resources gets picked
Strategy Proof: DRF vs. CEEI (1/2)
U1 = <16, 1>, U2 = <1, 2>
[Charts: per-user shares of CPU and RAM under DRF and under CEEI when U1 reports its true demand.]
Strategy Proof: DRF vs. CEEI (2/2)
U1 = <16, 4>, U2 = <1, 2>
With CEEI, a user can increase its dominant share by lying about its demand
[Charts: per-user shares of CPU and RAM under DRF and under CEEI when U1 inflates its demand to <16, 4>.]
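The DRF half of this comparison can be checked with the drf_allocate sketch from the DRF example above. The 100 CPU / 100 GB cluster size below is an illustrative assumption; the slides only show shares, not totals.

# Reusing the drf_allocate sketch from the DRF example slide.
capacity = [100, 100]      # 100 CPUs, 100 GB RAM: assumed for illustration

truthful = drf_allocate(capacity, {"U1": [16, 1], "U2": [1, 2]})
lying    = drf_allocate(capacity, {"U1": [16, 4], "U2": [1, 2]})   # U1 inflates its RAM demand
print(truthful, lying)

U1 receives the same number of tasks in both runs: its DRF allocation is pinned by its dominant (CPU) share, so inflating the non-dominant RAM demand cannot help, whereas under CEEI the same lie increases U1's share.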
Property Summary
Several other schedulers were evaluated as well…

Property                   Asset   CEEI   DRF
Share guarantee                     ✔      ✔
Strategy proofness                         ✔
Pareto efficiency            ✔      ✔      ✔
Envy-freeness                ✔      ✔      ✔
Single resource fairness     ✔      ✔      ✔
Bottleneck res. fairness            ✔      ✔
Population monotonicity      ✔             ✔
Resource monotonicity
Dynamic Resource Sharing with DRF
[Plot: share of the cluster (0–100%) over time (roughly 0–330 s) for three frameworks, MPI, Hadoop, and Spark, dynamically sharing a cluster under DRF.]
Outline
Dominant Resource Fairness
Delay scheduling – dealing with constraints
Fair Allocation Hurts Locality (1/2)
[Diagram: blocks of File 1 and File 2 spread across six slaves; the master offers slots to Fwk 1 and Fwk 2 in fair-sharing order, and some tasks end up on slaves that do not hold their input blocks.]
Fair Allocation Hurts Locality (2/2)
[Diagram: a second scheduling-order example on the same cluster; again, tasks from both frameworks are forced onto slaves that do not hold their input blocks.]
Delay Scheduling
Relax scheduling: frameworks wait for a limited time if they cannot launch local tasks
Result: Waiting a few seconds is enough to get nearly 100% locality
Delay Scheduling Example
[Diagram: the same cluster, now scheduling Job 1 and Job 2 with delay scheduling; a job that cannot launch a data-local task waits ("Wait!") instead of taking a non-local slot, and tasks end up on slaves that hold their input blocks.]
Idea: wait a short time to get data-local scheduling opportunities
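A minimal sketch of this waiting rule, assuming a scheduler that offers slots to the head-of-queue job: the job declines non-local offers while it has waited less than a threshold, then falls back to a non-local slot. Names such as wait_threshold and the job/task dictionaries are illustrative, not the actual Hadoop fair-scheduler code.

import time

def delay_schedule(job, node, wait_threshold=5.0):
    """Return the task to launch on `node`, or None to decline this offer."""
    local = [t for t in job["pending"] if node in t["preferred_nodes"]]
    if local:
        job["waiting_since"] = None              # found a data-local slot: reset the clock
        task = local[0]
    elif job.get("waiting_since") is None:
        job["waiting_since"] = time.time()       # start waiting for a local slot
        return None
    elif time.time() - job["waiting_since"] < wait_threshold:
        return None                              # keep waiting a little longer
    else:
        task = job["pending"][0]                 # waited long enough: run non-locally
    job["pending"].remove(task)
    return task

# Example: a job whose only pending task prefers slave-3 declines an offer on slave-1.
job = {"pending": [{"name": "map-1", "preferred_nodes": {"slave-3"}}]}
print(delay_schedule(job, "slave-1"))    # None: start waiting for a local offer
print(delay_schedule(job, "slave-3"))    # launches map-1 data-locally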
How Long to Wait?
Model: Poisson arrival of data-local offers
Expected response time gain when waiting up to time w:
G(w) = (1 – e^(–w/t)) × (d – t)
where t is the expected time to get a local offer and d is the delay from running non-locally
Typical values: t ≈ (avg. task length) / ((tasks per node) × (file replication)); d ≥ (avg. length of a local task)
Since d > t, G(w) increases with w, so in this model the optimal wait time is infinite
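Plugging illustrative numbers into G(w) shows why a few seconds of waiting already captures most of the gain. The values of t and d below are assumptions for illustration, not measurements from the talk.

import math

t, d = 2.0, 20.0     # assumed: t = expected time to a local offer, d = non-local penalty (seconds)

for w in [0, 1, 2, 5, 10, float("inf")]:
    gain = (1 - math.exp(-w / t)) * (d - t)     # G(w) from the slide
    print(f"wait {w:>4} s -> expected gain {gain:5.1f} s")

With these numbers the gain is about 92% of its limit at w = 5 s and over 99% at w = 10 s, even though G(w) keeps increasing forever; that is why waiting just a few seconds suffices in practice.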
Locality and Execution Time
Running 16 Hadoop instances
[Charts: data locality and job execution time for four configurations: static partitioning, Mesos with no delay scheduling, Mesos with delay scheduling and a 1-second wait, and Mesos with delay scheduling and a 5-second wait.]
Summary
Challenge: allow frameworks to share a cluster
Mesos: an "OS" for sharing clusters
Supports existing frameworks
Fault-tolerant and scales to 10,000s of machines
Delay scheduling is part of the Hadoop distribution
Ongoing deployments at Twitter and Facebook
Open source: github.com/mesos
Research challenges: fairness for constrained and multiple resource types; distributed vs. centralized schedulers