Mesos - A Platform for Fine-Grained Resource Sharing in the Data Center
TRANSCRIPT
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
@PapersWeLoveSEA | Ankur Chauhan (@ankurcha)
Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica
Background
Rapid innovation in distributed/cluster computing
● Hadoop, Spark, Flink, Tez, HDFS, Dryad, etc.
● Micro-services / web-services (long running)
● <Your custom framework> …
… There is a lot of wasted effort.
Problem
● Rapid innovation in cluster/distributed frameworks
● No single framework is optimal for all workloads
● What do we want?
○ Run multiple frameworks on a single (big) cluster
■ Maximize utilisation of resources (elastic)
■ Share data between frameworks
Goal
Solution
Mesos Goals
● High utilization of resources
● Support diverse frameworks
● Scalability to 10,000s of nodes
● Reliability in the face of failures
● Efficient with minimal overhead
● Highly available via leader election of the master
Other benefits
● Run multiple instances of a framework
○ Isolate production and experimental tasks
○ Run multiple versions of a framework
● Build specialized frameworks targeting a particular domain
○ e.g. Spark
Mesos Architecture
● Resource offers
○ Offer available resources to frameworks
● Two-level scheduler
○ Mesos decides which framework gets the offer
○ Frameworks decide whether to accept or reject the offer
● Fine-grained sharing
○ Improved utilization, responsiveness, data locality
● Pluggable allocation module
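The two-level split above can be sketched in a few lines. This is a hypothetical, heavily simplified simulation (the class and field names are invented, and the real Mesos master speaks a protobuf-based API in C++): the master picks which framework sees an offer, and the framework decides whether to accept it and which tasks to launch.

```python
# Two-level scheduling sketch (hypothetical names, not the real Mesos API).

class Framework:
    """A framework's scheduler: accepts offers that fit its remaining demand."""
    def __init__(self, name, cpus_needed):
        self.name = name
        self.cpus_needed = cpus_needed

    def resource_offer(self, offer):
        # Level 2: the framework accepts or rejects the offer.
        if self.cpus_needed > 0 and offer["cpus"] >= 1:
            used = min(self.cpus_needed, offer["cpus"])
            self.cpus_needed -= used
            # Task descriptions are passed back to Mesos on acceptance.
            return [{"name": f"{self.name}-task", "cpus": used}]
        return []  # reject: the resources get re-offered elsewhere

class Master:
    """Level 1: the master decides which framework gets each offer."""
    def __init__(self, frameworks):
        self.frameworks = frameworks

    def offer(self, slave_cpus):
        for fw in self.frameworks:  # trivial policy; pluggable in real Mesos
            tasks = fw.resource_offer({"cpus": slave_cpus})
            if tasks:
                return fw.name, tasks
        return None, []

master = Master([Framework("hadoop", 2), Framework("spark", 4)])
winner, tasks = master.offer(slave_cpus=4)
```

The point of the split is that the master never needs to understand framework-specific scheduling logic (locality, fairness within a job, etc.); it only hands out resources.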
Mesos Architecture
High level architecture
Design components
● Mesos Master
○ Runs the allocation module
● Mesos Slave
○ Offers resources
○ Task/executor isolation
● Scheduler
○ Accepts/rejects resource offers
● Executor
○ Runs tasks
● Task
○ Unit of work
Mesos Architecture
● Making resource offers scalable and robust
○ Let schedulers define filters
○ Offered resources are counted as allocations
○ Rescind offers not accepted within a timeout
● Fault tolerance
○ ZooKeeper for leader election (hot standby)
○ Mesos reports slave and executor failures to the scheduler
○ Minimal internal state in the master
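The three robustness mechanisms above can be illustrated together. This is a hypothetical in-memory sketch (all names invented): a framework-supplied filter short-circuits offers the framework would reject anyway, offered resources count toward the framework's allocation while the offer is outstanding, and stale offers are rescinded after a timeout.

```python
# Sketch of offer filters, offers-count-as-allocations, and rescind-on-timeout.
import time

class OfferTracker:
    def __init__(self, timeout_s=5.0):
        self.timeout_s = timeout_s
        self.pending = {}    # offer_id -> (framework, cpus, sent_at)
        self.allocated = {}  # framework -> cpus; outstanding offers count too
        self.filters = {}    # framework -> predicate over offers

    def set_filter(self, fw, predicate):
        self.filters[fw] = predicate

    def send_offer(self, offer_id, fw, cpus, now=None):
        # Skip frameworks whose filter rejects this offer outright.
        if fw in self.filters and not self.filters[fw]({"cpus": cpus}):
            return False
        now = time.monotonic() if now is None else now
        self.pending[offer_id] = (fw, cpus, now)
        self.allocated[fw] = self.allocated.get(fw, 0) + cpus
        return True

    def rescind_stale(self, now=None):
        # Reclaim offers not accepted within the timeout.
        now = time.monotonic() if now is None else now
        stale = [oid for oid, (_, _, t) in self.pending.items()
                 if now - t > self.timeout_s]
        for oid in stale:
            fw, cpus, _ = self.pending.pop(oid)
            self.allocated[fw] -= cpus
        return stale
```

Counting outstanding offers as allocations is what keeps a slow or unresponsive framework from hoarding the cluster: its fair share shrinks while it sits on offers.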
Resource offers
● Mesos decides how many resources to offer to each framework, based on an allocation policy (e.g. fair sharing), while frameworks decide which resource offers to accept and which tasks to run on them.
● When a framework accepts an offer, it passes Mesos a description of the tasks (and executor) to launch.
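To make "based on allocation policy (fair share)" concrete, here is a hedged sketch of max-min fair sharing over a single resource (CPUs). Mesos's actual default allocator uses Dominant Resource Fairness across multiple resource types; this single-resource version only illustrates the shape of a pluggable allocation module, and the function name is invented.

```python
# Max-min fair sharing of one resource among frameworks (illustrative only;
# Mesos's real allocator uses Dominant Resource Fairness over cpus+mem+...).

def max_min_fair(capacity, demands):
    """Give each framework min(its demand, its fair share),
    redistributing any slack to the still-unsatisfied frameworks."""
    alloc = {fw: 0.0 for fw in demands}
    active = {fw for fw, d in demands.items() if d > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        share = remaining / len(active)
        for fw in list(active):
            give = min(share, demands[fw] - alloc[fw])
            alloc[fw] += give
            remaining -= give
            if alloc[fw] >= demands[fw] - 1e-9:
                active.remove(fw)  # demand satisfied; frees its leftover share
    return alloc
```

For example, with 10 CPUs and demands of 2, 6, and 6, the small framework gets its full 2 and the other two split the rest evenly at 4 each.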
Analysis
● Resource offers work well when:
○ Frameworks can scale elastically
○ Task durations are homogeneous
○ Frameworks have many preferred nodes
● These conditions hold in many frameworks (Hadoop, Dryad, Spark, …)
○ Work is divided into short tasks
○ Data is replicated across multiple nodes
Limitations of distributed scheduling
● Fragmentation
○ With heterogeneous resource demands, a distributed collection of frameworks may not bin-pack optimally
● Interdependent framework constraints
○ Esoteric dependency scenarios can only be satisfied with a centralized scheduler
● Framework complexity
○ Frameworks need to deal with resource offers (not onerous)
Implementation
● ~10,000 lines of C++
● libprocess: actor-based programming (like Akka, for C++)
● cgroups / Docker for task isolation
● ZooKeeper for leader election (HA)
● Ports / plugins
○ Hadoop MapReduce
○ Torque / MPI
○ Spark framework for iterative jobs
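The ZooKeeper-based leader election works on ephemeral sequential znodes: each master candidate creates one, the candidate holding the lowest sequence number is leader, and when its session dies its node vanishes and the next-lowest takes over. Below is a hypothetical in-memory simulation of that recipe (no real ZooKeeper involved; all names invented), just to show why hot standby recovery is automatic.

```python
# In-memory simulation of ZooKeeper-style leader election
# (ephemeral sequential nodes; lowest sequence number is leader).
import itertools

class ElectionGroup:
    def __init__(self):
        self._seq = itertools.count()
        self.members = {}  # candidate name -> sequence number

    def join(self, name):
        # Analogous to creating an ephemeral sequential znode.
        self.members[name] = next(self._seq)

    def leader(self):
        # Lowest sequence number wins, mirroring ZooKeeper's recipe.
        if not self.members:
            return None
        return min(self.members, key=self.members.get)

    def fail(self, name):
        # An ephemeral node disappears when its session dies,
        # so the next-lowest candidate becomes leader automatically.
        self.members.pop(name, None)
```

In production, candidates also watch the node just below theirs rather than polling, so failover is event-driven.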
Evaluation
Dynamic Resource sharing
Evaluation
Data locality with resource offers
○ 16 Hadoop instances sharing 93 EC2 instances
○ 1.7x speedup with Mesos
○ 97% data locality with 5 s delay scheduling
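The 97% locality figure comes from delay scheduling: instead of launching a task on the first offered node, the scheduler waits (up to a small bound, 5 s here) for an offer from a node that holds the task's input data. A hedged sketch of that decision rule, with invented names:

```python
# Delay scheduling decision rule (illustrative sketch, not the Hadoop port's code).

def decide(offer_node, preferred_nodes, waited_s, max_delay_s=5.0):
    """Return 'launch' for a data-local offer, 'launch' anywhere once the
    delay budget is spent, otherwise 'wait' for a better offer."""
    if offer_node in preferred_nodes:
        return "launch"   # data-local: take it immediately
    if waited_s >= max_delay_s:
        return "launch"   # give up on locality rather than starve
    return "wait"         # reject this offer; hold out for a local node
```

Because tasks are short, a few seconds of waiting costs little throughput but recovers almost all locality, which is exactly what the numbers above show.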
Evaluation
Mesos scalability
● 99 EC2 instances
● Scaled to 50,000 emulated slaves, 200 frameworks, 100,000 tasks
Evaluation
● Failure recovery
○ Mean time to recovery was between 4 and 8 seconds, with a 95% confidence interval of ±3 s
● Performance isolation
○ Linux containers are not perfect at isolation
○ 30% increase in request latency with container isolation, vs. a 550% increase without it
■ Test workload: Apache server + CPU-hog process
The end