Mesos - A Platform for Fine-Grained Resource Sharing in the Data Center
TRANSCRIPT
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
@PapersWeLoveSEA | Ankur Chauhan (@ankurcha)
Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica
Background
Rapid innovation in distributed/cluster computing
● Hadoop, Spark, Flink, Tez, HDFS, Dryad, etc.
● Micro-services / web-services (long running)
● <Your custom framework> …
… There is a lot of wasted effort.
Problem
● Rapid innovation in cluster/distributed frameworks
● No single framework is optimal for all workloads
● What do we want?
○ Run multiple frameworks on a single (big) cluster
■ Maximize utilisation of resources (elastic)
■ Share data between frameworks
Goal
Solution
Mesos Goals
● High utilization of resources
● Support diverse frameworks
● Scalability to 10,000s of nodes
● Reliability in the face of failures
● Efficient with minimal overhead
● Highly available via leader election of the master
Other benefits
● Run multiple instances of a framework
○ Isolate production and experimental tasks
○ Run multiple versions of a framework
● Build specialized frameworks targeting a particular domain
○ e.g. Spark
Mesos Architecture
● Resource offers
○ Offer available resources to frameworks
● Two-level scheduler
○ Mesos decides which framework gets the offer
○ Frameworks decide whether to accept or reject the offer
● Fine-grained sharing
○ Improved utilization, responsiveness, data locality
● Pluggable allocation module
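The two-level split above can be sketched in a few lines. This is a hypothetical, heavily simplified simulation (the class and field names are invented, and the real Mesos master speaks a protobuf-based API in C++): the master picks which framework sees an offer, and the framework decides whether to accept it and which tasks to launch.

```python
# Two-level scheduling sketch (hypothetical names, not the real Mesos API).

class Framework:
    """A framework's scheduler: accepts offers that fit its remaining demand."""
    def __init__(self, name, cpus_needed):
        self.name = name
        self.cpus_needed = cpus_needed

    def resource_offer(self, offer):
        # Level 2: the framework accepts or rejects the offer.
        if self.cpus_needed > 0 and offer["cpus"] >= 1:
            used = min(self.cpus_needed, offer["cpus"])
            self.cpus_needed -= used
            # Task descriptions are passed back to Mesos on acceptance.
            return [{"name": f"{self.name}-task", "cpus": used}]
        return []  # reject: the resources get re-offered elsewhere

class Master:
    """Level 1: the master decides which framework gets each offer."""
    def __init__(self, frameworks):
        self.frameworks = frameworks

    def offer(self, slave_cpus):
        for fw in self.frameworks:  # trivial policy; pluggable in real Mesos
            tasks = fw.resource_offer({"cpus": slave_cpus})
            if tasks:
                return fw.name, tasks
        return None, []

master = Master([Framework("hadoop", 2), Framework("spark", 4)])
winner, tasks = master.offer(slave_cpus=4)
```

The point of the split is that the master never needs to understand framework-specific scheduling logic (locality, fairness within a job, etc.); it only hands out resources.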
Mesos Architecture
High level architecture
Design components
● Mesos Master
○ Runs the allocation module
● Mesos Slave
○ Offers resources
○ Task/executor isolation
● Scheduler
○ Accepts/rejects resource offers
● Executor
○ Runs tasks
● Task
○ Unit of work
Mesos Architecture
● Making resource offers scalable and robust
○ Let schedulers define filters
○ Offered resources are counted as allocations
○ Rescind offers not accepted within a timeout
● Fault tolerance
○ ZooKeeper for leader election (hot standby)
○ Mesos reports slave and executor failures to the scheduler
○ Minimal internal state in the master
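The three robustness mechanisms above can be illustrated together. This is a hypothetical in-memory sketch (all names invented): a framework-supplied filter short-circuits offers the framework would reject anyway, offered resources count toward the framework's allocation while the offer is outstanding, and stale offers are rescinded after a timeout.

```python
# Sketch of offer filters, offers-count-as-allocations, and rescind-on-timeout.
import time

class OfferTracker:
    def __init__(self, timeout_s=5.0):
        self.timeout_s = timeout_s
        self.pending = {}    # offer_id -> (framework, cpus, sent_at)
        self.allocated = {}  # framework -> cpus; outstanding offers count too
        self.filters = {}    # framework -> predicate over offers

    def set_filter(self, fw, predicate):
        self.filters[fw] = predicate

    def send_offer(self, offer_id, fw, cpus, now=None):
        # Skip frameworks whose filter rejects this offer outright.
        if fw in self.filters and not self.filters[fw]({"cpus": cpus}):
            return False
        now = time.monotonic() if now is None else now
        self.pending[offer_id] = (fw, cpus, now)
        self.allocated[fw] = self.allocated.get(fw, 0) + cpus
        return True

    def rescind_stale(self, now=None):
        # Reclaim offers not accepted within the timeout.
        now = time.monotonic() if now is None else now
        stale = [oid for oid, (_, _, t) in self.pending.items()
                 if now - t > self.timeout_s]
        for oid in stale:
            fw, cpus, _ = self.pending.pop(oid)
            self.allocated[fw] -= cpus
        return stale
```

Counting outstanding offers as allocations is what keeps a slow or unresponsive framework from hoarding the cluster: its fair share shrinks while it sits on offers.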
Resource offers
● Mesos decides how many resources to offer to each framework, based on an allocation policy (e.g. fair sharing), while frameworks decide which resource offers to accept and which tasks to run on them.
● When a framework accepts an offer, it passes Mesos a description of the tasks (and executor) to launch.
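To make "based on allocation policy (fair share)" concrete, here is a hedged sketch of max-min fair sharing over a single resource (CPUs). Mesos's actual default allocator uses Dominant Resource Fairness across multiple resource types; this single-resource version only illustrates the shape of a pluggable allocation module, and the function name is invented.

```python
# Max-min fair sharing of one resource among frameworks (illustrative only;
# Mesos's real allocator uses Dominant Resource Fairness over cpus+mem+...).

def max_min_fair(capacity, demands):
    """Give each framework min(its demand, its fair share),
    redistributing any slack to the still-unsatisfied frameworks."""
    alloc = {fw: 0.0 for fw in demands}
    active = {fw for fw, d in demands.items() if d > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        share = remaining / len(active)
        for fw in list(active):
            give = min(share, demands[fw] - alloc[fw])
            alloc[fw] += give
            remaining -= give
            if alloc[fw] >= demands[fw] - 1e-9:
                active.remove(fw)  # demand satisfied; frees its leftover share
    return alloc
```

For example, with 10 CPUs and demands of 2, 6, and 6, the small framework gets its full 2 and the other two split the rest evenly at 4 each.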
Analysis
● Resource offers work well when:
○ Frameworks can scale elastically
○ Task durations are homogeneous
○ Frameworks have many preferred nodes
● These conditions hold in many frameworks (Hadoop, Dryad, Spark, …)
○ Work is divided into short tasks
○ Data is replicated across multiple nodes
Limitations of distributed scheduling
● Fragmentation
○ With heterogeneous resource demands, a distributed collection of frameworks may not bin-pack optimally
● Interdependent framework constraints
○ Esoteric dependency scenarios can only be satisfied with a centralized scheduler
● Framework complexity
○ Frameworks need to deal with resource offers (not onerous)
Implementation
● ~10,000 lines of C++
● libprocess: actor-based programming (like Akka, for C++)
● cgroups / Docker for task isolation
● ZooKeeper for leader election (HA)
● Ports / plugins
○ Hadoop MapReduce
○ Torque / MPI
○ Spark framework for iterative jobs
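The ZooKeeper-based leader election works on ephemeral sequential znodes: each master candidate creates one, the candidate holding the lowest sequence number is leader, and when its session dies its node vanishes and the next-lowest takes over. Below is a hypothetical in-memory simulation of that recipe (no real ZooKeeper involved; all names invented), just to show why hot standby recovery is automatic.

```python
# In-memory simulation of ZooKeeper-style leader election
# (ephemeral sequential nodes; lowest sequence number is leader).
import itertools

class ElectionGroup:
    def __init__(self):
        self._seq = itertools.count()
        self.members = {}  # candidate name -> sequence number

    def join(self, name):
        # Analogous to creating an ephemeral sequential znode.
        self.members[name] = next(self._seq)

    def leader(self):
        # Lowest sequence number wins, mirroring ZooKeeper's recipe.
        if not self.members:
            return None
        return min(self.members, key=self.members.get)

    def fail(self, name):
        # An ephemeral node disappears when its session dies,
        # so the next-lowest candidate becomes leader automatically.
        self.members.pop(name, None)
```

In production, candidates also watch the node just below theirs rather than polling, so failover is event-driven.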
Evaluation
Dynamic Resource sharing
Evaluation
Data locality with resource offers
○ 16 Hadoop instances sharing 93 EC2 instances
○ 1.7x speedup with Mesos
○ 97% data locality with 5 s delay scheduling
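The 97% locality figure comes from delay scheduling: instead of launching a task on the first offered node, the scheduler waits (up to a small bound, 5 s here) for an offer from a node that holds the task's input data. A hedged sketch of that decision rule, with invented names:

```python
# Delay scheduling decision rule (illustrative sketch, not the Hadoop port's code).

def decide(offer_node, preferred_nodes, waited_s, max_delay_s=5.0):
    """Return 'launch' for a data-local offer, 'launch' anywhere once the
    delay budget is spent, otherwise 'wait' for a better offer."""
    if offer_node in preferred_nodes:
        return "launch"   # data-local: take it immediately
    if waited_s >= max_delay_s:
        return "launch"   # give up on locality rather than starve
    return "wait"         # reject this offer; hold out for a local node
```

Because tasks are short, a few seconds of waiting costs little throughput but recovers almost all locality, which is exactly what the numbers above show.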
Evaluation
Mesos scalability
● 99 EC2 instances
● Scaled to 50,000 emulated slaves, 200 frameworks, 100,000 tasks
Evaluation
● Failure recovery
○ Mean time to recovery was between 4 and 8 seconds, with a 95% confidence interval of ±3 s
● Performance isolation
○ Linux containers are not perfect at isolation
○ 30% increase in request latency with container isolation, vs. a 550% increase without it
■ Test workload: Apache server + CPU-hog process
The end