Practical Container Scheduling: Juggling Optimizations, Guarantees, and Trade-Offs at Netflix
Sharma Podila, Senior Software Engineer, Netflix
Consider this
You got yourselves an Apache Mesos cluster. Yeah!
Can you:
– Guarantee capacity for all your applications?
– Optimize assignments for locality, affinity?
– Keep the cluster size elastic?
– Minimize total usage footprint?
About me
• Works in Edge Engineering at Netflix
– Distributed resource scheduling
– Worked on projects Mantis and Titus
• Created Netflix OSS Fenzo
• Previously, built resource scheduling for HPC-like batch processing in data center environments
Agenda
• What are we trying to solve?
• Why juggle?
• Scheduling challenges in large clusters
• A look into what we created, how it works
• What’s next?
Reactive stream processing: Mantis

[Diagram: events flow from the Zuul cluster and API cluster into Mantis stream-processing jobs, such as Anomaly Detection]

Cloud native service
● Configurable message delivery guarantees
● Heterogeneous workloads
○ Real-time dashboarding, alerting
○ Anomaly detection, metric generation
○ Interactive exploration of streaming data
Container deployment: Titus

[Diagram: the Titus Job Control plane schedules app and batch containers onto VMs inside an EC2 VPC; each VM also runs the Cloud Platform components (metrics, IPC, health) and integrates with Eureka, Edda, and Atlas & Insight]
What the cluster needs to support
• Heterogeneous mix of workloads
– Vary in # of CPUs, memory, network, local disk
– Vary in criticality and runtime duration
• Resource demand variation over time
– Data volume variation in Mantis
– Number of containers in Titus
Agenda
• What are we trying to solve?
• Why juggle?
• Scheduling challenges in large clusters
• A look into what we created, how it works
• What’s next?
Why juggle at all?

If we had unlimited resources for all workloads, there would be no need to juggle.

If you are running on an elastic cloud, don’t you have unlimited resources?
Why juggle at all?

• Demand vs. Supply
• Efficiency
– Clusters are typically only about 50% utilized
• Workload types
– critical user facing, pre-compute for production, experimentation, testing, “idle-soak”
– services, batch, stream processing
Agenda
• What are we trying to solve?
• Why juggle?
• Scheduling challenges in large clusters
• A look into what we created, how it works
• What’s next?
Scheduling challenges in large clusters

• Complexity: a spectrum from speed (first-fit assignment) to accuracy (optimal assignment), with real-world trade-offs in between
Scheduling challenges in large clusters

• Complexity
• Speed of scheduling; a slow scheduler can
– Leave servers idle longer
– Make inefficient and incorrect assignments
Our initial goals for a cluster scheduler

• Multi goal optimization for task placement
• Cluster autoscaling
• Extensibility
• Security
• Capacity guarantees
• Reasoning about allocation failures
Multi goal task placement

[Diagram: placement balances the DC/Cloud operator’s concerns, such as cost and security, with the application owner’s; the aim is to move in the generally right direction]
Cluster autoscaling

Large variation in peak-to-trough resource requirements:
• Mantis: 2M to 12M events/sec
• Titus: 10s to 1000s of concurrent containers
Cluster autoscaling

• Scaling up a cluster is relatively easy
• Scaling down requires bin packing: tasks must first be consolidated onto fewer hosts so that the emptied hosts can be terminated
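To make the scale-down point concrete, here is a minimal sketch (an illustrative helper, not Fenzo's actual API) that repacks running tasks onto as few hosts as possible using first-fit decreasing; any hosts the packing no longer needs are candidates to drain and terminate.

```python
def hosts_freed_by_repacking(num_hosts, host_cpus, task_cpus_list):
    """Return how many hosts could be emptied by repacking all tasks."""
    free = []  # remaining CPUs on each host the packing keeps
    for cpus in sorted(task_cpus_list, reverse=True):  # largest tasks first
        for i, f in enumerate(free):
            if f >= cpus:
                free[i] -= cpus  # task fits on an already-kept host
                break
        else:
            free.append(host_cpus - cpus)  # must keep another host
    return max(0, num_hosts - len(free))

# Four 8-CPU hosts running tasks of 4, 4, 2, 2, 2 CPUs pack into two
# hosts, so two hosts could be drained.
print(hosts_freed_by_repacking(4, 8, [4, 4, 2, 2, 2]))  # → 2
```

Without bin-packing-aware placement, tasks spread thinly across many hosts and no host ever becomes empty enough to terminate.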
Security

[Diagram: Host foo runs Task 0 (SecGrp A), Task 1 (SecGrp Y,Z), and app tasks in SecGrp X side by side]

Mixing tasks with different security access on a single host
Capacity guarantees

Guarantee capacity to all applications per SLA

[Diagram: resource allocation order flows across the Critical and Flex tiers]

Quotas vs. priorities
Reasoning about allocation failures

• Why is a job not running?
• What resources are we not able to allocate?
• How many servers are failing the resource requests?
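The questions above suggest simple bookkeeping: for every host that rejects a task, record which resource fell short. A sketch under assumed data shapes (plain dicts of resource amounts; this is not Fenzo's API, which reports failures through its own result types):

```python
from collections import Counter

def shortfall_counts(task, hosts):
    """For each resource, count how many hosts cannot satisfy the task."""
    counts = Counter()
    for host in hosts:
        for res, need in task.items():
            if host.get(res, 0) < need:
                counts[res] += 1  # this host falls short on this resource
    return counts

task = {"cpus": 4, "memory": 8}
hosts = [
    {"cpus": 2, "memory": 16},  # too few CPUs
    {"cpus": 8, "memory": 4},   # too little memory
    {"cpus": 8, "memory": 16},  # fits
]
print(shortfall_counts(task, hosts))  # → Counter({'cpus': 1, 'memory': 1})
```

Aggregating these counts per job turns "why is my job not running?" into an answerable question rather than guesswork.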
Agenda
• What are we trying to solve?
• Why juggle?
• Scheduling challenges in large clusters
• A look into what we created, how it works
• What’s next?
Core scheduling, job control plane

[Diagram: the Titus/Mantis framework, with its Batch Job Manager and Service Job Manager, embeds Fenzo and runs on Apache Mesos over AWS EC2]
Fenzo scheduling strategy

[Diagram: pending tasks are ordered by urgency and assigned by fitness; N tasks to assign from M possible agents]
Fenzo, OSS scheduling library
Benefits for any JVM Mesos framework:
• Extensibility via plugins
• Cluster autoscaling
• Tiered queues with weighted DRF
• Control for speed vs. optimal assignments
• Ease of experimentation
github.com/Netflix/Fenzo
Fenzo scheduling strategy

For each (ordered) task
  On each available host
    Validate hard constraints
    Eval score for fitness and soft constraints
  Until score good enough, and
    a minimum # of hosts evaluated
  Pick host with highest score

Fitness and constraints are plugins
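The loop above can be sketched as follows. This is a minimal illustration, not Fenzo's actual Java API: the helper names, the score function, and the thresholds are all assumptions.

```python
def place(task, hosts, hard_constraints, score_fn,
          good_enough=0.9, min_hosts=3):
    """Pick a host for the task, trading optimality for speed."""
    best_host, best_score = None, -1.0
    evaluated = 0
    for host in hosts:
        # Hard constraints are pass/fail: skip hosts that violate any.
        if not all(c(task, host) for c in hard_constraints):
            continue
        score = score_fn(task, host)  # fitness + soft constraints
        evaluated += 1
        if score > best_score:
            best_host, best_score = host, score
        # Stop early once the score is good enough AND we have looked
        # at a minimum number of hosts: full optimality is not worth
        # the latency in a large cluster.
        if best_score >= good_enough and evaluated >= min_hosts:
            break
    return best_host

hosts = [{"id": 1, "score": 0.5}, {"id": 2, "score": 0.95},
         {"id": 3, "score": 0.7}, {"id": 4, "score": 0.99}]
chosen = place(None, hosts, [lambda t, h: True], lambda t, h: h["score"])
print(chosen["id"])  # → 2 (good enough; host 4 is never evaluated)
```

The early-exit thresholds are exactly the "speed vs. optimal assignments" knob mentioned earlier: raising `good_enough` or `min_hosts` moves toward optimal placement at the cost of scheduling latency.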
Fitness functions we use

• CPU, memory, and network bin packing
– E.g., CPU fitness = usedCPUs / totalCPUs
– [Chart: CPU fitness scores from 0.0 to 1.0 across five hosts; the host with the highest score is selected]
• Task runtime profile type: perpetual vs. finite time
• Minimize concurrent launch of tasks on an individual host

fitness = binPacking * w1 + runtime * w2 + launch * w3
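A sketch of the weighted combination above. The CPU term follows the slide's usedCPUs / totalCPUs bin-packing score (fuller hosts score higher, which packs tasks tightly); the weights and the runtime/launch component scores here are made-up illustrations, not Netflix's actual values.

```python
def cpu_bin_packing_fitness(host):
    # Fuller hosts score higher, steering new tasks toward them.
    return host["used_cpus"] / host["total_cpus"]

def combined_fitness(host, runtime_score, launch_score,
                     w1=0.5, w2=0.3, w3=0.2):
    """fitness = binPacking * w1 + runtime * w2 + launch * w3"""
    return (cpu_bin_packing_fitness(host) * w1
            + runtime_score * w2
            + launch_score * w3)

host = {"used_cpus": 6, "total_cpus": 8}
# binPacking 0.75 * 0.5 + runtime 1.0 * 0.3 + launch 0.5 * 0.2
print(combined_fitness(host, runtime_score=1.0, launch_score=0.5))  # → 0.775
```

Each component stays in [0, 1], so the weights directly express the relative importance of packing, runtime-profile matching, and launch spreading.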
Hard constraints we use

• GPU server matching
– Use an agent with GPUs only if the task requires one
• Match tasks with resources earmarked for queue tiers
Soft constraints we use

• Specified by individual jobs at submit time
• Balance tasks of a job across availability zones
• Balance tasks of services across hosts
Mixing fitness with soft constraints
Agent score = fitness score * 0.4 + soft constraint score * 0.6
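A sketch of this mixing, using the 0.4/0.6 weights from the slide and a hypothetical zone-balancing soft constraint (the scoring shape is an assumption, not Fenzo's actual plugin interface):

```python
def zone_balance_score(host_zone, job_tasks_per_zone):
    """Score 1.0 for the zone holding the fewest of this job's tasks."""
    most = max(job_tasks_per_zone.values(), default=0) or 1
    return 1.0 - job_tasks_per_zone.get(host_zone, 0) / most

def agent_score(fitness_score, soft_constraint_score):
    # Weights from the slide: soft constraints slightly outweigh fitness.
    return fitness_score * 0.4 + soft_constraint_score * 0.6

# A job with 4 tasks in zone a and 1 in zone b prefers hosts in zone b.
soft = zone_balance_score("us-east-1b", {"us-east-1a": 4, "us-east-1b": 1})
print(soft)                    # → 0.75
print(agent_score(0.5, soft))  # → 0.65
```

Because soft constraints only adjust the score rather than disqualify hosts, a task still lands somewhere even when perfect balance is impossible.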
Our queues setup

[Diagram: resource allocation order flows from the Critical tier (Tier 0), with buckets AppC1 through AppCN, to the Flex tier (Tier 1), with buckets AppF1 through AppFM]

Separate tiers based on how quickly resources need to be allocated

Weighted DRF across buckets in a tier
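The weighted-DRF ordering within a tier can be sketched as below. The data shapes and names are hypothetical (not Fenzo's API): each bucket's dominant share is its largest per-resource usage fraction, divided by the bucket's weight, and the bucket with the smallest weighted dominant share is served next.

```python
def weighted_dominant_share(usage, capacity, weight):
    """Largest per-resource usage fraction, scaled down by weight."""
    dominant = max(usage.get(r, 0) / capacity[r] for r in capacity)
    return dominant / weight

def next_bucket_to_serve(buckets, capacity):
    """Pick the bucket with the smallest weighted dominant share."""
    return min(buckets, key=lambda name: weighted_dominant_share(
        buckets[name]["usage"], capacity, buckets[name]["weight"]))

capacity = {"cpus": 100, "memory": 1000}
buckets = {
    # AppC1 dominates on CPU: max(0.5, 0.1) / 1.0 = 0.5
    "AppC1": {"usage": {"cpus": 50, "memory": 100}, "weight": 1.0},
    # AppC2 dominates on memory but has twice the weight:
    # max(0.2, 0.6) / 2.0 = 0.3, so it is served first.
    "AppC2": {"usage": {"cpus": 20, "memory": 600}, "weight": 2.0},
}
print(next_bucket_to_serve(buckets, capacity))  # → AppC2
```

Weights let the operator give some application buckets a larger fair share without starving the rest of the tier.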
User interface for capacity guarantee

• Application setup
– Specify total capacity needs for an application
– E.g., “4-CPU, 8GB, 512 Mbps” times 120 containers
• User specifies “application name”
• A “default” catch-all bucket supports experimentation
• Cluster admin maps applications to tiers
Sizing agent clusters for capacity

Tier 0: used capacity plus idle capacity, autoscaled between the cluster min size (guaranteed capacity) and the cluster max size

Tier 1: used capacity, autoscaled between the cluster desired size and the cluster max size (idle size kept near zero)
Reasoning about allocation failures
Agenda
• What are we trying to solve?
• Why juggle?
• Scheduling challenges in large clusters
• A look into what we created, how it works
• What’s next?
What’s next?

• Task evictions
• “Noisy neighbors” feedback from agents
• Automated rollout of new agent code
Questions?
Practical Container Scheduling: Juggling Optimizations, Guarantees, and Trade-Offs
at Netflix
Sharma Podila @podila