*(Source: platformlab.stanford.edu/Seminar Talks/starling-pl.pdf)*
Starling: A Scheduler Architecture for High Performance Cloud Computing
Hang Qu, Omid Mashayekhi, David Terei, Philip Levis
Stanford Platform Lab Seminar
May 17, 2016
High Performance Computing (HPC)
• Arithmetically dense codes (CPU-bound)
• Operate on in-memory data
• A lot of machine learning is HPC
(Images: example HPC applications from SpaceX, NOAA, and Pixar.)
HPC on the Cloud
• HPC has struggled to use cloud resources
• HPC software has certain expectations
  – Cores are statically allocated and never change
  – Cores do not fail
  – Every core has identical performance
• The cloud doesn't meet these expectations
  – Elastic resources, dynamically reprovisioned
  – Some failures are expected
  – Variable performance
HPC in a Cloud Framework?
• Strawman: run HPC codes inside a cloud framework (e.g., Spark, MapReduce, Naiad)
  – The framework handles the cloud's challenges for you: scheduling, failure, resource adaptation
• Problem: too slow (by orders of magnitude)
  – Spark can schedule 2,500 tasks/second
  – A typical HPC compute task is 10 ms
    ‣ Each core can execute 100 tasks/second
    ‣ A single 18-core machine can execute 1,800 tasks/second
  – Queueing theory and batch processing mean you want to operate well below the maximum scheduling throughput
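The mismatch is simple arithmetic, which a few lines can check directly (a sketch using the figures from this slide):

```python
# Why a ~2,500 tasks/s central scheduler can't keep up with HPC workers,
# using the numbers quoted above.
task_length_s = 0.010                     # typical fast HPC task: 10 ms
tasks_per_core = 1.0 / task_length_s      # 100 tasks/s per core
cores_per_machine = 18
tasks_per_machine = tasks_per_core * cores_per_machine

scheduler_capacity = 2500.0               # Spark-style controller, tasks/s
machines_to_saturate = scheduler_capacity / tasks_per_machine

print(tasks_per_machine)      # 1800.0 tasks/s demanded by a single machine
print(machines_to_saturate)   # ~1.39: two machines already overwhelm it
```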
Starling
• A scheduling architecture for high performance cloud computing (HPCC)
• The controller decides data distribution; workers decide what tasks to execute
  – In steady state, no worker/controller communication except periodic performance updates
• Can schedule up to 120,000,000 tasks/s
  – Scales linearly with the number of cores
• HPC benchmarks run 2.4-3.3x faster
Outline
• HPC background
• Starling scheduling architecture
• Evaluation
• Thoughts and questions
High Performance Computing
• Arithmetically intense floating-point problems
  – Simulations
  – Image processing
  – If you might want to do it on a GPU but it's not graphics, it's probably HPC
• Embedded in a geometry, with data dependencies
Example: Semi-Lagrangian Advection

A fluid is represented as a grid. Each cell has a velocity. To update the value/velocity for the next time step, walk the velocity vector backwards and interpolate.

What if partitions A and B are on different nodes? Each partition holds partial data from other, adjacent partitions ("ghost cells").
High Performance Computing
• Arithmetically intense floating-point problems
  – Simulations
  – Image processing
  – If you might want to do it on a GPU but it's not graphics, it's probably HPC
• Embedded in a geometry, with data dependencies
  – Many small data exchanges after computational steps
  – Ghost cells, reductions, etc.
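The semi-Lagrangian advection example from the earlier slides can be sketched in a few lines (a 1D toy on a periodic grid; the function and its shape are illustrative, not from the talk):

```python
def advect_1d(values, velocities, dt, dx):
    """One semi-Lagrangian advection step on a periodic 1D grid.

    For each cell i, trace the velocity backwards (x = i - v*dt/dx),
    then linearly interpolate the old field at that departure point.
    """
    n = len(values)
    new_values = []
    for i in range(n):
        x = (i - velocities[i] * dt / dx) % n   # departure point (periodic)
        lo = int(x) % n
        hi = (lo + 1) % n
        frac = x - int(x)
        new_values.append((1 - frac) * values[lo] + frac * values[hi])
    return new_values

# A field advected by uniform velocity v=1 with dt=dx shifts by one cell.
field = [0.0, 1.0, 0.0, 0.0]
print(advect_1d(field, [1.0] * 4, dt=1.0, dx=1.0))  # [0.0, 0.0, 1.0, 0.0]
```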
Most HPC Today (MPI)
```c
while (time < duration) {
    // locally calculate, then global min
    dt = calculate_dt();
    // locally calculate, then exchange
    update_velocity(dt);
    // locally calculate, then exchange
    update_fluid(dt);
    // locally calculate, then exchange
    update_particles(dt);
    time += dt;
}
```

Control flow is explicit in each process.

• Processes operate in lockstep.
• Partitioning and communication are static.
• If a single process fails, the program crashes.
• Runs as fast as the slowest process.
• Assumes every node runs at the same speed.
Problems in the cloud
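The last two problems compound: because every rank synchronizes after each step, one slow node gates the whole job. A toy model (pure Python standing in for MPI; the rank speeds are invented for illustration) makes the cost concrete:

```python
def lockstep_time(per_step_times, steps):
    """Wall-clock time when all ranks synchronize after every step:
    each step costs as much as the slowest rank takes."""
    return steps * max(per_step_times)

def balanced_time(per_step_times, steps):
    """Ideal time if the same total work could be rebalanced evenly."""
    return steps * sum(per_step_times) / len(per_step_times)

# Four ranks; one straggler is 2x slower than the others.
per_step = [1.0, 1.0, 1.0, 2.0]
print(lockstep_time(per_step, 10))   # 20.0 -- gated by the straggler
print(balanced_time(per_step, 10))   # 12.5 -- what load balancing could buy
```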
HPC inside a cloud framework?
(Figure: left, HPC code running directly on the cloud (EC2, GCP); right, HPC code on top of a cloud framework, which provides load balancing, failure recovery, elastic resources, and straggler handling.)
Frameworks too slow
• Frameworks depend on a central controller to schedule tasks to worker nodes
• A controller can schedule ~2,500 tasks/second
• Fast, optimized HPC tasks can be ~10 ms long
  – An 18-core worker executes 1,800 tasks/second
  – Two workers overwhelm a scheduler
  – Prior work is fast or distributed, not both

(Figure: a driver submits tasks to the controller, which dispatches tasks to the workers and receives stats back.)
Outline
• HPC background
• Starling scheduling architecture
• Evaluation
• Thoughts and questions
Best of both worlds?
• Depend on a central controller to
  – Balance load
  – Recover from failures
  – Adapt to changing resources
  – Handle stragglers
• Depend on worker nodes to
  – Schedule computations
    ‣ Don't execute in lockstep
    ‣ Use the AMT/software-processor model (track dependencies)
  – Exchange data
Control Signal
• Need a different control signal between the controller and workers to schedule tasks
• Intuition: tasks execute where data is resident
  – Data moves only for load balancing/recovery
• The controller determines how data is distributed
• Workers generate and schedule tasks based on what data they have
Controller Architectures

(Figure: side-by-side comparison. Spark: the controller holds the driver program, RDD lineage, task graph, scheduler, and a partition-manager master; workers run partition-manager slaves that report local partitions back, and the controller dispatches tasks to per-worker task queues. Starling: the controller holds only a partition manager, a snapshot, and a scheduler that decides data placement, broadcasts it, and specifies a stage to snapshot at; each Canary worker runs its own driver program, task graph, scheduler, and partition-manager replica, generating its task queue locally.)
Partitioning Specification
• The driver program computes valid placements
  – Task accesses determine a set of constraints
    ‣ E.g., A1 and B1 must be on the same node, A2 and B2, etc.
• Micropartitions (overdecomposition)
  – If the expected maximum core count is n, create 2n-8n partitions
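Both ideas can be sketched in a few lines (the names and the overdecomposition factor of 4 are illustrative assumptions, not Starling's actual API):

```python
def plan_micropartitions(max_cores, factor=4):
    """Overdecompose: with an expected maximum of n cores, create
    several partitions per core (the slide suggests 2n to 8n total)."""
    return factor * max_cores

def colocation_groups(constraints):
    """Group partitions that must share a node, given pairwise
    constraints such as ('A1', 'B1')."""
    groups = []
    for a, b in constraints:
        merged = {a, b}
        rest = []
        for g in groups:
            if g & merged:
                merged |= g        # merge overlapping groups
            else:
                rest.append(g)
        groups = rest + [merged]
    return groups

print(plan_micropartitions(18))                         # 72
print(colocation_groups([("A1", "B1"), ("A2", "B2")]))  # two groups: A1+B1, A2+B2
```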
Managing Migrations
• The controller does not know where workers are in the program
• Workers do not know where the others are in the program
• When a data partition moves, the destination must pick up where the source left off
Example Task Graph
(Figure: task graph for distributed gradient descent. Per-partition stages gatherW, gradient, and scatterG read the global weight GW0 and produce local gradients; gatherG and sum combine them into the global gradient GG0; again? decides whether to iterate, and scatterW distributes updated weights. Legend: LW = local weight, TD = training data, LG = local gradient, GW = global weight, GG = global gradient.)
Execution Tracking
(Figure: one worker's task graph with stage IDs; each stage and partition carries metadata recording the program position.)

• Each worker executes the identical program in parallel.
• Control flow is identical (like GPUs).
• Tag each partition with the last stage that accessed it.
• Spawn all subsequent stages.
• Data exchanges implicitly synchronize.
Implicitly Synchronize
(Figure: the same gradient-descent task graph as before.)

Stage gatherG0 cannot execute until both gradient0 and gradient1 complete.
Stage again?0 cannot execute until sum0 completes.
Spawning and Executing Local Tasks
• Metadata for each data partition is the last stage that reads from or writes to it.
• After finishing a task, a worker:
  – Updates metadata.
  – Examines candidate tasks that operate on data partitions generated by the completed task.
  – Puts a candidate task on a ready queue if all data partitions it operates on are (1) local, and (2) modified by the right tasks.
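The worker-side logic above can be sketched as follows (a simplified model; the class and field names are mine, not Starling's):

```python
from collections import deque

class Worker:
    """Toy model of Starling-style local scheduling: each partition's
    metadata records the last stage that accessed it; a task becomes
    ready once all its input partitions are local and up to date."""

    def __init__(self, local_partitions):
        self.last_stage = {p: None for p in local_partitions}  # metadata
        self.ready = deque()

    def is_ready(self, task):
        # task["inputs"] maps partition -> stage that must have produced it
        return all(
            p in self.last_stage and self.last_stage[p] == need
            for p, need in task["inputs"].items()
        )

    def on_task_done(self, task, candidates):
        # 1. Update metadata for the partitions this task wrote.
        for p in task["writes"]:
            self.last_stage[p] = task["stage"]
        # 2. Re-examine candidate tasks touching those partitions.
        for c in candidates:
            if self.is_ready(c):
                self.ready.append(c)

w = Worker(["P1", "P2"])
t1 = {"stage": "gradient0", "writes": ["P1"], "inputs": {}}
t2 = {"stage": "gatherG0", "writes": ["P2"], "inputs": {"P1": "gradient0"}}
w.on_task_done(t1, candidates=[t2])
print([t["stage"] for t in w.ready])  # ['gatherG0']
```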
Two Communication Primitives
• Scatter-gather
  – Similar to MapReduce shuffles, Spark GroupBy, etc.
  – Takes one or more datasets as input
  – Produces one or more datasets as output
  – Used for data exchange
• Signal
  – Delivers a boolean result to every node
  – Used for control flow
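A minimal model of the scatter-gather primitive (pure Python; the API shape is an assumption, not Starling's actual interface — signal is then just broadcasting one boolean to every node):

```python
from collections import defaultdict

def scatter_gather(partitions, key_fn):
    """Regroup records from input partitions by key -- the same shape
    as a MapReduce shuffle or a Spark groupBy."""
    buckets = defaultdict(list)
    for part in partitions:      # scatter: each partition emits keyed records
        for record in part:
            buckets[key_fn(record)].append(record)
    return dict(buckets)         # gather: one output dataset per key

parts = [[("a", 1), ("b", 2)], [("a", 3)]]
print(scatter_gather(parts, key_fn=lambda r: r[0]))
# {'a': [('a', 1), ('a', 3)], 'b': [('b', 2)]}
```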
EVALUATION
Evaluation Questions
• How many tasks/second can Starling schedule?
• Where are the scheduling bottlenecks?
• Does this improve computational performance?
  – Logistic regression
  – K-means, PageRank
  – Lassen
  – PARSEC fluidanimate
• Can Starling adapt HPC applications to changes in available nodes?
Scheduling Throughput
(Figure: CPU utilization vs. average task length in µs; the marked point is (6.6 µs, 90%).)

A single core can schedule 136,000 tasks/second while using 10% of CPU cycles.
Scheduling puts a task on a local queue (no locks); no I/O is needed.
Scheduling Throughput
Scheduling throughput increases linearly with the number of cores: 120,000,000 tasks/second on 1,152 cores (10% CPU overhead).
Bottleneck is data transmission on partition migration.
Overprovisioned Scheduling
Workload is 3,600 tasks/second. Queueing delay due to batched scheduling means much higher scheduling throughput is needed.
Queueing Delay
(Figure: four cores, each running a 1-second task; the controller dispatches 4 tasks/second, so tasks start at 0.00 s, 0.25 s, 0.50 s, and 0.75 s and the last finishes at 1.75 s.)

Load is 4 tasks/second. If the scheduler can handle only 4 tasks/second, queueing delay increases execution time from 1 s to 1.75 s (75%).
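The 75% figure falls straight out of the dispatch timing, which a few lines can reproduce (a sketch of the slide's scenario):

```python
def makespan(n_tasks, dispatch_rate, task_time):
    """Tasks start one dispatch interval apart (scheduler running at
    capacity) and run in parallel on their own cores; the last task
    dispatched gates completion."""
    last_start = (n_tasks - 1) / dispatch_rate
    return last_start + task_time

m = makespan(n_tasks=4, dispatch_rate=4.0, task_time=1.0)
print(m)                   # 1.75
print((m - 1.0) / 1.0)     # 0.75 -> 75% slowdown vs. instant dispatch
```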
Logistic Regression
With optimal micropartitioning, 32 workers generate >18,000 tasks/second.

Controller: c4.large instance (1 core); 32 workers: c4.8xlarge instances (18 cores each)
Lassen (Eulerian)

Controller: c4.large instance (1 core); 32 workers: c4.8xlarge instances (18 cores each)

Micropartitioning + load balancing runs 3.3x faster than MPI.
PARSEC fluidanimate (Lagrangian)

Controller: c4.large instance (1 core); 32 workers: c4.8xlarge instances (18 cores each)

Micropartitioning + load balancing runs 2.4x faster than MPI.
Migration Costs
32 nodes, reprovisioned so computation moves off of 16 nodes onto 16 new ones. Data transfer takes several seconds at 9 Gbps.

Migration times are large compared to task execution times, but migration is not a bottleneck at the controller.
Starling Evaluation Summary
• Can schedule 136,000 tasks/s per core
  – 120,000,000 tasks/s on 1,152 cores
  – Need to overprovision scheduling for high throughput
• Schedules HPC applications on >1,000 cores
• Micropartitions improve performance
  – Increase required throughput: 5-18,000 tasks/s on 32 nodes
• A central load balancer improves performance
  – 2.4-3.3x for HPC benchmarks
• Can adapt to changing resources
Outline
• HPC background
• Starling scheduling architecture
• Evaluation
• Thoughts and questions
Thoughts
• Many newer cloud computing workloads resemble high performance computing
  – I/O-bound workloads are slow
  – CPU-bound workloads are fast
• Next-generation systems will draw from both
  – From cloud computing: handling variation, completion times, and failures; programming models; system decomposition
  – From HPC: scalability and performance
Thank you!
Philip Levis
Hang Qu
Omid Mashayekhi
David Terei
Chinmayee Shah