![Page 1: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/1.jpg)
Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group
Imperial College London
SpotnikDesigning Distributed Machine Learning for Transient Cloud
Resources
![Page 2: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/2.jpg)
Distributed ML
2
Train a machine learning model
worker 0 worker 1
worker 2
Δ Δ
Δ
![Page 3: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/3.jpg)
Distributed ML
3
Train a machine learning model
worker 0 worker 1
worker 2
Δ Δ
ΔLearn a model
![Page 4: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/4.jpg)
Distributed ML
4
Train a machine learning model
worker 0 worker 1
worker 2
Δ Δ
Δ
Data parallelism
![Page 5: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/5.jpg)
Distributed ML
5
Train a machine learning model
worker 0 worker 1
worker 2
Δ Δ
Δ
Ring all-reduce
![Page 6: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/6.jpg)
Challenges of distributed ML
• Distributed ML is resource-hungry
• Accelerated resources are expensive
Example Megatron-LM3
• Training of BERT-like model
• 512 V100 GPUs
• One epoch (68,507 iterations) takes 2.1 days
• Cost on Azure: $92,613
63Shoeybi, Mohammad, et al. Megatron-LM: Training multi-billion parameter language models using gpu model parallelism, 2017
![Page 7: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/7.jpg)
Transient cloud resources
1https://azure.microsoft.com/en-us/pricing/spot/7
• Examples: AWS Spot instances, Azure Spot VMs
• Follows the law of a free market
• Revocations
• Notifications
• Economic incentive
• Offers a cost reduction of up to 90%1
A Megatron-LM epoch would drop from $92,613 to $15,152
worker 2
90%
![Page 8: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/8.jpg)
Implications of transient resources
• New workers become available or old workers get revoked
➙ System must cope with disappearing resources
• Changes can happen at any time
➙ System must ensure consistency of updates
8
worker 0
worker 1
worker 2
worker 3
worker 4
Cluster
Size
![Page 9: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/9.jpg)
Implications of transient resources
• New workers become available or old workers get revoked
➙ System must cope with disappearing resources
• Changes can happen at any time
➙ System must ensure consistency of updates
• Cluster sizes are unknown beforehand
➙ System must adapt to different conditions
9
HogWild!
AD-PSGD
EA-SGD
SMAS-SGD
Network efficiency
Model accuracy
![Page 10: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/10.jpg)
Current approach: Checkpoint & recovery
• Tensorflow and Pytorch
• Changes to the cluster are not considered
• Recovery takes about 20 seconds with ResNet50 and ImageNet
Recovery Training
10
CheckpointTraining
![Page 11: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/11.jpg)
• Mix dedicated resources with transient resources
• Proteus2: Placement of parameter server on dedicated resources so that the training state is save
Parameter server
Worker 0 Worker 1 Worker 2
Dedicated resources
Transient resources
2Harlap et al. Proteus: agile ML elasticity through tiered reliability in dynamic re- source markets. EuroSys, 201711
Current approaches: Hybrid
![Page 12: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/12.jpg)
Spotnik
12
Challenges Solutions
Workers become available or get revoked
Reuse communication channels for synchronisation to repair the cluster
Changes can happen at any time
Ensure atomic model updates by waiting for all synchronisations to finish first
Cluster sizes are unknown beforehand
Change the synchronisation strategy based on the number of workers
![Page 13: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/13.jpg)
Spotnik
13
Challenges Solutions
Workers become available or get revoked
Reuse communication channels for synchronisation to repair the cluster
Changes can happen at any time
Ensure atomic model updates by waiting for all synchronisations to finish first
Cluster sizes are unknown beforehand
Change the synchronisation strategy based on the number of workers
![Page 14: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/14.jpg)
Revocation recovery algorithm
• Handle revocations within a ring topology
• Number of total messages is bounded by messages
• K is the number of simultaneous revocations
• N is the number of workers
➙ Scale to many transient resources
• No reliance on revocation notifications
O(N ⋅ K)
14
![Page 15: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/15.jpg)
Revocation recovery algorithmRepairing a broken all-reduce ring
W0
W1
W2
W3
W4
W5
Worker Membership
0 [0, 1, 2, 3, 4, 5]
1 [0, 1, 2, 3, 4, 5]
2 [0, 1, 2, 3, 4, 5]
3 [0, 1, 2, 3, 4, 5]
4 [0, 1, 2, 3, 4, 5]
5 [0, 1, 2, 3, 4, 5]
15
![Page 16: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/16.jpg)
Repairing a broken all-reduce ring
W0
W1
W2
W3
W4
W5
Worker Membership
0 [0, 1, 2, 3, 4, 5]
2 [0, 1, 2, 3, 4, 5]
3 [0, 1, 2, 3, 4, 5]
4 [0, 1, 2, 3, 4, 5]
5 [0, 1, 2, 3, 4, 5]
16
ReconcileRevocation
Revocation recovery algorithm
![Page 17: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/17.jpg)
Repairing a broken all-reduce ring
W0
W1
W2
W3
W4
W5
Worker Membership
0 [0, 2, 3, 4, 5]
2 [0, 2, 3, 4, 5]
3 [0, 1, 2, 3, 4, 5]
4 [0, 1, 2, 3, 4, 5]
5 [0, 1, 2, 3, 4, 5]
Reconcile
Bypass
Revocation recovery algorithm
17
![Page 18: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/18.jpg)
Repairing a broken all-reduce ring
W0
W1
W2
W3
W4
W5
Worker Membership
0 [0, 2, 3, 4, 5]
2 [0, 2, 3, 4, 5]
3 [0, 2, 3, 4, 5]
4 [0, 1, 2, 3, 4, 5]
5 [0, 1, 2, 3, 4, 5]
18
Reconcile
Revocation recovery algorithm
![Page 19: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/19.jpg)
Repairing a broken all-reduce ring
W0
W1
W2
W3
W4
W5
Worker Membership
0 [0, 2, 3, 4, 5]
2 [0, 2, 3, 4, 5]
3 [0, 2, 3, 4, 5]
4
5 [0, 1, 2, 3, 4, 5]
19
Reconcile
Revocation
Revocation recovery algorithm
![Page 20: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/20.jpg)
Repairing a broken all-reduce ring
W0
W1
W2
W3
W4
W5
Worker Membership
0 [0, 2, 3, 4, 5]
2 [0, 2, 3, 4, 5]
3 [0, 2, 3, 5]
4
5 [0, 2, 3, 5]
20
Reconcile
Bypass
Revocation recovery algorithm
![Page 21: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/21.jpg)
Repairing a broken all-reduce ring
W0
W1
W2
W3
W4
W5
Worker Membership
0 [0, 2, 3, 5]
2 [0, 2, 3, 4, 5]
3 [0, 2, 3, 5]
4
5 [0, 2, 3, 5]
21
Reconcile
Revocation recovery algorithm
![Page 22: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/22.jpg)
Repairing a broken all-reduce ring
W0
W1
W2
W3
W4
W5
Worker Membership
0 [0, 2, 3, 5]
2 [0, 2, 3, 5]
3 [0, 2, 3, 5]
4
5 [0, 2, 3, 5]
22
Reconcile
Revocation recovery algorithm
![Page 23: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/23.jpg)
Repairing a broken all-reduce ring
W0
W1
W2
W3
W4
W5
Worker Membership
0 [0, 2, 3, 5]
2 [0, 2, 3, 5]
3 [0, 2, 3, 5]
4
5 [0, 2, 3, 5]
23
Accept
Revocation recovery algorithm
![Page 24: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/24.jpg)
Repairing a broken all-reduce ring
W0
W2
W3
W5
Worker Membership
0 [0, 2, 3, 5]
2 [0, 2, 3, 5]
3 [0, 2, 3, 5]
5 [0, 2, 3, 5]
24
Restart
Revocation recovery algorithm
![Page 25: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/25.jpg)
Spotnik
25
Challenges Solutions
Workers become available or get revoked
Reuse communication channels for synchronisation to repair the cluster
Changes can happen at any time
Ensure atomic model updates by waiting for all synchronisations to finish first
Cluster sizes are unknown beforehand
Change the synchronisation strategy based on the number of workers
![Page 26: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/26.jpg)
Atomic worker state update
• Pipelined synchronisation
2626
Parameter Parameter
Sync. Sync.
Parameter
Sync.
Update Update Update
![Page 27: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/27.jpg)
Atomic worker state update
• Pipelined synchronisation
• Revocations can happen meanwhile
➙ Partial update leads to inconsistency
2727
Parameter Parameter
Sync. Sync.
Parameter
Sync.
Update Update Update
![Page 28: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/28.jpg)
Atomic worker state update
• Atomicity: Wait for all synchronisation communications to finish
• Discard updates if communication fails
2828
Parameter Parameter
Sync. Sync.
Parameter
Sync.
Update Update Update
Barrier
![Page 29: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/29.jpg)
Spotnik
29
Challenges Solutions
Workers become available or get revoked
Reuse communication channels for synchronisation to repair the cluster
Changes can happen at any time
Ensure atomic model updates by waiting for all synchronisations to finish first
Cluster sizes are unknown beforehand
Change the synchronisation strategy based on the number of workers
![Page 30: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/30.jpg)
Adaptive synchronisation strategies
• Support a range of synchronisation primitives
• collective and point-to-point synchronisation
• Monitor a metric
• Number of workers
30
![Page 31: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/31.jpg)
Adaptive synchronisation strategies• Support a range of synchronisation primitives
• collective and point-to-point synchronisation
• Monitor a metric
• Number of workers
• Define a policy in the beginning
• When to choose which sync strategy
31
W0
W1W2
3 workers
W0
W3
W2
6 workers
W1
W4
W5
AD-PSGD S-SGD
![Page 32: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/32.jpg)
EvaluationHow does the recovery latency change with increasing number of revocations?
32
Cluster • 16 workers Hardware • Azure NC6 VMs
• Nvidia K80 Software • KungFu 0.2.1 • Tensorflow 1.15 ML • ResNet50 • ImageNet
No significant increase of recovery latency if the number of revocation increases
![Page 33: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/33.jpg)
EvaluationHow much does the training slow down if we use atomic worker state updates?
33
Cluster • 32 workers Hardware • Azure NC6 VMs
• Nvidia K80 Software • KungFu 0.2.1 • Tensorflow 1.15
*different Setup Cluster • 16 workers Hardware • Huawei ModelArts
• Nvidia V100 • InfiniBand
Software • KungFu 0.2.1 • Tensorflow 1.12
Throughput decrease is small
![Page 34: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/34.jpg)
EvaluationHow does the throughput change, if the cluster changes?
34
Cluster • up to 32 workers Hardware • Azure NC6 VMs
• Nvidia K80 Software • KungFu 0.2.1 • Tensorflow 1.15 ML • ResNet50 • ImageNet
Cluster size5 10 15 20 25 30
![Page 35: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/35.jpg)
EvaluationHow does the throughput change, if the cluster changes?
35
Cluster • up to 32 workers Hardware • Azure NC6 VMs
• Nvidia K80 Software • KungFu 0.2.1 • Tensorflow 1.15 ML • ResNet50 • ImageNet
Cluster size5 10 15 20 25 30
![Page 36: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/36.jpg)
EvaluationHow does the throughput change, if the cluster changes?
Switching from S-SGDto AD-PSGD
36
Cluster • up to 32 workers Hardware • Azure NC6 VMs
• Nvidia K80 Software • KungFu 0.2.1 • Tensorflow 1.15 ML • ResNet50 • ImageNet
Changing clusters need adaptation
Cluster size5 10 15 20 25 30
![Page 37: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/37.jpg)
Conclusion
37
• Transient cloud resources offer potential to save money for ML training
• No system that runs exclusively on transient resources and has low overhead
![Page 38: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/38.jpg)
Conclusion
38
• Transient cloud resources offer potential to save money for ML training
• No system that runs exclusively on transient resources and has low overhead
Spotnik• Repair cluster with low overhead
• Ensure consistent model updates
• Adapt to changes of the cluster
![Page 39: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/39.jpg)
Conclusion
39
• Transient cloud resources offer potential to save money for ML training
• No system that runs exclusively on transient resources and has low overhead
Spotnik• Repair cluster with low overhead
• Ensure consistent model updates
• Adapt to changes of the cluster
KungFu github.com/lsds/KungFu
![Page 40: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College](https://reader035.vdocuments.mx/reader035/viewer/2022071421/611a3fe401d33f36ba168695/html5/thumbnails/40.jpg)
Conclusion
40
• Transient cloud resources offer potential to save money for ML training
• No system that runs exclusively on transient resources and has low overhead
• Repair cluster with low overhead
• Ensure consistent model updates
• Adapt to changes of the cluster
Website lsds.doc.ic.ac.uk | E-Mail [email protected] Twitter @marwage
SpotnikKungFu github.com/lsds/KungFu