Model-Switching: Dealing with Fluctuating Workloads in MLaaS* Systems
Jeff Zhang1, Sameh Elnikety2, Shuayb Zarar2, Atul Gupta2, Siddharth Garg1
1New York University, 2Microsoft
USENIX HotCloud 2020. [*] MLaaS: Machine Learning-as-a-Service
Deep-Learning Models are Pervasive
Computations in Deep Learning
[Figure: a convolution layer, where convolution filters slide over an activation tensor to produce an output tensor.]

[Ref.] Sze, V., et al., “Efficient Processing of Deep Neural Networks: A Tutorial and Survey”, Proceedings of the IEEE, 2017.

Convolutions account for more than 90% of the computation, so they dominate both run time and energy.*
Execution time factors: depth, activation/filter size in each layer
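These factors can be made concrete with a quick back-of-the-envelope count. The sketch below (not from the talk; the layer shape is an illustrative ResNet-like 3x3 stage) estimates the multiply-accumulate (MAC) count of one convolution layer, showing how filter size, channel counts, and output resolution multiply together:

```python
def conv_macs(in_c, out_c, k, out_h, out_w):
    """MACs for a standard convolution: each of the out_h*out_w*out_c
    output elements needs k*k*in_c multiply-accumulates."""
    return out_h * out_w * out_c * k * k * in_c

# An early 3x3 stage of a ResNet-like network on a 56x56 feature map:
macs = conv_macs(in_c=64, out_c=64, k=3, out_h=56, out_w=56)
print(macs)  # 115605504 MACs for this single layer
```

Summing this over every layer explains why deeper variants (ResNet-101, ResNet-152) cost several times the run time of ResNet-18.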
Machine Learning Lifecycle (Typical)

[Figure: the ML lifecycle. Offline, data scientists and data engineers carry out model development (data collection, cleaning & visualization, feature engineering & model design, training & validation), producing training pipelines and trained models; training and validation continue on live data. Online, a prediction service runs inference on queries from the end-user application, returns predictions, and feeds logs back into training.]

[Ref.] Gonzalez, J., et al., “Deploying Interactive ML Applications with Clipper.”
- Model Development: prescribes model design, architecture, and data processing
- Training: at scale on live data; retraining on new data and managing model versioning
- Serving, or Online Inference (MLaaS): deploys the trained model on device, at the edge, or in the cloud
MLaaS: Challenges and Limitations

[Ref.] Gujarati, A., et al., “Swayam: Distributed Autoscaling to Meet SLAs of ML Inference Services with Resource Efficiency”, ACM Middleware 2017.

Existing solution: static model versioning, which ties each application to one specific model at run time. In the event of load spikes, the provider can:
- Prune requests (new, low-priority, near-deadline, etc.): QoS violations, unhappy customers
- Add “significant new capacity” (autoscaling): not economically viable, unhappy provider

Goal: maintain QoS under dynamic workloads.

Example (March 28, 2020):
- Service success rates dropped below 99.99%
- Teams suffered a 2-hour outage in Europe
- Free offers and new subscriptions were limited
Opportunity: DNN Model Diversity

For the same application, many models can be trained with tradeoffs among accuracy, inference time, and computation cost (parallelism).

[Figure: single-image inference with ResNet-x (residual-block models). Accuracy (90–94.5%) vs. execution time (0–1000 ms) for ResNet-18/34/50/101/152, each run with 1, 2, 4, 8, or 16 threads.]

[Ref.] He, K., et al., “Deep Residual Learning for Image Recognition”, IEEE CVPR 2016.
Questions in this Study

Assumption: fluctuating workloads, fixed hardware capacity.

[Figure: ML app end users send requests to the MLaaS provider, which renders predictions under SLAs.]

- Which DNN model to use?
- How to allocate resources?
- What is the QoS? Typical SLA objectives: latency, throughput, cost, …

Make all decisions online!
What Do Users Care About?

[Figure: two responses to the same query: 1. “Cat” in 5 s; 2. “Dog” in 0.3 s.]

From the users’ perspective, deadline misses and incorrect predictions are equally bad: a user can always meet the deadline by guessing randomly. What users want is quick and correct predictions!
A New Metric for MLaaS

Effective Accuracy (a_eff): the fraction of correct predictions within the deadline D,

    a_eff = p_{λ,D} × a

where λ is the load, a is the model’s baseline accuracy, and p_{λ,D} is the likelihood of meeting the deadline.

[Figure: effective accuracy (0.88–0.94) vs. load (0–30 queries per second) for ResNet-18/34/50/101/152.]

No single DNN works best at all load levels.
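In practice the deadline-meeting probability can be estimated from measured latencies. The sketch below (illustrative inputs; the latency samples and the 0.76 baseline accuracy are made up, not from the talk) shows the metric as a two-line computation:

```python
def effective_accuracy(latencies_ms, deadline_ms, baseline_acc):
    """a_eff = p_{lambda,D} * a, with p estimated as the fraction of
    sampled query latencies that finish within the deadline."""
    p = sum(1 for t in latencies_ms if t <= deadline_ms) / len(latencies_ms)
    return p * baseline_acc

# A model with 76% baseline accuracy where 9 of 10 queries meet a 750 ms SLA:
lat = [100, 200, 400, 600, 700, 720, 730, 740, 745, 900]
print(effective_accuracy(lat, 750, 0.76))  # 0.9 * 0.76 = 0.684
```

The metric makes the tradeoff explicit: a big model with high a but low p_{λ,D} under load can score worse than a small, fast model.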
Characterizing DNN Parallelism

Fixed capacity of 16 threads, split between R replicas and T threads per replica.

[Figure: ResNet-152 99th-percentile end-to-end query tail latency (ms) vs. job arrival rate (1–8 queries per second), for R:16 T:1, R:8 T:2, R:4 T:4, R:2 T:8, and R:1 T:16.]

As load increases, additional replicas help more than threads.
Online Model Switching Framework

The model that exhibits the best effective accuracy is a function of load. The model-switching controller therefore performs load change detection on incoming front-end queries and dynamically selects the model with the best effective accuracy for the current load, subject to the SLA deadline, while serving predictions online. The switching policy is trained offline, e.g.:

    Load      | Best Model | Parallelism
    ----------|------------|------------
    0–4 QPS   | ResNet-152 | <R:4 T:4>
    …         | …          | …
    > 20 QPS  | ResNet-18  | …
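The offline-trained policy reduces at run time to a threshold lookup. A minimal sketch, assuming illustrative thresholds and parallelism settings (only the 0–4 QPS → ResNet-152 and > 20 QPS → ResNet-18 rows come from the table above; the middle rows are hypothetical):

```python
POLICY = [  # (max QPS, model, replicas, threads), ordered by load level
    (4,  "ResNet-152", 4, 4),            # light load: biggest, most accurate model
    (10, "ResNet-101", 4, 4),            # hypothetical middle rows
    (20, "ResNet-50",  4, 4),
    (float("inf"), "ResNet-18", 4, 4),   # heavy load: fall back to fastest model
]

def select_model(load_qps):
    """Return the pre-profiled best (model, replicas, threads) for a load."""
    for max_qps, model, r, t in POLICY:
        if load_qps <= max_qps:
            return model, r, t

print(select_model(2))   # ('ResNet-152', 4, 4)
print(select_model(25))  # ('ResNet-18', 4, 4)
```

Because the table is computed offline, the online decision is O(number of load levels) and adds negligible latency per sampling period.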
Experimental Setup

- Built on top of Clipper, an open-source containerized model-serving framework (caching and adaptive batching disabled)
- Deployed PyTorch pretrained ResNet models on ImageNet (R:4, T:4)
- Two dedicated Azure VMs:
  - Server: 32 vCPUs + 128 GB RAM
  - Client: 8 vCPUs + 32 GB RAM
- Markov-model-based load generator:
  - Open system model
  - Poisson inter-arrivals
- Model switching with a sampling period of 1 sec

[Ref.] Crankshaw, D., et al., “Clipper: A Low-Latency Online Prediction Serving System”, USENIX NSDI 2017.
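A load generator of this kind can be sketched as a Markov-modulated Poisson process: inter-arrival times are exponential, and the arrival rate jumps between discrete load levels via a small Markov chain. The levels and transition probability below are illustrative assumptions, not the paper's actual parameters:

```python
import random

LEVELS = [2, 8, 20]   # QPS levels the chain moves between (illustrative)
STAY_PROB = 0.9       # chance of keeping the current level at each 1 s step

def generate_arrivals(duration_s, seed=0):
    """Yield query arrival times (seconds): Poisson inter-arrivals whose
    rate is driven by a Markov chain updated once per second."""
    rng = random.Random(seed)
    state, t, next_switch = 0, 0.0, 1.0
    while True:
        t += rng.expovariate(LEVELS[state])  # exponential inter-arrival gap
        while t >= next_switch:              # advance the chain up to time t
            next_switch += 1.0
            if rng.random() > STAY_PROB:
                state = rng.randrange(len(LEVELS))
        if t >= duration_s:
            return
        yield t

arrivals = list(generate_arrivals(60))
print(len(arrivals))  # total queries in a 60 s open-system trace
```

Because the system is open, arrivals are generated independently of server progress, so queueing delay under overload shows up directly in tail latency.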
Evaluation: Automatic Model-Switching

[Figure: load (0–25 queries per second) over time (0–300 s), annotated with the model selected at each point (ResNet-18/34/50/101/152).]

Model-Switching can quickly adapt to load spikes.
Evaluation: Effective Accuracy

[Figure: effective accuracy (0.6–0.95) vs. deadline (0.7–1.5 s) for Model-Switching and for fixed ResNet-18/34/50/101/152.]

Model-Switching achieves Pareto-optimal effective accuracy.
Evaluation: Tail Latency

SLA deadline: 750 ms.

[Figure: empirical CDF of latency (0–2 s) for Model-Switching and fixed ResNet-18/34/50/101/152, with the 750 ms deadline marked at 0.75 s.]

Model-Switching trades off deadline slack for accuracy.
Model-Switching: Manage Fluctuating Workloads in MLaaS Systems

Thank You and Questions

Jeff Zhang, [email protected]
Discussion and Future Work

- How to prepare a pool of models for each application? Neural Architecture Search, multi-level quantization
- The current approach pre-deploys all 20 candidate models:
  - Cold-start time (ML): tens of seconds
  - RAM overhead: currently 11.8% of the total 128 GB RAM
- Reinforcement-learning-based controller for model switching:
  - Accounts for job-queue status, system load, current latency
  - Free of offline training
- Integrate with existing MLaaS techniques: batching, caching, autoscaling, etc.
- Exploit the availability of heterogeneous computing resources: CPU, GPU, TPU, FPGA