model-switching: dealing with fluctuating workloads in ... · model-switching: dealing with...

17
Model-Switching: Dealing with Fluctuating Workloads in MLaaS * Systems 1 New York University, 2 Microsoft [email protected] USENIX HotCloud 2020 [*] Machine Learning-as-a-Service Jeff Zhang 1 , Sameh Elnikety 2 , Shuayb Zarar 2 , Atul Gupta 2 , Siddharth Garg 1

Upload: others

Post on 08-Oct-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Model-Switching: Dealing with Fluctuating Workloads in ... · Model-Switching: Dealing with Fluctuating Workloads in MLaaS*Systems 1New York University, 2Microsoft jeffjunzhang@nyu.edu

Model-Switching: Dealing with Fluctuating Workloads in MLaaS* Systems

1New York University, 2Microsoft

[email protected]

USENIX HotCloud 2020 [*] Machine Learning-as-a-Service

Jeff Zhang1, Sameh Elnikety2, Shuayb Zarar2, Atul Gupta2, Siddharth Garg1

Page 2: Model-Switching: Dealing with Fluctuating Workloads in ... · Model-Switching: Dealing with Fluctuating Workloads in MLaaS*Systems 1New York University, 2Microsoft jeffjunzhang@nyu.edu

Deep-Learning Models are Pervasive

1

Page 3: Model-Switching: Dealing with Fluctuating Workloads in ... · Model-Switching: Dealing with Fluctuating Workloads in MLaaS*Systems 1New York University, 2Microsoft jeffjunzhang@nyu.edu

Computations in Deep Learning

Activation Tensor

Convolution

Convolution FiltersOutput Tensor

=

Output Tensor

...

[Ref.] Sze, V., et al., “Efficient Processing of Deep Neural Networks: A Tutorial and Survey”, IEEE Proceedings 2017.

Convolutions account for more than 90% computation à dominate both run-timeand energy.*

2

Execution time factors: depth, activation/filter size in each layer

Page 4: Model-Switching: Dealing with Fluctuating Workloads in ... · Model-Switching: Dealing with Fluctuating Workloads in MLaaS*Systems 1New York University, 2Microsoft jeffjunzhang@nyu.edu

OfflineTraining

Data

DataCollection

Cleaning &Visualization

Feature Eng. &Model Design

Training &Validation

Model Development

TrainedModelsTraining Pipelines

LiveData

Training

Validation

End UserApplication

Query

Prediction

Prediction Service

Inference

Feedback

Logic

DataScientist

DataEngineer

DataEngineer

Machine Learning Lifecycle (Typical)

3[Ref.] Gonzalez, J. et al., “Deploying Interactive ML Applications with Clipper.”

Online

Model Development• Prescribes model design, architecture and data processing

Training• At scale on live data• Retraining on new data and manage model versioning

Serving or Online Inference• Deploys trained model into device, edge, or cloud

MLaaS

Page 5: Model-Switching: Dealing with Fluctuating Workloads in ... · Model-Switching: Dealing with Fluctuating Workloads in MLaaS*Systems 1New York University, 2Microsoft jeffjunzhang@nyu.edu

MLaaS: Challenges and Limitations

4[Ref.] Gujarati, A. et al., “Swayam: Distributed Autoscaling to Meet SLAs of ML Inference Services with Resource Efficiency”, ACM Middleware 2017.

Existing Solution - Static model versioning

- Tie each application to one specific model at run-time- In the event of load spikes:

• Prune requests (new, low priority, near deadline etc.) - QoS violations, customer L

• Add “significant new capacity" (autoscaling)- Not economically viable, provider L

Maintain QoS under dynamic workloads.

- Service success rates dropped below 99.99%- Teams suffered 2-hour outage in Europe- Free offers and new subscriptions were limited

March 28, 2020

Page 6: Model-Switching: Dealing with Fluctuating Workloads in ... · Model-Switching: Dealing with Fluctuating Workloads in MLaaS*Systems 1New York University, 2Microsoft jeffjunzhang@nyu.edu

0 200 400 600 800 1000Execution Time (ms)

90

90.5

91

91.5

92

92.5

93

93.5

94

94.5

Acc

urac

ySingle Image Inference with ResNet-x

1 thread2 threads4 threads8 threads16 threadsResNet-18

ResNet-34

ResNet-50

ResNet-101

ResNet-152

Opportunity: DNN Model Diversity

For the same application, many models can be trained with tradeoffs among: Accuracy, Inference Time and Computation Cost (Parallelism)

[Ref.] He, K., et al., “Deep Residual Learning for Image Recognition”, IEEE CVPR 2016; 5

Res

idua

l Blo

ck

Model

Page 7: Model-Switching: Dealing with Fluctuating Workloads in ... · Model-Switching: Dealing with Fluctuating Workloads in MLaaS*Systems 1New York University, 2Microsoft jeffjunzhang@nyu.edu

Assumption: fluctuating workloadsfixed hardware capacity

6

Which DNN model to use?

How to allocate resources?

MLaaS Provider

ML AppEnd Users

What is the QoS?

sending requests

rendering predictions

SLAs

Typical SLA Objectives: latency, throughput, cost,…

Make all decisions online!

Questions in this Study

Page 8: Model-Switching: Dealing with Fluctuating Workloads in ... · Model-Switching: Dealing with Fluctuating Workloads in MLaaS*Systems 1New York University, 2Microsoft jeffjunzhang@nyu.edu

2. “Dog”, 0.3s

7

MLaaS Provider

ApplicationEnd Users

What Do Users Care About?

1. “Cat”, 5s

From the users’ perspective, deadline misses and incorrect predictions are equally bad:

• User can always meet deadline by guessing randomly Quick and correct predictions!

Page 9: Model-Switching: Dealing with Fluctuating Workloads in ... · Model-Switching: Dealing with Fluctuating Workloads in MLaaS*Systems 1New York University, 2Microsoft jeffjunzhang@nyu.edu

0 5 10 15 20 25 30Load (Queries Per Second)

0.88

0.89

0.9

0.91

0.92

0.93

0.94

Effe

ctiv

e A

ccur

acy

ResNet152ResNet101ResNet50ResNet34ResNet18

A New Metric for MLaaS

8

No single DNN works best at all load levels

(𝜆: load, 𝑎: baseline accuracy)

Effective Accuracy (𝑎!"") : the fraction of correct predictions within deadline (D)

Likelihood of meeting deadline

𝑎#$$ = 𝑝%,'×𝑎

Page 10: Model-Switching: Dealing with Fluctuating Workloads in ... · Model-Switching: Dealing with Fluctuating Workloads in MLaaS*Systems 1New York University, 2Microsoft jeffjunzhang@nyu.edu

1 2 3 4 5 6 7 8Load (Queries Per Second)

0

500

1000

1500

2000

2500

3000

3500

4000

450099

th p

erce

ntile

tail

late

ncy

(ms)

ResNet-152 Tail Latency Vs Job arrival rate

R:16 T:1R:8 T:2R:4 T:4R:2 T:8R:1 T:16

Characterizing DNN Parallelism

As load increases, additional replicas help more than threads.9

Fixed Capacity:16 threads

End to EndQuery Latency

R: ReplicasT: Threads

Page 11: Model-Switching: Dealing with Fluctuating Workloads in ... · Model-Switching: Dealing with Fluctuating Workloads in MLaaS*Systems 1New York University, 2Microsoft jeffjunzhang@nyu.edu

Online Model Switching Framework

10

Load Change Detection

Offline policy training

Model Switching

Load Best Model Parallelism

0-4 QPS ResNet-152 <R:4 T:4>

> 20 ResNet-18 …

Model that exhibits best effective accuracy is a function of load

Front-End Queries

Predictions

Dynamically select best model (effective accuracy) based on load

Model Switching Controller

Online Serving

SLA deadline

Page 12: Model-Switching: Dealing with Fluctuating Workloads in ... · Model-Switching: Dealing with Fluctuating Workloads in MLaaS*Systems 1New York University, 2Microsoft jeffjunzhang@nyu.edu

Experimental Setup

• Built on top of Clipper, an open-source containerized model-serving

framework (caching, adaptive batching disabled)

• Deployed PyTorch pretrained ResNet models on ImageNet (R: 4 T: 4)

• Two dedicated Azure VMs:

• Server: 32 vCPUs + 128GB RAM

• Client: 8vCPUs + 32GB RAM

• Markov Model based load generator

• Open system model

• Poisson inter-arrivals

11

Model Switching

[Ref.] D. Crankshaw et al., “Clipper: A low-latency online prediction serving system”, USENIX NDSI 2017.

sampling period 1 sec

Page 13: Model-Switching: Dealing with Fluctuating Workloads in ... · Model-Switching: Dealing with Fluctuating Workloads in MLaaS*Systems 1New York University, 2Microsoft jeffjunzhang@nyu.edu

Evaluation: Automatic Model-Switching

12

0 50 100 150 200 250 300Time (sec)

0

5

10

15

20

25Lo

ad (Q

uerie

s Pe

r Sec

ond)

ResNet-18

ResNet-34

ResNet-50

ResNet-101

ResNet-152

Model-Switching can quickly adapt to load spikes.

Page 14: Model-Switching: Dealing with Fluctuating Workloads in ... · Model-Switching: Dealing with Fluctuating Workloads in MLaaS*Systems 1New York University, 2Microsoft jeffjunzhang@nyu.edu

0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5Deadline (ms)

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

Effe

ctiv

e A

ccur

acy

Model SwitchResNet-18ResNet-34ResNet-50ResNet-101ResNet-152

(s)

Evaluation: Effective Accuracy

13

Model-Switching achieves pareto-optimal effective accuracy.

Page 15: Model-Switching: Dealing with Fluctuating Workloads in ... · Model-Switching: Dealing with Fluctuating Workloads in MLaaS*Systems 1New York University, 2Microsoft jeffjunzhang@nyu.edu

Evaluation: Tail Latency

14

Model-Switching tradeoffs deadline slack for accuracy.

SLA deadline:750 ms

0 0.5 1 1.5 2Latency (ms)

0

0.2

0.4

0.6

0.8

1

Perc

enta

ge

Empirical CDF

Model SwitchResNet-18ResNet-34ResNet-50ResNet-101ResNet-152

0.75(s)

Page 16: Model-Switching: Dealing with Fluctuating Workloads in ... · Model-Switching: Dealing with Fluctuating Workloads in MLaaS*Systems 1New York University, 2Microsoft jeffjunzhang@nyu.edu

Model-Switching: Manage Fluctuating Workloads in MLaaS Systems

Thank You and Questions

Jeff [email protected]

Page 17: Model-Switching: Dealing with Fluctuating Workloads in ... · Model-Switching: Dealing with Fluctuating Workloads in MLaaS*Systems 1New York University, 2Microsoft jeffjunzhang@nyu.edu

• How to prepare a pool of models for each application?• Neural Architecture Search, Multi-level Quantization

• Current approach pre-deploys all (20) candidate models • Cold start time (ML): tens of seconds • RAM overheads: currently 11.8% of the total 128 GB RAM

• Reinforcement learning based controller for model switching• Account for job queue status, system load, current latency• Offline training free

• Integrate with existing MLaaS techniques• Batching, caching, autoscaling etc.

• Exploit availability of heterogenous computing resources• CPU, GPU, TPU, FPGA

Discussion and Future Work

15