lecture 12: model serving - university of washingtondlsys.cs.washington.edu/pdf/lecture12.pdflecture...

48
Lecture 12: Model Serving CSE599W: Spring 2018

Upload: others

Post on 20-May-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Lecture 12: Model Serving

CSE599W: Spring 2018

Page 2: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Deep Learning Applications

“That drink will get you to 2800 calories for today”

“I last saw your keys in the store room”

“Remind Tom of the party”

“You’re on page 263 of this book”

Intelligent assistant Surveillance / Remote assistance Input keyboard

Page 3: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Model Serving Constraints● Latency constraint

○ Batch size cannot be as large as possible when executing in the cloud○ Can only run lightweight model in the device

● Resource constraint○ Battery limit for the device○ Memory limit for the device○ Cost limit for using cloud

● Accuracy constraint○ Some loss is acceptable by using approximate models○ Multi-level QoS

Page 4: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Runtime Environment

Sensor Processor Radio Cloud

4

CameraTouch screenMicrophone

CPUGPUFPGAASIC

Wifi4G / LTEBluetooth

CPU serverGPU serverTPU / FPGA server

streaming

Page 5: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Resource usage for a continuous vision app

Imager Processor Radio

Omnivision OV274090mW

Tegra K1 GPU290GOPS@10W= 34pJ/OP

Qualcomm SD810 LTE>800mWAtheros 802.11 a/g15Mbps@700mW= 47nJ/b

Cloud

Amazon EC2CPU c4.large 2x400GFLOPS $0.1/hGPU g2.2xlarge 2.3TFLOPS $0.65/h

5Huge gap between workload and budget

BudgetDevice power Cloud cost

30% of 10Wh for 10h = 300mW $10 person/year

Workload Deep learning 300GFLOPS @ 30GFLOPs/frame, 10fps

Compute power

9GFLOPS 3.5GFLOPS (GPU) / 8GFLOPS (CPU)

Page 6: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Outline● Model compression

● Serving System

Page 7: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Model Compression● Tensor decomposition● Network pruning● Quantization● Smaller model

Page 8: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Matrix decomposition

* *

Fully-connected layer● Memory reduction:● Computation reduction:

Merge into one matrix

M

N

M

R

R

R

N

Page 9: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Tensor decompositionConvolutional layer● Memory reduction: ● Computation reduction:

* Kim, Yong-Deok, et al. "Compression of deep convolutional neural networks for fast and low power mobile applications." ICLR (2016).

bounded by

Page 10: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Decompose the entire model

Page 11: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Fine-tuning after decomposition

Page 12: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Accuracy & Latency after Decomposition

Page 13: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Network pruning: Deep compression

* Song Han, et al. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." ICLR (2016).

Page 14: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Network pruning: prune the connections

Page 15: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Network pruning: weight sharing1. Use k-means clustering to

identify the shared weights for each layer of a trained network. Minimize

2. Finetune the neural network using shared weights.

Page 16: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Network pruning: accuracy

Page 17: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Network pruning: accuracy vs compression

Page 18: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Quantization: XNOR-Net

* Mohammad Rastegari, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV (2016).

Page 19: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Quantization: binary weights

Page 20: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Quantization: binary input and weights

Page 21: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Quantization: accuracy

Page 22: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Quantization

Page 23: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Quantization

Page 24: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Smaller model: Knowledge distillation● Knowledge distillation: use a teacher model (large model) to train a

student model (small model)

* Romero, Adriana, et al. "Fitnets: Hints for thin deep nets." ICLR (2015).

Page 25: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Smaller model: accuracy

Page 26: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

DiscussionWhat are the implications of these model compression techniques for serving?

● Specialized hardware for sparse models○ Song Han, et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network.”

ISCA 2016

● Accuracy and resource trade-off○ Han, Seungyeop, et al. "MCDNN: An Approximation-Based Execution Framework for Deep

Stream Processing Under Resource Constraints." MobiSys (2016).

Page 27: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Serving system

Page 28: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Serving system● Goals:

○ High flexibility for writing applications○ High efficiency on GPUs○ Satisfy latency SLA

● Challenges○ Provide common abstraction for different frameworks○ Achieve high efficiency

■ Sub-second latency SLA that limits the batch size■ Model optimization and multi-tenancy causes long tail

Page 29: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

FrontendFrontend

Nexus: efficient neural network serving system…requests

unit query

Frontend

30

BackendM2

GPU

scheduler

Backend

GPU

scheduler

M3M1

BackendM2

GPU

scheduler

BackendM2

GPU

scheduler

Backend

GPU

scheduler

M4M3

Remarks• Frontend runtime library allows

arbitrary app logic• Packing models to achieve

higher utilization• A GPU scheduler allows new

batching primitives• A batch-aware global scheduler

allocates GPU cycles for each model

Global Scheduler

App 1 App 2 App 3 Control flow

Nexus RuntimeLoad Balancer Dispatch Approx. lib

Application Logic

Page 30: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Flexibility: Application runtimeclass ModelHandler: # return output future def Execute(input)

class AppBase: # return ModelHandler def GetModelHandler(framework, model, version, latency_sla) # Load models during setup time, implemented by developer def Setup() # Process requests, implemented by developer def Process(request)

31

Async RPC, execute the model remotely

Send load model request to global scheduler

Page 31: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Application example: Face recognitionclass FaceRecApp(AppBase): def Setup(self):    self.m1 = self.GetModelHandler(“caffe”, “vgg_face”, 1, 100) self.m2 = self.GetModelHandler(“mxnet”, “age_net”, 1, 100)

def Process(self, request): ret1 = self.m1.Execute(request.image) ret2 = self.m2.Execute(request.image)    return Reply(request.user_id, ret1[“name”], ret2[“age”])

32

Load model from different framework

Execute concurrently on remote GPUs

Force to synchronize when accessing future data

Page 32: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Application example: Traffic Analysisclass TrafficApp(AppBase): def Setup(self):    self.det = self.GetModelHandler(“darknet”, “yolo9000”, 1, 300)    self.r1 = self.GetModelHandler(“caffe”, “vgg_face”, 1, 150)    self.r2 = self.GetModelHandler(“caffe”,“googlenet_car”, 1, 150)

def Process(self, request): persons, cars = [], [] for obj in self.det.Execute(request.image): if obj[“class”] == “person”: persons.append(self.r1.Execute(request.image[obj[“rect”]]) elif obj[“class”] == “car”: cars.append(self.r2.Execute(request.image[obj[“rect”]])    return Reply(request.user_id, persons, cars)

33

Page 33: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

High Efficiency

• For high request rate, high latency SLA workload, saturate GPU efficiency by using large batch size

34

Request rate

Latency SLA

low

high

highlow

Workload characteristic

batch size >= 32

VGG16 throughput vs batch size

GPU saturate region

Page 34: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

High Efficiency

• Suppose we can choose a different batch size for each op (layer), and allocate dedicated GPUs for each op.

35

Request rate

Latency SLA

low

high

highlow

Workload characteristic

...

op1 op2 opn

GPU

GPU

GPU

VGG16 conv2_1 (winograd) VGG16 fc6

batch size 16

Page 35: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Split Batching

•  

36

Page 36: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Split Batching

37

VGG16 optimal batch size for each segment for latency SLA 15ms

Batch size 3 for entire model

40%

13%

Page 37: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

High Efficiency•  

38

Request rate

Latency SLA

low

high

highlow

Workload characteristic

batching

exec

exec

time

batch k

batch k+1

GPU is idle during this period

Execute multiple models on one GPU

Page 38: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Execute multiple models on a single GPU

39

A1 A2 A3 A4 A5

batch k

Model A requests

GPU execution A1 A2 A3 A4 A5

batch k+1

worst latency

Page 39: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Execute multiple models on a single GPU

40

A1 A2 A3 A4 A5

batch k

Model A requests

GPU execution A1 A2 A3 A4 A5

batch k+1

A1 A2 A3 A4 A5B1 B2 B3 B4 B5

Model A requests

GPU execution

Model B requests

A1 B1 ...

Operations of model A and B interleave

worst latency

worst latency

Next batch

A5

Page 40: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Execute multiple models on a single GPU

41

A1 A2 A3 B1 B2 B3 B4 B5

Model A requests

GPU execution

Model B requests

...A4 A5 A1 A2 A3 A4 A5

batch kA batch kBbatch kA+1

worst latency

Use larger batch size as latency is reduced and predictive

Page 41: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

High EfficiencySolution depends

• If saturate GPU in temporal domain due to low latency: allocate dedicated GPU(s)

• If not: can use multi-batching to share GPU cycles with other models

42

Request rate

Latency SLA

low

high

highlow

Workload characteristic

Page 42: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Prefix batching for model specialization

• Specialization (long-term / short-term) re-trains last a few layers

43A new model

Page 43: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Prefix batching for model specialization

• Specialization (long-term / short-term) re-trains last a few layers

• Prefix batching allows batch execution for common prefix

44A new model

Different suffixes execute individually

Common prefix can be batched together

Page 44: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Meet Latency SLA: Global scheduler

• First apply split batching and prefix batching if possible• Multi-batching: bin-packing problem to pack models into GPUs• Bin-packing optimization goal

• Minimize the resource usage (number of GPUs)

• Constraint• Requests have to be served within latency SLA

• Degrees of freedom• Split workloads into smaller tasks• Change the batch size

45

Page 45: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Best-fit decreasing algorithms

 

46

Page 46: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Best-fit decreasing algorithms

 

47

Bin to fit in execution cycles

Page 47: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Best-fit decreasing algorithms

 

48

Bin to fit in execution cycles

...

sort

All residue workloads

Page 48: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink

Reference● Kim, Yong-Deok, et al. "Compression of deep convolutional neural networks for fast and low

power mobile applications." ICLR (2016).● Song Han, et al. "Deep Compression: Compressing Deep Neural Networks with Pruning,

Trained Quantization and Huffman Coding." ICLR (2016).● Romero, Adriana, et al. "Fitnets: Hints for thin deep nets." ICLR (2015).● Mohammad Rastegari, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional

Neural Networks." ECCV (2016).● Crankshaw, Daniel, et al. "Clipper: A Low-Latency Online Prediction Serving System." NSDI

(2017).