
Page 1: KnowledgeNet: Disaggregated and Distributed Training and Serving of Deep Neural Networks

Virtualized Infrastructures, Systems, & Applications

KnowledgeNet: Disaggregated and Distributed Training and Serving of Deep

Neural Networks

Saman Biookaghazadeh, Yitao Chen, Kaiqi Zhao, Ming Zhao

Arizona State University

http://visa.lab.asu.edu

Page 2: Background

Background

● Deep neural networks (DNNs) have shown great potential
  o Solving many challenging problems
  o Relying on large systems to learn from big data

● Limitations of centralized learning
  o Cannot scale to handle future demands
    • "20 billion connected devices by 2020"
    • Real-time learning/decision making
  o Cannot exploit the capabilities of the increasingly powerful edge
    • Multiprocessing, accelerators


Page 3: Objectives

Objectives

● Distributed learning across heterogeneous resources
  o Including diverse and potentially weak resources
  o Across edge devices and cloud resources

● Limitations of existing solutions
  o Data parallelism does not work for heterogeneous resources
  o Conventional backpropagation-based training makes model parallelism difficult
  o Edge devices cannot train large models
  o Model compression works only for inference, not training


Page 4: Proposed approach

Proposed approach

● Using synthetic gradients to achieve model parallelism
  o Split a large DNN model into many small models
  o Deploy small models on distributed and potentially weak resources
  o Use synthetic gradients to decouple the training of the small models


Page 5: Conventional training

● To train a neural network model
  o Forward path: calculate the loss of the class prediction
  o Backward path: calculate the gradients of the loss w.r.t. the parameters
● Needs tens of thousands of iterations to achieve high accuracy
● Layers are coupled and training is sequential
● Limited scalability in training

[Figure: conventional training of a VGG-style network (Conv1_1, Conv1_2, Pool1, Conv2_1, Conv2_2, Pool2, Conv3_1, Conv3_2, Conv3_3, Pool3, FC4_1, FC4_2, FC4_3 -> prediction y). The layers are tightly coupled: each layer output h(i) must flow forward and each error δ(i) must flow backward through the whole chain. h(i): output of the ith layer; δ(i): error of the ith layer.]
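For concreteness, here is a minimal, illustrative sketch of this conventional, tightly coupled training loop, written in PyTorch; the layer sizes loosely follow the 4-layer model from the preliminary results, and train_loader stands for an assumed MNIST data loader (neither is taken from the slides):

import torch
import torch.nn as nn

# Illustrative 4-layer network: one convolution and three fully-connected layers.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.Flatten(),
    nn.Linear(16 * 28 * 28, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for images, labels in train_loader:   # train_loader: assumed MNIST DataLoader
    optimizer.zero_grad()
    logits = model(images)            # forward path: class predictions
    loss = loss_fn(logits, labels)    # loss of the class prediction
    loss.backward()                   # backward path: gradients of the loss w.r.t. the parameters
    optimizer.step()                  # every update waits for the full forward and backward pass

Because the backward pass cannot start until the forward pass finishes, and each layer's gradient depends on the layers after it, the layers cannot be trained independently; this is the scalability limit the next slide addresses.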

Page 6: Training with synthetic gradients

Training with synthetic gradients


● Use another small neural network to predict the error -> synthetic gradients
● Use synthetic gradients to train the small model instead of waiting for the actual gradients from backpropagation
● Decouple the layers

[Figure: the network split into modules, e.g., M3 (Conv3_1, Conv3_2, Conv3_3, Pool3) and M4 (FC4_1, ...). M3 forwards its outputs h(3_1), h(3_2), h(3_3) and is updated immediately using a synthetic error δ(4_1) predicted locally; the predictor is trained to minimize the difference between the synthetic δ(4_1) and the actual δ(4_1). h(i): output of the ith layer; δ(i): error of the ith layer.]
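A minimal sketch of this decoupled update, assuming PyTorch; the module class, the linear gradient predictor, the learning rate, and the assumption that each module emits a flat (batch, features) output are illustrative choices rather than the authors' implementation:

import torch
import torch.nn as nn

class SGModule(nn.Module):
    # One piece of the split model (e.g., M3), trained locally with synthetic gradients.
    def __init__(self, layers, out_dim):
        super().__init__()
        self.layers = layers
        # Small network that predicts the error (gradient) of this module's output.
        self.sg_predictor = nn.Linear(out_dim, out_dim)
        self.opt = torch.optim.SGD(self.parameters(), lr=0.01)

    def forward_and_update(self, h_in):
        h_out = self.layers(h_in)
        synthetic_grad = self.sg_predictor(h_out.detach())  # predicted error for h_out
        # Update this module right away with the synthetic gradient,
        # instead of waiting for the actual gradient from backpropagation.
        self.opt.zero_grad()
        h_out.backward(synthetic_grad.detach())
        self.opt.step()
        return h_out.detach()  # downstream modules only need the activations

    def refine_predictor(self, h_out, true_grad):
        # When the actual error eventually arrives from the next module,
        # train the predictor to minimize (synthetic error - actual error).
        loss = ((self.sg_predictor(h_out) - true_grad) ** 2).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()

Each module can then run on a separate, possibly weak, device and exchange only activations (plus occasional true gradients to refine the predictors), which is what makes the disaggregated, model-parallel deployment possible.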

Page 7: Proposed approach

Proposed approach

● Using knowledge transfer to improve training on weak resources
  o Train a large, teacher model in the cloud
  o Train many small, student models on the devices
  o Exploit the knowledge of the teacher to train the students
  o Improve the accuracy and convergence speed of the students


[Figure: a large teacher model in the cloud next to a small student model on the device]

Page 8: Knowledge transfer

Knowledge transfer


[Figure: teacher network (conv1-conv6, fc1, softmax) and student network (conv1-conv3, fc1, softmax) trained on the same input data. Selected pairs of teacher and student layer outputs are matched with RMSE losses, and the student's softmax is also trained with cross entropy against the correct labels.]

Minimize the squared difference between layer outputs (RMSE) to allow the student to learn from the teacher's representations

Layer mapping (student layer -> teacher layer): 1st -> 1st, 2nd -> 4th, 3rd -> 6th, softmax -> softmax
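A minimal sketch of this transfer loss, assuming PyTorch; the activation dictionaries, the paired_layers mapping, and the hint_weight parameter are illustrative assumptions:

import torch
import torch.nn.functional as F

def knowledge_transfer_loss(student_acts, teacher_acts, student_logits, labels,
                            paired_layers, hint_weight=1.0):
    # student_acts / teacher_acts: dicts mapping layer name -> activation tensor.
    # paired_layers: student layer -> teacher layer, following the mapping above, e.g.
    #   {"conv1": "conv1", "conv2": "conv4", "conv3": "conv6"}
    ce = F.cross_entropy(student_logits, labels)  # cross entropy with the correct labels
    hint = 0.0
    for s_name, t_name in paired_layers.items():
        s = student_acts[s_name]
        t = teacher_acts[t_name].detach()         # the teacher is not updated
        # RMSE between the student's and the teacher's layer outputs
        hint = hint + torch.sqrt(torch.mean((s - t) ** 2) + 1e-8)
    return ce + hint_weight * hint

In practice the paired student and teacher activations must have matching shapes (or be passed through a small adapter), a detail the slides do not spell out.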

Page 9: Preliminary results

Preliminary results

● Using synthetic gradients to achieve model parallelism
  o Model: a simple 4-layer network
    • One convolution layer and three fully-connected layers
  o MNIST dataset
  o 500K iterations of training


Training accuracy:

Model           Backpropagation   Synthetic gradients
4-layer model   0.984             0.977
8-layer model   0.992             0.924
VGG16           0.994             0.845

[Plot: validation accuracy vs. training iterations (0 to 500K) for synthetic gradients (sg validation) and backpropagation (bp validation)]

Page 10: Preliminary results

Preliminary results

● Using knowledge transfer to improve training on weak resources
  o Teacher model: VGG-16 (8.5M parameters)
  o Student model: miniaturized VGG-16 (3.2M parameters)
  o Training the student for 100 epochs


Model                   Accuracy
Teacher                 76.88%
Student (dependent)     74.12%
Student (independent)   61.24%

“Are Existing Knowledge Transfer Techniques Effective For Deep Learning on Edge Devices?” R. Sharma, S. Biookaghazadeh, and M. Zhao, EDGE, 2018

Page 11: Conclusions and future work

Conclusions and future work

● Rethink DNN platforms to handle future learning needs
  o Utilize heterogeneous resources to train DNNs
  o Distribute learning across edge and cloud
● Achieve efficient model parallelism
  o Use synthetic gradients to decouple the training of distributed models
  o Need to further improve its accuracy for complex models
● Overcome the resource constraints of edge devices
  o Use knowledge transfer to help train on-device models
  o Need to further study its effectiveness under scenarios with limited data and limited supervision


Page 12: Acknowledgement

Acknowledgement

● National Science Foundation
  o GEARS project: CNS-1629888
  o CNS-1619653, CNS-1562837, CNS-1629888, CMMI-1610282, IIS-1633381
● VISA Lab @ ASU
  o Saman, Yitao, Kaiqi, and others
● Thank you!
  o Questions and suggestions?
  o Come to our poster at the Happy Hour!
