Efficiently Training CNN with FPGA - Xilinx
TRANSCRIPT
© Copyright 2018 Xilinx
Presented By
Yu Wang
Associate Professor, Dept. of E.E., Tsinghua University
Co-founder of DeePhi Tech.
Efficiently Training CNN with FPGA
An Era of Deep Learning
[Figure: neural networks are everything and everywhere: ADAS, cloud services, smart phones, cameras]
Two Stages in Deep Learning: Training and Inference
According to Prof. Andrew Ng:
Rocket: deep learning
Fuel: big data
Engine: computing platform
Training determines how accurate the model can be.
Inference determines how many applications can use DL.
[Chart: ImageNet top-5 accuracy rising from 84.70% to 97.00% across successive models; training runs on servers, inference on clients]
Efficient inference has received most of the attention
[Chart: performance (10 GOPS to 100 TOPS) vs. power (0.001 W to 1000 W) for ASIC, FPGA, and GPU neural-network accelerators, including NVIDIA Xavier, Cambricon DianNao, Mobileye EyeQ4, DeePhi Tingtao, and Horizon Robotics Sunrise & Journey]
The latest version can be found at: https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/
FPGAs can achieve energy efficiency similar to that of ASICs and GPUs
What about training?
˃ Training is also heavy work!
[Diagram: samples and labels feed inference; the loss drives the update]
Workload = #Samples × (Inference + Update) × #Iterations
           ~10^6       ~10^10 FLOPs           ~100
Training a VGG model on the ImageNet dataset: 4 × NVIDIA Titan Xp (TDP 250 W) × 60 hr
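A quick back-of-the-envelope check of these numbers (a sketch in Python; the sustained-throughput figure is an assumed utilization, not from the slide):

```python
# Rough training-workload estimate using the slide's approximate figures.
samples = 1.2e6          # ImageNet training images (~10^6)
flops_per_sample = 1e10  # inference + update for VGG (~10^10 FLOPs)
epochs = 100             # passes over the dataset (~100 iterations)

total_flops = samples * flops_per_sample * epochs
print(f"total work: {total_flops:.1e} FLOPs")        # ~1.2e18 FLOPs

# Assume 4 Titan Xp GPUs sustain ~1.4 TFLOPS each in practice
# (an illustrative utilization figure, not a measured one).
sustained = 4 * 1.4e12
print(f"time: {total_flops / sustained / 3600:.0f} hours")  # ~60 hr
```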
We also need energy-efficient training
˃ For cloud services:
The building cost of a data center is about $10,000-20,000 per kW [1]
• For end applications:
• A changing environment means a changing model
• Platform power limitations:
• Car: 10-100 W
• Satellite: hard to dissipate heat
[Diagram: cloud training suffers from unstable connections and privacy concerns; local training keeps data on the device, with or without a backend]
[1] Kontorinis V, Zhang L E, Aksanli B, et al. Managing distributed UPS energy for effective power capping in data centers. ISCA 2012: 488-499.
What does training do?
˃ Stochastic Gradient Descent (or other variants)
Inference: y = W·x
Calculate the error of the output: δy
Update: δW = δy·x',  W = W + λ·δW
Back-propagate: δx = δy·W'
• As in inference, the main workload of training is still sums of products
• Maybe we can do the same things as for inference
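As a concrete illustration, here is a minimal NumPy sketch of one SGD step for a single fully-connected layer; the function name and learning rate are illustrative, and the update is written with the conventional descent sign:

```python
import numpy as np

def sgd_step(W, x, y_target, lr=0.01):
    # Inference: y = W·x
    y = W @ x
    # Error at the output (here, the gradient of an L2 loss)
    dy = y - y_target
    # Back-propagate: δx = W'·δy, handed to the previous layer
    dx = W.T @ dy
    # Update: δW = δy·x', then W = W - λ·δW (descent direction)
    dW = np.outer(dy, x)
    W -= lr * dW
    return W, dx

W = 0.1 * np.random.randn(4, 8)
W, dx = sgd_step(W, x=np.random.randn(8), y_target=np.zeros(4))
```

Note that every heavy line (y, δx, δW) is a sum of products, which is the slide's point.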
What we have done with inference
[Diagram: the inference stack, a software-hardware co-design. DL Framework → Compression (Pruning, Quantization) → Compilation (Compiler, Assembler) → Core API, Driver, Runtime, Loader, Profiler → DPU Platform (ASIC)]
Techniques for inference: Software-Hardware Co-design
˃ Quantization
[Chart: top-5 accuracy of GoogLeNet, VGG-16, SqueezeNet, and VGG-CNN-F under fp-32, 16-bit, 8-bit, and 6-bit quantization]
Negligible accuracy loss with 8-bit fixed-point weights & activations [1]; saves FPGA resources.
[1] Guo K, Han S, Yao S, et al. Software-Hardware Codesign for Efficient Neural Network Acceleration. IEEE Micro, 2017, 37(2): 18-25.
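A minimal sketch of the kind of linear fixed-point quantization involved (assuming symmetric, power-of-two scaling; the actual per-layer scheme in [1] may differ):

```python
import numpy as np

def quantize(x, bits=8):
    # Pick a power-of-two scale so the largest magnitude fits the range.
    frac_bits = bits - 1 - int(np.ceil(np.log2(np.abs(x).max() + 1e-12)))
    scale = 2.0 ** frac_bits
    q = np.clip(np.round(x * scale), -2**(bits - 1), 2**(bits - 1) - 1)
    return q / scale  # the dequantized value the network computes with

w = np.random.randn(64, 64).astype(np.float32)
for bits in (16, 8, 6):
    err = np.abs(w - quantize(w, bits)).max()
    print(f"{bits}-bit max error: {err:.4f}")  # error grows as bits shrink
```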
Techniques for inference: Software-Hardware Co-design
˃ Sparsification [1]
Negligible WER increase with only 10% of the weights retained
6x acceleration with customized hardware
[1] Han S, Kang J, Mao H, et al. ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA. ISFPGA 2017: 75-84.
Q: Can we take everything from inference to training?
A: Things are different!
Challenges
˃ Quantization for training is more difficult than for inference [1]
Baseline: 36.7 / 15.3
High accuracy vs. narrow bit-width
[1] Wu S, Li G, Chen F, et al. Training and Inference with Integers in Deep Neural Networks. arXiv preprint arXiv:1802.04680, 2018.
Challenges
˃ Pruning is applied after the network is well trained [1]
[Diagram: normal training is a single dense run (75 hr), which cannot be accelerated utilizing sparsity. To obtain a sparse network, the pipeline becomes training (75 hr) → pruning → fine-tune (173 hr), and only the fine-tune phase can be accelerated utilizing sparsity]
Using sparsity only brings more workload to training.
[1] Han S, Pool J, Tran J, et al. Learning both weights and connections for efficient neural networks. NIPS 2015: 1135-1143.
Opportunity
˃ Previous hardware only utilizes operand sparsity
C = ΣAB (A and B can be sparse)
Applies to the inference and back-propagate phases
Examples: ESE [1], SCNN [2]
[1] Han S, Kang J, Mao H, et al. ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA. ISFPGA 2017: 75-84.
[2] Parashar A, Rhu M, Mukkara A, et al. SCNN: An accelerator for compressed-sparse convolutional neural networks. ISCA 2017: 27-40.
• We should also utilize result sparsity: C = ΣAB (C is sparse)
• Applies to the update phase: δW = δy·x'
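A sketch of the difference (the helper below is hypothetical): with operand sparsity the inputs of C = ΣAB are sparse, while in the update phase δW = δy·x' both operands are dense but, for a pruned network, only the surviving positions of the result are needed:

```python
import numpy as np

def sparse_update(dy, x, mask):
    # Result sparsity: compute δW = δy·x' only where the weight survives.
    rows, cols = np.nonzero(mask)
    dW = np.zeros(mask.shape)
    dW[rows, cols] = dy[rows] * x[cols]   # one multiply per kept weight
    return dW

mask = np.random.rand(128, 256) < 0.3     # ~70% of the weights pruned
dy, x = np.random.randn(128), np.random.randn(256)
dW = sparse_update(dy, x, mask)           # ~3x fewer multiplies than dense
```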
Our Solution
1. A training process with fixed-point gradients, activations, and weights
32-bit floating-point multiplication -> 8-bit fixed-point multiplication
8x energy saving
2. Pruning before the training process converges
Reduces the floating-point training phase to 1/3
3. A customized FPGA accelerator design to accelerate training of the sparse model
Potentially more than 3x speed-up for the largest layers
[Pipeline: CPU/GPU: data pre-processing, floating-point training → CPU: pruning, sparsity encoding, quantization → FPGA: train the pruned network]
Training with fixed-point data
˃ Replace all floating-point data with fixed-point data
˃ Short bit-width for the MACs: reduces computation cost (int8)
˃ Long bit-width for weight storage: able to accumulate small gradients (int32)
[Diagram: layer i with forward-pass (FP), error-back-propagation (EB), and weight-gradient (WG) paths between y_{i-1}, y_i, ∂E/∂y_{i-1}, ∂E/∂y_i, and ∂E/∂W_i; shift & cut steps convert the wide accumulations back to narrow operands]
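A sketch of the int8/int32 split and the shift & cut step (shift amounts and names are illustrative; the real design chooses them per layer):

```python
import numpy as np

def shift_and_cut(acc32, shift, bits=8):
    # Arithmetic right-shift the wide value, then saturate to int8 range.
    x = acc32 >> shift
    return np.clip(x, -2**(bits - 1), 2**(bits - 1) - 1).astype(np.int8)

# Long bit-width storage: small gradient contributions accumulate in
# int32 even when a single step is below int8 resolution.
w32 = np.zeros(16, dtype=np.int32)                     # weight store
a = np.random.randint(-128, 128, 16).astype(np.int32)  # int8 activations
g = np.random.randint(-128, 128, 16).astype(np.int32)  # int8 gradients
w32 += a * g                                           # MAC in 32 bits
w8 = shift_and_cut(w32, shift=6)                       # narrow copy for MACs
```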
Pruning before convergence
˃ 2D-kernel-wise pruning for CONV layers
˃ Normal pruning for FC layers
˃ Chosen for the simplicity of the hardware and the sparse coding style
[Diagram: an M×N fully-connected layer (input neurons 0..M, output neurons 0..N) vs. an M×N-channel convolution layer with 3×3 kernels (input channels 0..M, output channels 0..N)]
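A sketch of 2D-kernel-wise pruning (the keep ratio and L1 criterion are illustrative; the point is that each surviving kernel is indexed by a single (output, input) coordinate, which keeps the sparse coding simple):

```python
import numpy as np

def prune_kernels(K, keep_ratio=0.5):
    # K: (out_channels, in_channels, 3, 3); score each whole 2D kernel.
    norms = np.abs(K).sum(axis=(2, 3))
    thresh = np.quantile(norms, 1.0 - keep_ratio)
    mask = norms >= thresh                  # (out, in) keep map
    return K * mask[:, :, None, None], mask

K = np.random.randn(64, 32, 3, 3)
K_pruned, mask = prune_kernels(K, keep_ratio=0.4)
print("kernels kept:", mask.mean())         # ~0.4
```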
Pruning before convergence
˃ Dataset: CIFAR-10; Network: VGG-11
˃ Pruning criterion: <1% accuracy loss
[Chart: ratio of pruned weights (%) vs. the epoch at which pruning is applied (30-240), for conv1 through conv5_2, dense6, dense_7, and dense8]
More than half of the weights can be pruned
Pruning and Quantization
˃ Dataset: CIFAR-10; Network: VGG-11
˃ Weight: 32/24/16 bit; Activation: 8 bit
˃ 24-bit weights and pruning as early as epoch 60 are enough for a good training result
Hardware design
[Diagram: overall architecture. The CPU streams data to the FPGA, which contains a Pre-Stream module (FP: ReLU/pooling; BP: sum), an array of PEs 1..N (FP & BP: CONV/FC), and a Post-Stream module (BP: ReLU/pooling), connected to DRAM through a controller and interconnect. Each PE holds a MAC array with X and Y buffers, data, weight, and result buffers, an address generation unit (AGU), CONV/FC counters, a MUX, a Pre-Stream controller, and inter-PE connections]
© Copyright 2018 Xilinx
Parallelism
˃ Batching is welcome in training
Process a batch in parallel
˃ Input / output channels
! Affected by sparsity
Duplicate input channels and parallelize over output channels
˃ Feature-map pixels / convolution kernel
! Loop sizes vary greatly (see the equations and sketch below)
Configurable parallelism
Inference: F_j^l = Σ_{i=0}^{M-1} conv2d(F_i^{l-1}, K_ij) + b_j,  j = 0, 1, ..., N-1  (large output loop)
Update: dK_ij = conv2d(F_i^{l-1}, dF_j^l),  i = 0, ..., M-1,  j = 0, ..., N-1  (small output loop)
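The shape mismatch is easy to see numerically (a sketch using SciPy's correlate2d as the conv2d; sizes are illustrative):

```python
import numpy as np
from scipy.signal import correlate2d

F   = np.random.randn(32, 32)   # input feature map F_i^{l-1}
K   = np.random.randn(3, 3)     # kernel K_ij
dFo = np.random.randn(30, 30)   # output-gradient map dF_j^l

# Inference: the output is feature-map sized -> a LARGE pixel loop.
out = correlate2d(F, K, mode="valid")      # shape (30, 30)

# Update: the output is kernel sized -> a SMALL loop; fixed per-pixel
# parallelism would idle most units, hence configurable parallelism.
dK = correlate2d(F, dFo, mode="valid")     # shape (3, 3)
print(out.shape, dK.shape)
```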
Configurable parallelism
˃ Each PE locally does multiply-accumulate
˃ Inference: 4 PEs process 4 output pixels of a 2D convolution in parallel
Data is shared within a group of PEs to exploit data locality
Configurable parallelism
• Each PE locally does multiply-accumulate
• Inference: 4 PEs process 4 output pixels of a 2D convolution in parallel; data is shared within a group of PEs to exploit data locality
• Update: handled the same way as inference, this wastes the computation units!
Configurable parallelism
˃ Each PE locally does multiply-accumulate
˃ Inference: 4 PEs process 4 output pixels of a 2D convolution in parallel
Data is shared within a group of PEs to exploit data locality
˃ Update: 4 PEs process parts of the gradient in parallel
Partial gradients are summed in a subsequent module
PE structure to support result sparsity
˃ Random-access buffers for both input and output data
˃ A switch and MUX feed the index and address generation unit
[Diagram: PE datapath with XBuf and YBuf; CONV/FC counters generate input and output addresses, which are paired with Index X and Index Y through the switch/MUX to address the data, parameter, and result buffers]
Use Compressed Sparse Block (CSB) format
˃ FC layers: split the weight matrix into blocks and encode each element in a block with a 2D coordinate
˃ CONV layers: split the kernels into channel blocks and encode each 2D kernel in a block with a 2D coordinate
˃ Transpose by switching the two 2D coordinates in hardware and the block order in software
[Diagram: a weight matrix divided into row blocks]
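A sketch of the CSB idea in Python (block size and names are illustrative): each nonzero is stored with its in-block 2D coordinate, so a transpose needs no re-encoding of the values themselves:

```python
import numpy as np

def csb_encode(W, B=4):
    # Store each nonzero of every B x B block as (row, col, value).
    blocks = {}
    for bi in range(0, W.shape[0], B):
        for bj in range(0, W.shape[1], B):
            sub = W[bi:bi + B, bj:bj + B]
            r, c = np.nonzero(sub)
            blocks[(bi // B, bj // B)] = list(zip(r, c, sub[r, c]))
    return blocks

def csb_transpose(blocks):
    # Swap the block order (software) and the two in-block coordinates
    # (a hardware switch/MUX in the accelerator).
    return {(bj, bi): [(c, r, v) for (r, c, v) in entries]
            for (bi, bj), entries in blocks.items()}

W = np.where(np.random.rand(8, 8) < 0.2, np.random.randn(8, 8), 0.0)
enc_T = csb_transpose(csb_encode(W))   # represents W.T, no data reshuffle
```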
Scheduling
[Diagram: weight blocks laid out over a grid of input channels × output channels.
Inference: each PE calculates 4 output channels; blocks are loaded in row-major order (1, 2, 3, 4, 5, 6, 7, 8, 9 across input channels).
Back-propagate: each PE calculates 4 input channels; the same blocks are loaded in transposed, column-major order (1, 4, 7, 2, 5, 8, 3, 6, 9)]
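A sketch of the two load orders (block indexing and PE assignment are illustrative): the same weight blocks are streamed row-major for inference and column-major for back-propagation, so each PE always owns a contiguous set of channels:

```python
def load_order(n_in, n_out, phase):
    # Blocks are indexed (input-channel block, output-channel block).
    if phase == "inference":   # PEs own output channels
        return [(i, o) for o in range(n_out) for i in range(n_in)]
    else:                      # back-propagate: PEs own input channels
        return [(i, o) for i in range(n_in) for o in range(n_out)]

print(load_order(3, 3, "inference"))   # (0,0),(1,0),(2,0),(0,1),...
print(load_order(3, 3, "backprop"))    # (0,0),(0,1),(0,2),(1,0),...
```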
Evaluation
˃ FPGA Platform: KCU1500 development board
˃ Hardware parameters:
250 MHz clock
32 PEs, each with 32 MAC units, processing a batch of 32 images
2x DDR4-2400 external memory
˃ The design is bounded by on-chip memory
˃ The accumulation buffer occupies most of the RAMs because of:
its large bit-width (32 bits, compared with 8 bits for data and parameters)
the ping-pong strategy used to hide data-transfer time
˃ UltraRAM may relieve this problem
Evaluation
˃ Performance breakdown (simulation)
[Chart: per-layer performance breakdown; some layers are computation-bounded, others memory-bounded]
Evaluation
˃ Performance comparison with state-of-the-art CNN inference/training accelerators and a GPU

Platform (Chip)       | Function  | Quantization | Sparsity | Performance (GOP/s) | Power (W) | Energy Eff. (GOP/s/W)
TCAD 18 (XC7Z020)     | Inference | fixed 8      | No       | 84.3                | 3.5       | 24.1
FPGA 17 (GX1150)      | Inference | fixed 8/16   | No       | 645.25              | 21.2      | 30.43
ESE (KU060)           | Inference | fixed 12     | Yes      | 2516                | 41        | 61.4
FCNN (Maxeler MPC-X)  | Inference | float 32     | No       | 62.06               | N.A.      | N.A.
FCNN (Maxeler MPC-X)  | Training  | float 32     | No       | 7.01                | 27.3      | 0.27
FPT17 (ZU19EG)        | Training  | float 32     | No       | 86.12               | 14.2      | 6.05
FPDeep (VC709)        | Training  | fixed 16     | No       | 1022 (per FPGA)     | 32        | 31.97
Proposed (KCU1500)    | Inference | fixed 8      | Yes      | 897.5               | 29        | 30.95
Proposed (KCU1500)    | Training  | fixed 8/24   | Yes      | 641.1               | 29        | 22.11
GPU Titan X (GM200)   | Training  | float 32     | No       | 1252                | 150       | 8.4
Future work
˃ More functions for training
BN layers; quantization for other optimizers
˃ Scaling up the design?
Reduce memory consumption by optimizing the memory design
Lower the bandwidth requirement by optimizing the scheduling
We also have HBM
˃ Improve energy efficiency?
More advanced training methods
˃ Evaluation with real training tasks
We never know whether the method applies to a new application until we try it
Adaptable.
Intelligent.