TRANSCRIPT
14.2: DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks© 2017 IEEE International Solid-State Circuits Conference 1 of 39
DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks
Dongjoo Shin, Jinmook Lee, Jinsu Lee,and Hoi-Jun Yoo
Semiconductor System LaboratorySchool of EE, KAIST
Outline
 Introduction
 Processor Architecture
 Key Features
  1. Heterogeneous architecture (configurability)
  2. Mixed workload division method (configurability)
  3. Dynamic fixed-point with on-line adaptation (low power)
  4. Quantization-table-based multiplier (low power)
 Implementation Results
 Conclusion
Why Deep Learning?
 Overwhelming performance
 Unlimited applicability
  – Image classification, action recognition, sign recognition, ...
  – Super resolution, depth map generation, ...
  – Translation, natural language processing, ...
  – Text generation, image captioning, composition, ...
[Chart: computer-vision error rate (0-25%) by year, 2010-2016; deep-learning results cross the human-level line around 2015.]
Introduction
Types of Deep Learning

                  MLP                       CNN                        RNN
                  (multi-layer perceptron)  (convolutional)            (recurrent)
 Characteristic   Fully-connected layers    Convolution and pooling    Feedback path,
                  with hidden units         layers                     internal state
 Major            Simple classification,    Vision processing: face    Sequential data
 Application      handwritten letter        recognition, image         processing: translation,
                  recognition, ...          classification, ...        speech recognition, ...
 Layers           3~10 layers               5~100s layers              3~5 layers
Introduction
Why both CNN & RNN? (Introduction)
 CNN: Visual feature extraction and recognition
  – Face recognition, image classification, ...
 RNN: Sequential data recognition and generation
  – Translation, speech recognition, ...
 CNN + RNN: CNN-extracted features become the RNN input
  – Image captioning ("Two dogs playing in the water"): CNN for visual feature
    extraction, RNN for sentence generation
  – Action recognition: CNN for visual feature extraction, RNN for temporal
    information recognition
 Previous works
  – Optimized for convolution layers only: [1], [2], [3]
  – Optimized for FC layers and RNN only: [4]

[1] Y. Chen, ISSCC 2016  [2] B. Moons, SOVC 2016  [3] J. Sim, ISSCC 2016  [4] S. Han, ISCA 2016
Deep Learning-dedicated SoC: DNPU (Introduction)
 General-purpose processor (control + ALU + cache):
  – Low compute density, high programmability
  – Floating-point data type, complex control logic
 Deep learning-dedicated architecture:
  – High compute density, parallel computation
  – Fixed computation pattern, dedicated memory architecture, dedicated data type
 Moving from higher programmability through parallel processing toward a
   highly dedicated architecture yields higher energy efficiency (operations/W)
 Deep learning itself has high adaptability for various applications
Design Challenges
 Heterogeneous characteristics
  – Conv. layer (CNN): computation >> parameters
  – FC layer (CNN), RNN: computation ≈ parameters
Introduction
Design Challenges
 Accelerating Conv. with FC-RNN-dedicated HW
  – High memory transaction
  – Low PE utilization
  – Mismatch of the computational patterns
 Accelerating FC-RNN with Conv.-dedicated HW
  – Cannot exploit reusability (image, kernel, convolution)
  – Cannot achieve the required throughput
Introduction
Design Challenges
 Various configurations of layers
  – Image size, channel size, kernel (filter) size
 Example: VGG-16 convolution layers
Introduction
Design Challenges
 Large amount of data & massive MACs
Introduction
Proposed Processor: DNPU
 For high configurability:
  1. Heterogeneous architecture
  2. Mixed workload division
 For low-power operation:
  3. Dynamic fixed-point with on-line adaptation
  4. Quantization-table-based multiplication
 An 8.1TOPS/W CNN-RNN processor for general-purpose DNNs
Overall Architecture (Feature 1. Heterogeneous Architecture)
 Convolution processor for convolution layers (CNN):
  – Convolution clusters 0-3 plus an aggregation core
 FC-RNN processor for fully-connected layers (CNN) and recurrent neural networks:
  – Input buffer and weight buffer
  – Quantization-table-based matrix multiplier
  – Accumulation registers and activation function LUT
 Shared: custom instruction set-based controller, instruction memory,
   data memory, and external interfaces
Convolution Processor (Feature 1. Heterogeneous Architecture)
 Distributed memory-based architecture
  – 16 convolution cores connected through a data NoC, an instruction network,
    and a partial-sum network
  – Per core: a 12x4 PE array, image memory (16KB), weight memory (1KB),
    partial-sum registers, an FSM controller, and ReLU/pooling units
FC Layer and RNN Hardware Sharing (Feature 1. Heterogeneous Architecture)
 FC layer – mapping to matrix multiplication
 RNN – mapping to matrix multiplication

   o_m = W_0m·i_0 + W_1m·i_1 + ... + W_nm·i_n + b_m

 Stacking the rows (b_m, W_0m, ..., W_nm) for all outputs o_0 ... o_m turns
   the whole layer into one matrix-vector multiplication over (1, i_0 ... i_n)
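The mapping above can be sketched as a plain matrix-vector product. This is a minimal NumPy illustration of the equation, not the chip's datapath; the sizes and random values are arbitrary:

```python
import numpy as np

# Sketch of the FC-layer mapping o_m = W_0m*i_0 + ... + W_nm*i_n + b_m.
# W[n][m] holds the weight from input n to output m, so the whole layer
# collapses into one matrix-vector multiplication.
rng = np.random.default_rng(0)
n_in, n_out = 4, 3
W = rng.standard_normal((n_in, n_out))
b = rng.standard_normal(n_out)
i = rng.standard_normal(n_in)

o = W.T @ i + b  # all outputs o_0 ... o_m in a single matrix multiplication

# Element-wise check against the summation form of the equation.
ref = [sum(W[k, j] * i[k] for k in range(n_in)) + b[j] for j in range(n_out)]
assert np.allclose(o, ref)
```

An RNN time step has the same shape once the previous hidden state is concatenated onto the input vector, which is why one multiplier array can serve both layer types.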
Limited On-chip Memory Size (Feature 2. Mixed Workload Division Method)
 Conv. operation with limited on-chip memory
  – On-chip memory size: 100s of KB ~ a few MB
  – Trade-off between on-chip memory and on-chip processing elements
  – The workload should be divided to reduce the required on-chip memory size
Image Division (Feature 2. Mixed Workload Division Method)
 Workload division along the image direction
  – The W x H input (Ci channels) is split into W' x H' tiles, each convolved
    with the full Wf x Hf filters for all Co output channels
  – W x H = W' x H' x Div.#
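The bookkeeping behind the division identity can be checked in a few lines; the 2x2 split below is a hypothetical example, not a value from the slides:

```python
# Toy check of the image-division identity W x H = W' x H' x Div.#,
# using a hypothetical 224x224 feature map split into a 2x2 grid of tiles.
W, H = 224, 224
split = 2                    # tiles per axis (assumed example)
W_t, H_t = W // split, H // split
div = split * split          # Div.# = total number of tiles
assert W_t * H_t * div == W * H   # 112 * 112 * 4 == 224 * 224
```

Each tile now needs only a W' x H' working set on chip, at the cost of reloading the weights once per tile.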
Channel Division (Feature 2. Mixed Workload Division Method)
 Workload division along the channel direction
  – Input channels (Ci) and output channels (Co) are processed in groups with
    the corresponding filter slices
Mixed Division (Feature 2. Mixed Workload Division Method)
 Workload division along both directions
  – The input image, weights, and output image are each partitioned into
    image tiles and channel groups at the same time
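Why mixing the two directions helps can be seen with a toy traffic model. This accounting is my own simplification, not the paper's cost model: image division re-reads the weights once per spatial tile, while channel division re-reads the input image once per output-channel group.

```python
# Toy off-chip traffic model (assumption, not the paper's exact accounting)
# for a conv layer with an H x W feature map, Ci input channels, Co output
# channels, and K x K kernels.
def image_div_traffic(H, W, Ci, Co, K, tiles):
    activations = H * W * (Ci + Co)    # input read once, output written once
    weights = K * K * Ci * Co * tiles  # weights reloaded for every tile
    return activations + weights

def channel_div_traffic(H, W, Ci, Co, K, groups):
    activations = H * W * Ci * groups + H * W * Co  # input re-read per group
    weights = K * K * Ci * Co                       # weights streamed once
    return activations + weights

def mixed_div_traffic(H, W, Ci, Co, K, n):
    # Mixed division can pick the cheaper direction layer by layer.
    return min(image_div_traffic(H, W, Ci, Co, K, n),
               channel_div_traffic(H, W, Ci, Co, K, n))

# Early VGG-16-like layer (big image, few channels): image division wins.
early = (224, 224, 3, 64, 3)
assert image_div_traffic(*early, tiles=4) < channel_div_traffic(*early, groups=4)

# Late layer (small image, many channels): channel division wins.
late = (14, 14, 512, 512, 3)
assert channel_div_traffic(*late, groups=4) < image_div_traffic(*late, tiles=4)
```

Under this model, mixed division always sits at or below both single-direction curves, matching the trend in the off-chip access analysis that follows.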
Variation Through Layers (Feature 2. Mixed Workload Division Method)
 VGG-16 convolution layer example
  – Early layers: large image size
  – Late layers: large number of channels
Off-chip Access Analysis (Feature 2. Mixed Workload Division Method)
 Mixed division can take lower points

[Chart: off-chip memory access (MB, 0-50) per convolutional layer (1-13)
for image division, channel division, and mixed division.]
Layer-by-Layer Data Distribution (Feature 3. Dynamic Fixed-point with On-line Adaptation)
 Data distribution in each layer
 Floating-point implementation
  – Large data range, but high cost
 Fixed-point implementation
  – Low cost, but limited data range

[Chart: per-layer data distributions showing density and maximum value.]
Layer-by-Layer Dynamic Fixed-point (Feature 3. Dynamic Fixed-point with On-line Adaptation)
 Each layer has a different word length (WL) and fraction length (FL)
   Floating-point-like characteristic across layers
 WL and FL are fixed within a layer  Enables fixed-point operation
 A word splits at the fraction point into integer bits and fractional bits
  – Word length: WL = 4 ~ 16
  – Fraction length: FL = -15 ~ 16
 Example:
  – Conv. layer 1 – WL: 8, FL: 1
  – Conv. layer 2 – WL: 8, FL: 2
  – Conv. layer 3 – WL: 4, FL: -2
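A minimal sketch of the (WL, FL) number format: a real value v is stored as round(v * 2**FL) in a WL-bit signed integer, so a negative FL scales large values down into range. The rounding and saturation behavior here is my assumption, not necessarily the chip's:

```python
# Convert between real values and a (WL, FL) dynamic fixed-point code.
def to_fixed(v, WL, FL):
    q = round(v * 2 ** FL)
    lo, hi = -(2 ** (WL - 1)), 2 ** (WL - 1) - 1
    return max(lo, min(hi, q))   # saturate on overflow (assumed policy)

def from_fixed(q, FL):
    return q / 2 ** FL

# WL=8, FL=2: 3.3 is stored as round(3.3 * 4) = 13, read back as 3.25.
assert from_fixed(to_fixed(3.3, WL=8, FL=2), FL=2) == 3.25
# WL=8, FL=-2: 300.0 is stored as 75 and read back exactly, thanks to the
# negative fraction length enlarging the representable range.
assert from_fixed(to_fixed(300.0, WL=8, FL=-2), FL=-2) == 300.0
```

Per-layer (WL, FL) pairs let each layer trade precision against range independently, which is the floating-point-like behavior the slide describes.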
Previous Approach (Feature 3. Dynamic Fixed-point with On-line Adaptation)
 Off-line (off-chip) learning-based FL selection
  – FL is trained to fit a given image data set
  – The selected FL is used for every image at run time
  – Candidate FL sets (L: layer), e.g. Set1 – L1: 0, L2: 0, L3: -4, ...;
    Set2 – L1: 1, L2: -1, L3: -5, ...; Set3 – L1: 1, L2: -3, L3: -6, ...
  – The set showing the minimum error on the given image data set is chosen
Proposed On-line Adaptation (Feature 3. Dynamic Fixed-point with On-line Adaptation)
 On-line (on-chip) adaptation-based FL selection
  – FL is dynamically fit to the current input image
  – A threshold on the observed data triggers the FL update
   No off-chip learning, lower required WL
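The slide only names a threshold; the overflow-ratio rule below is my illustration of how such an adaptation could work, not the chip's exact mechanism:

```python
# Hypothetical on-line FL adaptation: count values that overflow the current
# (WL, FL) representation; past a threshold ratio, move the fraction point to
# enlarge the representable range for the next input.
def adapt_FL(values, WL, FL, threshold=0.01):
    max_mag = 2 ** (WL - 1) - 1
    overflows = sum(1 for v in values if abs(round(v * 2 ** FL)) > max_mag)
    if overflows / len(values) > threshold:
        return FL - 1   # fewer fractional bits -> larger integer range
    return FL

assert adapt_FL([100.0] * 10, WL=8, FL=2) == 1  # 100 * 2**2 = 400 > 127
assert adapt_FL([1.0] * 10, WL=8, FL=2) == 2    # no overflow, FL unchanged
```

Because the adjustment uses only statistics of the current image, no off-chip retraining is needed, and the format can track each input's dynamic range.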
Performance Comparisons (Feature 3. Dynamic Fixed-point with On-line Adaptation)
 Image classification results
  – Model: VGG-16, dataset: ImageNet, baseline: 32-bit floating point

[Chart: accuracy (0-0.8) vs. word length (2-14 bits) for fixed-point,
dynamic fixed-point, and the proposed dynamic fixed-point with adaptation.]
LUT-based Reconfigurable Multiplier (Feature 3. Dynamic Fixed-point with On-line Adaptation)

   O(x, y) = Σ_i Σ_j I(x + i − 1, y + j − 1) × K(i, j)

 Input image (I): 224 x 224, kernel (K): 3 x 3
  50,176 multiplications for each kernel weight
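The reuse count above is one line of arithmetic: each kernel weight multiplies one input pixel per output position, so with a same-size 224 x 224 output every weight is reused 50,176 times, which is what makes precomputing its products worthwhile:

```python
# Multiplications per kernel weight for a 224x224 output feature map.
image_w = image_h = 224
mults_per_weight = image_w * image_h
assert mults_per_weight == 50176   # matches the figure quoted on the slide
```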
LUT-based Reconfigurable Multiplier (Feature 3. Dynamic Fixed-point with On-line Adaptation)
Weight Quantization (Feature 4. Quantization Table-based Multiplier)
Quantization Table Construction (Feature 4. Quantization Table-based Multiplier)
 Pre-computation with each quantized weight
  – Weight mapping table: 16 quantized weights w0 ... w15, addressed by a
    4-bit index (0 ... 15)
  – For each input register I0 ... I7, precompute the products with all 16
    quantized weights: the Q-table for I0 holds I0 x w0 ... I0 x w15, and
    likewise for I1 ... I7
  – Multiplication results are thereby also quantized to 16 values
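The construction can be sketched in a few lines of NumPy. The weight levels and input values below are placeholders of mine, not the chip's; the point is the 8 x 16 table of precomputed products:

```python
import numpy as np

# Sketch of quantization-table construction: weights quantized to 16 levels
# (4-bit index); for each of the 8 input registers, precompute its product
# with every level, so later "multiplications" are just table lookups.
levels = np.linspace(-1.0, 1.0, 16)                              # w0 ... w15 (assumed)
inputs = np.array([0.5, -0.25, 0.75, 1.0, -1.0, 0.1, 0.0, 0.3])  # I0 ... I7 (assumed)

qtable = np.outer(inputs, levels)   # qtable[i][w] = I_i * w_w, 8 x 16 entries

def lut_multiply(input_idx, weight_idx):
    # Decode the 4-bit weight index and load the pre-computed result.
    return qtable[input_idx, weight_idx]

assert lut_multiply(2, 15) == inputs[2] * levels[15]   # I2 x w15
```

With 8 inputs and 16 levels, only 128 real multiplications are needed up front; every subsequent use of those inputs is a lookup, which is how the design shifts 99% of multiplications into the table.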
Multiplication with Quantization Table (Feature 4. Quantization Table-based Multiplier)
 Decode the index to load the pre-computed result
Quantization Table-based Multiplier (Feature 4)
 Off-chip memory BW
  – Quantization: 16-bit weight  4-bit index
  – Zero skipping: the weight load is skipped for zero inputs
 Multiplications
  – 99% of multiplications  table look-up
Chip Photograph and Summary (Implementation Results)
Measurement Results (Implementation Results)

[Charts: measured clock frequency (MHz), power (mW), and energy efficiency (TOPS/W).]
Measurement Results (Implementation Results)
 Dynamic fixed-point with on-line adaptation
Demonstration Video (Implementation Results)
Performance Comparison (Implementation Results)
 Comparison with convolution processors
   [1] Y. Chen, et al., ISSCC 2016; [2] B. Moons, et al., SOVC 2016
Performance Comparison (Implementation Results)
 Comparison with the FC layer processor
   [4] S. Han, et al., ISCA 2016
 The proposed work is the only one that supports both CNN and RNN
Conclusion
 An energy-efficient CNN-RNN processor SoC is proposed for general-purpose DNNs
 Key features:
  – Supports both CNN and RNN
  – Mixed workload division method
  – Dynamic fixed-point with on-line adaptation
  – Quantization-table-based multiplier
 DNPU: an 8.1TOPS/W reconfigurable CNN-RNN processor SoC