TRANSCRIPT
DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator
for the Embedded Masses
Paul N. Whatmough (1,2)
S. K. Lee (2), N. Mulholland (2), P. Hansen (2), S. Kodali (3), D. Brooks (2), G.-Y. Wei (2)
(1) ARM Research, Boston, MA  (2) Harvard University, Cambridge, MA  (3) Princeton University, NJ
The age of deep learning
- More compute
- Bigger data
- Deeper nets
The embedded masses
- Interpret noisy real-world data from sensor-rich systems
- Keep inference at the edge: always-on sensing
- Large memory footprint and compute load
DNN inference at the edge
[Figure: inference pipeline]
Raw Data -> Pre-Processing (normalize: mean = 0, variance = 1) -> Feature Extraction (FFT, MFCC, Sobel, HOG, etc.) -> Classifier (fully-connected deep neural network (DNN), using trained weights) -> Classes
DNN classifier design flow
- Data: training, validation, and test sets
- Training: optimize hyper-parameters; minimize model size and test error
- Quantization: fixed-point 8-bit / 16-bit
- Inference: CPU, DSP, GPU, or accelerator
[Reagen et al., ISCA 2016]
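To make the quantization step concrete, here is a minimal C sketch of float-to-fixed-point conversion; the function name, Q-format parameter, and saturation behavior are illustrative assumptions, not the flow's exact scheme (the 16-bit case is analogous).

```c
#include <stdint.h>
#include <math.h>

/* Quantize a float to signed fixed-point with `frac` fractional bits,
 * saturating to the 8-bit range. */
static int8_t quantize_q8(float x, int frac)
{
    float scaled = roundf(x * (float)(1 << frac));
    if (scaled >  127.0f) scaled =  127.0f;  /* saturate high */
    if (scaled < -128.0f) scaled = -128.0f;  /* saturate low  */
    return (int8_t)scaled;
}
```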
DNN ENGINE accelerator architecture
[Figure: DNN ENGINE block diagram. Sensor data forms the input vector; the engine is programmed with the DNN topology and weights; weights and neurons in the hidden layers produce outputs, and an argmax over the output classes yields the classification.]
Three architecture optimizations:
- Parallelism and data reuse
- Sparse data and small data types
- Algorithmic resilience
Outline
- Background and motivation
- DNN ENGINE: parallelism and data reuse; sparse data and small data types; algorithmic resilience
- Measurement results
- Summary
Fully-connected DNN graph
[Figure: animated walk-through of a fully-connected DNN graph — an input layer fed by the input vector, hidden layers, and an output layer producing the output classes; nodes are neurons, edges are weights.]
- A SIMD window processes a group of neurons in parallel
- Activation data is re-used across the neurons in the window
- Each neuron accumulates weighted activations, then adds a bias and applies the activation function
- The process repeats layer by layer until the output classes are produced
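To make this schedule concrete, here is a minimal C sketch of one fully-connected layer with an 8-neuron SIMD window; the function name, 16-bit data types, and the fixed-point rescale are illustrative assumptions rather than the engine's exact datapath.

```c
#include <stdint.h>

#define SIMD 8  /* neurons processed in parallel per window */

/* One fully-connected layer: each activation is fetched once and
 * re-used across all SIMD neurons in the window (data reuse).
 * Assumes n_out is a multiple of SIMD; saturation and the exact
 * fixed-point rescale (here >> 8) are simplified. */
void fc_layer(const int16_t *act, int n_in,
              const int16_t *w,        /* n_out x n_in, row-major */
              const int32_t *bias,
              int16_t *out, int n_out)
{
    for (int base = 0; base < n_out; base += SIMD) {
        int32_t acc[SIMD] = {0};
        for (int i = 0; i < n_in; i++) {
            int16_t a = act[i];               /* fetched once...      */
            for (int j = 0; j < SIMD; j++)    /* ...reused SIMD times */
                acc[j] += (int32_t)a * w[(base + j) * n_in + i];
        }
        for (int j = 0; j < SIMD; j++) {
            int32_t v = acc[j] + bias[base + j];            /* add bias */
            out[base + j] = v > 0 ? (int16_t)(v >> 8) : 0;  /* ReLU     */
        }
    }
}
```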
Balancing efficiency and bandwidth
[Figure: efficiency vs. memory bandwidth, with "no data reuse" and "bandwidth too high" regions marked against a 128b/cycle limit.]
- Parallelism increases throughput and data reuse
- But it also increases memory bandwidth demands
- 8-way SIMD is efficient at reasonable memory BW: 10x activation reuse with a 128-bit AXI channel (8 weights/cycle x 16 bits = 128 bits/cycle)
DNN ENGINE micro-architecture (8-way SIMD)
[Figure: block diagram — the host processor connects through a DMA (Data In / Data Out) to the CFG registers and buffers (IPBUF, XBUF, NBUF); the pipeline comprises Data Read, Weight Load from W-MEM, the MAC datapath, Activation, and Writeback.]
Operation:
1. The host processor loads the configuration and input data
2. The engine accumulates products of activations and weights
3. It adds the bias, applies the ReLU activation, and writes back
4. An IRQ notifies the host, which retrieves the output data
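A minimal host-side sketch of the four steps above, assuming a memory-mapped interface; the register addresses, names, and done-polling protocol are hypothetical stand-ins, not the real DNN ENGINE programmer's model.

```c
#include <stdint.h>

/* Hypothetical memory-mapped registers (illustrative addresses). */
#define DNN_CFG      ((volatile uint32_t *)0x40000000)
#define DNN_DATA_IN  ((volatile uint32_t *)0x40001000)
#define DNN_DATA_OUT ((volatile uint32_t *)0x40002000)
#define DNN_START    ((volatile uint32_t *)0x40003000)
#define DNN_STATUS   ((volatile uint32_t *)0x40003004)
#define DNN_DONE     0x1u

void run_inference(const uint32_t *cfg, int cfg_words,
                   const int16_t *in, int n_in,
                   int16_t *out, int n_out)
{
    for (int i = 0; i < cfg_words; i++)    /* 1. load configuration...  */
        DNN_CFG[i] = cfg[i];
    for (int i = 0; i < n_in; i++)         /*    ...and input data      */
        DNN_DATA_IN[i] = (uint32_t)(uint16_t)in[i];
    *DNN_START = 1;                        /* 2./3. engine runs MACs,   */
                                           /* bias, ReLU, and writeback */
    while ((*DNN_STATUS & DNN_DONE) == 0)  /* 4. wait for completion    */
        ;                                  /* (polling stands in for IRQ) */
    for (int i = 0; i < n_out; i++)
        out[i] = (int16_t)(uint16_t)DNN_DATA_OUT[i]; /* retrieve outputs */
}
```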
Outline
- Background and motivation
- DNN ENGINE: parallelism and data reuse; sparse data and small data types; algorithmic resilience
- Measurement results
- Summary
Exploiting sparse data
[Figure: animated fully-connected DNN graph illustrating activation sparsity.]
- Discard small activation data
- Dynamically prune the graph connectivity
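A minimal sketch of the pruning idea: activations below a threshold are discarded, and only surviving (index, value) pairs are kept, so the weight fetches and MACs for pruned connections can be skipped. The threshold choice and list format are illustrative assumptions.

```c
#include <stdint.h>

/* Keep only activations above a threshold, emitting (index, value)
 * pairs for downstream skipping of pruned connections. */
int prune_activations(const int16_t *act, int n, int16_t thresh,
                      int16_t *val, uint16_t *idx)
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (act[i] > thresh) {    /* post-ReLU activations are >= 0 */
            val[m] = act[i];
            idx[m] = (uint16_t)i;
            m++;
        }
    }
    return m;  /* number of surviving activations */
}
```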
DNN ENGINE micro-architecture (sparsity support)
[Figure: block diagram extended with an Index stage on the Activation/DMA path and a SKIP stage on the Data Read path.]
- W-MEM supports 8b or 16b data types
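To illustrate the dual data-type support, here is a sketch of unpacking one 128-bit W-MEM beat as either 8 x 16-bit or 16 x 8-bit signed weights; the little-endian packing and the mode flag are assumptions, not the documented memory format.

```c
#include <stdint.h>

/* Unpack a 128-bit weight-memory beat in either 8b or 16b mode. */
void unpack_weights(const uint8_t beat[16], int use_8bit,
                    int32_t w[16], int *count)
{
    if (use_8bit) {
        for (int i = 0; i < 16; i++)
            w[i] = (int8_t)beat[i];          /* sign-extend 8b weights */
        *count = 16;
    } else {
        for (int i = 0; i < 8; i++)          /* assemble 16b weights   */
            w[i] = (int16_t)(beat[2 * i] | (beat[2 * i + 1] << 8));
        *count = 8;
    }
}
```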
Outline
- Background and motivation
- DNN ENGINE: parallelism and data reuse; sparse data and small data types; algorithmic resilience
- Measurement results
- Summary
DNN ENGINE micro-architecture (timing-error resilience)
[Figure: block diagram with Razor-style (RZ) error-detection registers added at NBUF and XBUF.]
[Whatmough et al., ISSCC'17]
Timing error tolerance
[Figure: tolerated timing violation rates (10^-7 up to 10^-1) for memory, datapath, and combined error injection, measured on MNIST at 98.36% accuracy; the chart highlights a >1,000,000x improvement in tolerable violation rate.]
[Whatmough et al., ISSCC'17]
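The algorithmic-resilience claim can be explored in software with a simple fault-injection experiment: perturb the weights and re-measure accuracy. The sketch below uses uniform random bit flips, which is a simplification of real timing-error behavior; the function name and error model are assumptions.

```c
#include <stdint.h>
#include <stdlib.h>

/* Flip each weight bit independently with probability p_flip; the
 * network is then re-evaluated to measure the accuracy impact. */
void inject_bit_errors(int16_t *w, int n, double p_flip)
{
    for (int i = 0; i < n; i++)
        for (int b = 0; b < 16; b++)
            if ((double)rand() / RAND_MAX < p_flip)
                w[i] ^= (int16_t)(1u << b);
}
```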
Outline
- Background and motivation
- DNN ENGINE: parallelism and data reuse; sparse data and small data types; algorithmic resilience
- Measurement results
- Summary
16nm SoC for always-on applications
[Figure: 16nm test chip.]
Measured energy for always-on applications
[Figure: measured energy per prediction at nominal Vdd / signoff Fmax (667 MHz / 0.8 V); one annotated point spends 2.5x the energy for a 0.3% accuracy gain.]
- Energy varies with the application, and grows exponentially with accuracy
- On-chip memory is critical: fetching even a few KBs from DRAM costs >1 uJ, which constrains model size
Measured energy improvement
[Figure: measured energy at 667 MHz / 0.8 V vs. 667 MHz / 0.67 V, showing a 10x reduction.]
- Joint architecture and circuit optimizations
- Typical silicon at room temperature
- Measurements demonstrate a 10x energy reduction and a 4x throughput increase
State-of-the-art classifiers
- Dedicated NN accelerators at different performance points, in different technologies
- Accuracy is the critical metric: a linear classifier reaches 88%, and further gains come at exponential cost
- The next 10x improvement: algorithm innovations?
Summary
- Inference is moving from the cloud to the device: always-on sensing
- DNN ENGINE architecture optimizations: parallelism and data reuse; sparse data and small data types; algorithmic resilience
- 16nm test chip measurements: critical to store the model in on-chip memory; 10x energy and 4x throughput improvement; 1 uJ per prediction for always-on applications

We are grateful for support from the DARPA CRAFT and PERFECT projects.