dnn engine: a 16nm sub-uj deep neural network ......dnn engine: a 16nm sub -uj dnn inference...

53
DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough 1,2 S. K. Lee 2 , N. Mulholland 2 , P. Hansen 2 , S. Kodali 3 , D. Brooks 2 , G.-Y. Wei 2 1 ARM Research, Boston, MA 2 Harvard University, Cambridge, MA 3 Princeton University, NJ

Upload: others

Post on 18-Feb-2020

18 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator

for the Embedded Masses

Paul N. Whatmough1,2

S. K. Lee2, N. Mulholland2, P. Hansen2, S. Kodali3, D. Brooks2, G.-Y. Wei2

1ARM Research, Boston, MA2Harvard University, Cambridge, MA3Princeton University, NJ

Page 2: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

The age of deep learning

More Compute

Bigger Data

Deeper Nets

Page 3: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

The embedded masses

3

Interpret noisy real-world data from sensor-rich systems

Keep inference at the edge: always-on sensing

Large memory footprint and compute load

Page 4: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

DNN inference at the edge

Normalizemean=0, variance=1

Pre-Processing Feature Extraction Classifier

Trained weights

Features

FFT, MFCC, Sobel, HOG, etc

Fully-connected Deep neural network

(DNN)

Norm. DataRaw Data Classes

Page 5: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

DNN classifier design flow

Data- Training, validation, test

Training- Optimize hyper-parameters- Minimize size and test error

Quantization- Fixed-point 8-bit / 16-bit

Inference- CPU, DSP, GPU, Accelerator

5

[Reagen et al., ISCA 2016]

Page 6: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

DNN ENGINE accelerator architecture

Inpu

t Vec

tor

Out

put C

lass

es

DNN ENGINE

Hidden Layers

Outputs

+DNN Topology

+Weights

Inputs

Sensor Data

WeightNeuron Classification

Argmax

Parallelism and data reuse

Sparse data and small data types

Algorithmic resilience

Page 7: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Outline

Background and motivation

DNN ENGINE- Parallelism and data reuse- Sparse data and small data types- Algorithmic resilience

Measurement Results

Summary

7

Page 8: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Fully-connected DNN graph

Inpu

t Vec

tor

Out

put C

lass

es

Hidden Layers

Input Layer Output Layer

8

Page 9: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Fully-connected DNN graph

Inpu

t Vec

tor

Out

put C

lass

es

NeuronWeight

9

Page 10: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Fully-connected DNN graph

Inpu

t Vec

tor

Out

put C

lass

es

SIMD Window

10

Process a group of neurons in parallel

Page 11: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Fully-connected DNN graph

Inpu

tVec

tor

Out

put C

lass

es

Weights

11

Re-use of activation data across neurons

Page 12: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Fully-connected DNN graph

Inpu

t Vec

tor

Out

put C

lass

es

12

Page 13: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Fully-connected DNN graph

Inpu

t Vec

tor

Out

put C

lass

es

13

Page 14: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Fully-connected DNN graph

Inpu

t Vec

tor

Out

put C

lass

es

14

Page 15: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Fully-connected DNN graph

Inpu

t Vec

tor

Out

put C

lass

es

15

Add bias and apply activation function

Page 16: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Fully-connected DNN graph

Inpu

t Vec

tor

Out

put C

lass

es

16

Page 17: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Fully-connected DNN graph

Inpu

t Vec

tor

Out

put C

lass

es

17

Page 18: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Fully-connected DNN graph

Inpu

t Vec

tor

Out

put C

lass

es

18

Page 19: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Fully-connected DNN graph

Inpu

t Vec

tor

Out

put C

lass

es

19

Page 20: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Balancing efficiency and bandwidthNo Data Reuse

20

Parallelism increases throughput and data reuse

Page 21: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Balancing efficiency and bandwidthBandwidth

too high

Parallelism increases throughput and data reuse- But also increases Memory Bandwidth demands

No Data Reuse

21

Page 22: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Balancing efficiency and bandwidth

Parallelism increases throughput and data reuse- But also increases Memory Bandwidth demands

8-Way SIMD is efficient at reasonable memory BW- 10x Activation reuse, with 128b AXI channel

Bandwidth too high

No Data Reuse

128b

/cyc

le

22

Page 23: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

DNN ENGINE micro-architecture

MAC DatapathWeight Load

NBUF

IPBUF

XBUF

Data In

Data Out

CFG

ActivationDMA

Host

Pro

cess

or

Writeback

Data Read

W-MEM

23

8-Way SIMD accelerator architecture

Page 24: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

MAC DatapathWeight Load

NBUF

IPBUF

XBUF

Data In

Data Out

CFG

ActivationDMA

Host

Pro

cess

or

Writeback

Data Read

W-MEM

DNN ENGINE micro-architecture

Host Processor loads configuration and input data

24

Page 25: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

MAC DatapathWeight Load

NBUF

IPBUF

XBUF

Data In

Data Out

CFG

ActivationDMA

Host

Pro

cess

or

Writeback

Data Read

W-MEM

DNN ENGINE micro-architecture

Accumulate products of Activation and Weights

25

Page 26: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

MAC DatapathWeight Load

NBUF

IPBUF

XBUF

Data In

Data Out

CFG

ActivationDMA

Host

Pro

cess

or

Writeback

Data Read

W-MEM

DNN ENGINE micro-architecture

Add bias, apply ReLU activation and writeback

26

Page 27: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

MAC DatapathWeight Load

NBUF

IPBUF

XBUF

Data In

Data Out

CFG

ActivationDMA

Host

Pro

cess

or

Writeback

Data Read

W-MEM

DNN ENGINE micro-architecture

IRQ to host, which retrieves output data

27

Page 28: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Outline

Background and motivation

DNN ENGINE- Parallelism and data reuse- Sparse data and small data types- Algorithmic resilience

Measurement Results

Summary

28

Page 29: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Exploiting sparse data

Inpu

t Vec

tor

Out

put C

lass

es

29

Page 30: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Exploiting sparse data

Inpu

t Vec

tor

Out

put C

lass

es

30

Discard small activation data

Page 31: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Exploiting sparse data

Inpu

t Vec

tor

Out

put C

lass

es

31

Dynamically prune graph connectivity

Page 32: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Exploiting sparse data

Inpu

t Vec

tor

Out

put C

lass

es

32

Page 33: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Exploiting sparse data

Inpu

t Vec

tor

Out

put C

lass

es

33

Page 34: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Exploiting sparse data

Inpu

t Vec

tor

Out

put C

lass

es

34

Page 35: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Exploiting sparse data

Inpu

t Vec

tor

Out

put C

lass

es

35

Discard small activation data

Page 36: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Exploiting sparse data

Inpu

t Vec

tor

Out

put C

lass

es

36

Dynamically prune graph connectivity

Page 37: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Exploiting sparse data

Inpu

t Vec

tor

Out

put C

lass

es

37

Page 38: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Exploiting sparse data

Inpu

t Vec

tor

Out

put C

lass

es

38

Page 39: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Exploiting sparse data

Inpu

t Vec

tor

Out

put C

lass

es

39

Page 40: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Exploiting sparse data

Inpu

t Vec

tor

Out

put C

lass

es

40

Page 41: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

DNN ENGINE micro-architecture

MAC DatapathWeight Load

NBUF

IPBUF

XBUF

Data In

Data Out

CFG

ActivationDMA

Host

Pro

cess

or

Writeback

Data Read

W-MEM

41

Page 42: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

DNN ENGINE micro-architecture

MAC DatapathWeight Load

NBUF

IPBUF

XBUF

Data In

Data Out

CFG

ActivationDMAIndex

Host

Pro

cess

or

Writeback

SKIPData Read

W-MEM

42

Page 43: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

DNN ENGINE micro-architecture

MAC DatapathWeight Load

NBUF

IPBUF

XBUF

Data In

Data Out

CFG

ActivationDMAIndex

Host

Pro

cess

or

Writeback

SKIPData Read

W-MEM

43

W-MEM supports 8b or 16b data types

Page 44: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Outline

Background and motivation

DNN ENGINE- Parallelism and data reuse- Sparse data and small data types- Algorithmic resilience

Measurement Results

Summary

44

Page 45: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

MAC DatapathWeight Load

RZNBUF

IPBUF

XBUF

RZ

Data In

Data Out

CFG

ActivationDMAIndex

Host

Pro

cess

or

Writeback

SKIPData Read

W-MEM

DNN ENGINE micro-architecture

45

[Whatmough et al., ISSCC’17]

Page 46: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Timing error tolerance

10-1

10-3

10-5

10-7

MNIST @ 98.36% AccuracyMemory Datapath Combined

Tim

ing

Viol

atio

n R

ate

TC SM BM TC SM TB TC ALL

> 1,

000,

000

X

46

[Whatmough et al., ISSCC’17]

Page 47: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Outline

Background and motivation

DNN ENGINE- Parallelism and data reuse- Sparse data and small data types- Algorithmic resilience

Measurement Results

Summary

47

Page 48: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

16nm SoC for always-on applications

48

Page 49: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

16nm SoC for always-on applications

49

Page 50: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Measured energy for always-on applications

Nominal Vdd / signoff Fmax

Energy varies with application Exponential with accuracy

On-chip memory critical A few KBs from DRAM >1uJ Constrains model size

50

667MHz / 0.8V

2.5x energy0.3% accuracy

Page 51: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Measured energy improvement

Joint architecture and circuit optimizations

Typical silicon at room temp

Measurements demonstrate 10x energy reduction 4x throughput increase

51

10x

667MHz / 0.8V 667MHz / 0.67V

Page 52: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

State of the art classifiers

52

Dedicated NN accelerators Different performance points Different technologies

Accuracy critical metric Linear classifier 88% Exponential cost

The next 10x improvement Algorithm innovations?

Page 53: DNN ENGINE: A 16nm Sub-uJ Deep Neural Network ......DNN ENGINE: A 16nm Sub -uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough1,2 S. K. Lee 2, N. Mulholland , P

Summary

Inference moving from cloud to the device: always-on sensing DNN ENGINE architecture optimizations

- Parallelism and data reuse- Sparse data and small data types- Algorithmic resilience

16nm test chip measurements- Critical to store the model in on-chip memory- 10x energy and 4x throughput improvement- 1uJ per prediction for always-on applications

We are grateful for support from DARPA CRAFT and PERFECT projects

53