TRANSCRIPT
DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator
for the Embedded Masses
Paul N. Whatmough (1,2)
S. K. Lee (2), N. Mulholland (2), P. Hansen (2), S. Kodali (3), D. Brooks (2), G.-Y. Wei (2)
(1) ARM Research, Boston, MA  (2) Harvard University, Cambridge, MA  (3) Princeton University, NJ
The age of deep learning
- More compute
- Bigger data
- Deeper nets
The embedded masses
- Interpret noisy real-world data from sensor-rich systems
- Keep inference at the edge: always-on sensing
- Large memory footprint and compute load
DNN inference at the edge
[Figure: inference pipeline]
Raw Data -> Pre-Processing (normalize: mean = 0, variance = 1) -> Feature Extraction (FFT, MFCC, Sobel, HOG, etc.) -> Classifier (fully-connected deep neural network (DNN), using trained weights) -> Classes
DNN classifier design flow
- Data: training, validation, and test sets
- Training: optimize hyper-parameters; minimize model size and test error
- Quantization: fixed-point 8-bit / 16-bit
- Inference: CPU, DSP, GPU, or accelerator
[Reagen et al., ISCA 2016]
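To make the quantization step concrete, here is a minimal C sketch of float-to-fixed-point conversion; the function name, Q-format parameter, and saturation behavior are illustrative assumptions, not the flow's exact scheme (the 16-bit case is analogous).

```c
#include <stdint.h>
#include <math.h>

/* Quantize a float to signed fixed-point with `frac` fractional bits,
 * saturating to the 8-bit range. */
static int8_t quantize_q8(float x, int frac)
{
    float scaled = roundf(x * (float)(1 << frac));
    if (scaled >  127.0f) scaled =  127.0f;  /* saturate high */
    if (scaled < -128.0f) scaled = -128.0f;  /* saturate low  */
    return (int8_t)scaled;
}
```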
DNN ENGINE accelerator architecture
[Figure: DNN ENGINE block diagram. Sensor data forms the input vector; the engine is programmed with the DNN topology and weights; weights and neurons in the hidden layers produce outputs, and an argmax over the output classes yields the classification.]
Three architecture optimizations:
- Parallelism and data reuse
- Sparse data and small data types
- Algorithmic resilience
Outline
- Background and motivation
- DNN ENGINE: parallelism and data reuse; sparse data and small data types; algorithmic resilience
- Measurement results
- Summary
Fully-connected DNN graph
[Figure: animated walk-through of a fully-connected DNN graph — an input layer fed by the input vector, hidden layers, and an output layer producing the output classes; nodes are neurons, edges are weights.]
- A SIMD window processes a group of neurons in parallel
- Activation data is re-used across the neurons in the window
- Each neuron accumulates weighted activations, then adds a bias and applies the activation function
- The process repeats layer by layer until the output classes are produced
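To make this schedule concrete, here is a minimal C sketch of one fully-connected layer with an 8-neuron SIMD window; the function name, 16-bit data types, and the fixed-point rescale are illustrative assumptions rather than the engine's exact datapath.

```c
#include <stdint.h>

#define SIMD 8  /* neurons processed in parallel per window */

/* One fully-connected layer: each activation is fetched once and
 * re-used across all SIMD neurons in the window (data reuse).
 * Assumes n_out is a multiple of SIMD; saturation and the exact
 * fixed-point rescale (here >> 8) are simplified. */
void fc_layer(const int16_t *act, int n_in,
              const int16_t *w,        /* n_out x n_in, row-major */
              const int32_t *bias,
              int16_t *out, int n_out)
{
    for (int base = 0; base < n_out; base += SIMD) {
        int32_t acc[SIMD] = {0};
        for (int i = 0; i < n_in; i++) {
            int16_t a = act[i];               /* fetched once...      */
            for (int j = 0; j < SIMD; j++)    /* ...reused SIMD times */
                acc[j] += (int32_t)a * w[(base + j) * n_in + i];
        }
        for (int j = 0; j < SIMD; j++) {
            int32_t v = acc[j] + bias[base + j];            /* add bias */
            out[base + j] = v > 0 ? (int16_t)(v >> 8) : 0;  /* ReLU     */
        }
    }
}
```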
Balancing efficiency and bandwidth
[Figure: efficiency vs. memory bandwidth, with "no data reuse" and "bandwidth too high" regions marked against a 128b/cycle limit.]
- Parallelism increases throughput and data reuse
- But it also increases memory bandwidth demands
- 8-way SIMD is efficient at reasonable memory BW: 10x activation reuse with a 128-bit AXI channel (8 weights/cycle x 16 bits = 128 bits/cycle)
DNN ENGINE micro-architecture (8-way SIMD)
[Figure: block diagram — the host processor connects through a DMA (Data In / Data Out) to the CFG registers and buffers (IPBUF, XBUF, NBUF); the pipeline comprises Data Read, Weight Load from W-MEM, the MAC datapath, Activation, and Writeback.]
Operation:
1. The host processor loads the configuration and input data
2. The engine accumulates products of activations and weights
3. It adds the bias, applies the ReLU activation, and writes back
4. An IRQ notifies the host, which retrieves the output data
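A minimal host-side sketch of the four steps above, assuming a memory-mapped interface; the register addresses, names, and done-polling protocol are hypothetical stand-ins, not the real DNN ENGINE programmer's model.

```c
#include <stdint.h>

/* Hypothetical memory-mapped registers (illustrative addresses). */
#define DNN_CFG      ((volatile uint32_t *)0x40000000)
#define DNN_DATA_IN  ((volatile uint32_t *)0x40001000)
#define DNN_DATA_OUT ((volatile uint32_t *)0x40002000)
#define DNN_START    ((volatile uint32_t *)0x40003000)
#define DNN_STATUS   ((volatile uint32_t *)0x40003004)
#define DNN_DONE     0x1u

void run_inference(const uint32_t *cfg, int cfg_words,
                   const int16_t *in, int n_in,
                   int16_t *out, int n_out)
{
    for (int i = 0; i < cfg_words; i++)    /* 1. load configuration...  */
        DNN_CFG[i] = cfg[i];
    for (int i = 0; i < n_in; i++)         /*    ...and input data      */
        DNN_DATA_IN[i] = (uint32_t)(uint16_t)in[i];
    *DNN_START = 1;                        /* 2./3. engine runs MACs,   */
                                           /* bias, ReLU, and writeback */
    while ((*DNN_STATUS & DNN_DONE) == 0)  /* 4. wait for completion    */
        ;                                  /* (polling stands in for IRQ) */
    for (int i = 0; i < n_out; i++)
        out[i] = (int16_t)(uint16_t)DNN_DATA_OUT[i]; /* retrieve outputs */
}
```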
Outline
- Background and motivation
- DNN ENGINE: parallelism and data reuse; sparse data and small data types; algorithmic resilience
- Measurement results
- Summary
Exploiting sparse data
[Figure: animated fully-connected DNN graph illustrating activation sparsity.]
- Discard small activation data
- Dynamically prune the graph connectivity
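A minimal sketch of the pruning idea: activations below a threshold are discarded, and only surviving (index, value) pairs are kept, so the weight fetches and MACs for pruned connections can be skipped. The threshold choice and list format are illustrative assumptions.

```c
#include <stdint.h>

/* Keep only activations above a threshold, emitting (index, value)
 * pairs for downstream skipping of pruned connections. */
int prune_activations(const int16_t *act, int n, int16_t thresh,
                      int16_t *val, uint16_t *idx)
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (act[i] > thresh) {    /* post-ReLU activations are >= 0 */
            val[m] = act[i];
            idx[m] = (uint16_t)i;
            m++;
        }
    }
    return m;  /* number of surviving activations */
}
```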
DNN ENGINE micro-architecture (sparsity support)
[Figure: block diagram extended with an Index stage on the Activation/DMA path and a SKIP stage on the Data Read path.]
- W-MEM supports 8b or 16b data types
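To illustrate the dual data-type support, here is a sketch of unpacking one 128-bit W-MEM beat as either 8 x 16-bit or 16 x 8-bit signed weights; the little-endian packing and the mode flag are assumptions, not the documented memory format.

```c
#include <stdint.h>

/* Unpack a 128-bit weight-memory beat in either 8b or 16b mode. */
void unpack_weights(const uint8_t beat[16], int use_8bit,
                    int32_t w[16], int *count)
{
    if (use_8bit) {
        for (int i = 0; i < 16; i++)
            w[i] = (int8_t)beat[i];          /* sign-extend 8b weights */
        *count = 16;
    } else {
        for (int i = 0; i < 8; i++)          /* assemble 16b weights   */
            w[i] = (int16_t)(beat[2 * i] | (beat[2 * i + 1] << 8));
        *count = 8;
    }
}
```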
Outline
- Background and motivation
- DNN ENGINE: parallelism and data reuse; sparse data and small data types; algorithmic resilience
- Measurement results
- Summary
DNN ENGINE micro-architecture (timing-error resilience)
[Figure: block diagram with Razor-style (RZ) error-detection registers added at NBUF and XBUF.]
[Whatmough et al., ISSCC'17]
Timing error tolerance
[Figure: tolerated timing violation rates (10^-7 up to 10^-1) for memory, datapath, and combined error injection, measured on MNIST at 98.36% accuracy; the chart highlights a >1,000,000x improvement in tolerable violation rate.]
[Whatmough et al., ISSCC'17]
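The algorithmic-resilience claim can be explored in software with a simple fault-injection experiment: perturb the weights and re-measure accuracy. The sketch below uses uniform random bit flips, which is a simplification of real timing-error behavior; the function name and error model are assumptions.

```c
#include <stdint.h>
#include <stdlib.h>

/* Flip each weight bit independently with probability p_flip; the
 * network is then re-evaluated to measure the accuracy impact. */
void inject_bit_errors(int16_t *w, int n, double p_flip)
{
    for (int i = 0; i < n; i++)
        for (int b = 0; b < 16; b++)
            if ((double)rand() / RAND_MAX < p_flip)
                w[i] ^= (int16_t)(1u << b);
}
```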
Outline
- Background and motivation
- DNN ENGINE: parallelism and data reuse; sparse data and small data types; algorithmic resilience
- Measurement results
- Summary
16nm SoC for always-on applications
[Figure: 16nm test chip.]
Measured energy for always-on applications
[Figure: measured energy per prediction at nominal Vdd / signoff Fmax (667 MHz / 0.8 V); one annotated point spends 2.5x the energy for a 0.3% accuracy gain.]
- Energy varies with the application, and grows exponentially with accuracy
- On-chip memory is critical: fetching even a few KBs from DRAM costs >1 uJ, which constrains model size
Measured energy improvement
[Figure: measured energy at 667 MHz / 0.8 V vs. 667 MHz / 0.67 V, showing a 10x reduction.]
- Joint architecture and circuit optimizations
- Typical silicon at room temperature
- Measurements demonstrate a 10x energy reduction and a 4x throughput increase
State-of-the-art classifiers
- Dedicated NN accelerators at different performance points, in different technologies
- Accuracy is the critical metric: a linear classifier reaches 88%, and further gains come at exponential cost
- The next 10x improvement: algorithm innovations?
Summary
- Inference is moving from the cloud to the device: always-on sensing
- DNN ENGINE architecture optimizations: parallelism and data reuse; sparse data and small data types; algorithmic resilience
- 16nm test chip measurements: critical to store the model in on-chip memory; 10x energy and 4x throughput improvement; 1 uJ per prediction for always-on applications

We are grateful for support from the DARPA CRAFT and PERFECT projects.