TRANSCRIPT
14.2: DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks© 2017 IEEE International Solid-State Circuits Conference 1 of 39
DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks
Dongjoo Shin, Jinmook Lee, Jinsu Lee,and Hoi-Jun Yoo
Semiconductor System LaboratorySchool of EE, KAIST
Outline
 Introduction
 Processor Architecture
 Key Features
  1. Heterogeneous architecture (configurability)
  2. Mixed workload division method (configurability)
  3. Dynamic fixed-point with on-line adaptation (low power)
  4. Quantization-table-based multiplier (low power)
 Implementation Results
 Conclusion
Why Deep Learning?
 Overwhelming performance
 Unlimited applicability
  – Image classification, action recognition, sign recognition, ...
  – Super resolution, depth map generation, ...
  – Translation, natural language processing, ...
  – Text generation, image captioning, composition, ...
[Chart: computer-vision error rate (0-25%) by year, 2010-2016; deep-learning results cross the human-level line around 2015.]
Introduction
Types of Deep Learning

                  MLP                       CNN                        RNN
                  (multi-layer perceptron)  (convolutional)            (recurrent)
 Characteristic   Fully-connected layers    Convolution and pooling    Feedback path,
                  with hidden units         layers                     internal state
 Major            Simple classification,    Vision processing: face    Sequential data
 Application      handwritten letter        recognition, image         processing: translation,
                  recognition, ...          classification, ...        speech recognition, ...
 Layers           3~10 layers               5~100s layers              3~5 layers
Introduction
Why both CNN & RNN? (Introduction)
 CNN: Visual feature extraction and recognition
  – Face recognition, image classification, ...
 RNN: Sequential data recognition and generation
  – Translation, speech recognition, ...
 CNN + RNN: CNN-extracted features become the RNN input
  – Image captioning ("Two dogs playing in the water"): CNN for visual feature
    extraction, RNN for sentence generation
  – Action recognition: CNN for visual feature extraction, RNN for temporal
    information recognition
 Previous works
  – Optimized for convolution layers only: [1], [2], [3]
  – Optimized for FC layers and RNN only: [4]

[1] Y. Chen, ISSCC 2016  [2] B. Moons, SOVC 2016  [3] J. Sim, ISSCC 2016  [4] S. Han, ISCA 2016
Deep Learning-dedicated SoC: DNPU (Introduction)
 General-purpose processor (control + ALU + cache):
  – Low compute density, high programmability
  – Floating-point data type, complex control logic
 Deep learning-dedicated architecture:
  – High compute density, parallel computation
  – Fixed computation pattern, dedicated memory architecture, dedicated data type
 Moving from higher programmability through parallel processing toward a
   highly dedicated architecture yields higher energy efficiency (operations/W)
 Deep learning itself has high adaptability for various applications
Design Challenges
 Heterogeneous characteristics
  – Conv. layer (CNN): computation >> parameters
  – FC layer (CNN), RNN: computation ≈ parameters
Introduction
Design Challenges
 Accelerating Conv. with FC-RNN-dedicated HW
  – High memory transaction
  – Low PE utilization
  – Mismatch of the computational patterns
 Accelerating FC-RNN with Conv.-dedicated HW
  – Cannot exploit reusability (image, kernel, convolution)
  – Cannot achieve the required throughput
Introduction
Design Challenges
 Various configurations of layers
  – Image size, channel size, kernel (filter) size
 Example: VGG-16 convolution layers
Introduction
Design Challenges
 Large amount of data & massive MACs
Introduction
Proposed Processor: DNPU
 For high configurability:
  1. Heterogeneous architecture
  2. Mixed workload division
 For low-power operation:
  3. Dynamic fixed-point with on-line adaptation
  4. Quantization-table-based multiplication
 An 8.1TOPS/W CNN-RNN processor for general-purpose DNNs
Overall Architecture (Feature 1. Heterogeneous Architecture)
 Convolution processor for convolution layers (CNN):
  – Convolution clusters 0-3 plus an aggregation core
 FC-RNN processor for fully-connected layers (CNN) and recurrent neural networks:
  – Input buffer and weight buffer
  – Quantization-table-based matrix multiplier
  – Accumulation registers and activation function LUT
 Shared: custom instruction set-based controller, instruction memory,
   data memory, and external interfaces
Convolution Processor (Feature 1. Heterogeneous Architecture)
 Distributed memory-based architecture
  – 16 convolution cores connected through a data NoC, an instruction network,
    and a partial-sum network
  – Per core: a 12x4 PE array, image memory (16KB), weight memory (1KB),
    partial-sum registers, an FSM controller, and ReLU/pooling units
FC Layer and RNN Hardware Sharing (Feature 1. Heterogeneous Architecture)
 FC layer – mapping to matrix multiplication
 RNN – mapping to matrix multiplication

   o_m = W_0m·i_0 + W_1m·i_1 + ... + W_nm·i_n + b_m

 Stacking the rows (b_m, W_0m, ..., W_nm) for all outputs o_0 ... o_m turns
   the whole layer into one matrix-vector multiplication over (1, i_0 ... i_n)
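The mapping above can be sketched as a plain matrix-vector product. This is a minimal NumPy illustration of the equation, not the chip's datapath; the sizes and random values are arbitrary:

```python
import numpy as np

# Sketch of the FC-layer mapping o_m = W_0m*i_0 + ... + W_nm*i_n + b_m.
# W[n][m] holds the weight from input n to output m, so the whole layer
# collapses into one matrix-vector multiplication.
rng = np.random.default_rng(0)
n_in, n_out = 4, 3
W = rng.standard_normal((n_in, n_out))
b = rng.standard_normal(n_out)
i = rng.standard_normal(n_in)

o = W.T @ i + b  # all outputs o_0 ... o_m in a single matrix multiplication

# Element-wise check against the summation form of the equation.
ref = [sum(W[k, j] * i[k] for k in range(n_in)) + b[j] for j in range(n_out)]
assert np.allclose(o, ref)
```

An RNN time step has the same shape once the previous hidden state is concatenated onto the input vector, which is why one multiplier array can serve both layer types.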
Limited On-chip Memory Size (Feature 2. Mixed Workload Division Method)
 Conv. operation with limited on-chip memory
  – On-chip memory size: 100s of KB ~ a few MB
  – Trade-off between on-chip memory and on-chip processing elements
  – The workload should be divided to reduce the required on-chip memory size
Image Division (Feature 2. Mixed Workload Division Method)
 Workload division along the image direction
  – The W x H input (Ci channels) is split into W' x H' tiles, each convolved
    with the full Wf x Hf filters for all Co output channels
  – W x H = W' x H' x Div.#
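The bookkeeping behind the division identity can be checked in a few lines; the 2x2 split below is a hypothetical example, not a value from the slides:

```python
# Toy check of the image-division identity W x H = W' x H' x Div.#,
# using a hypothetical 224x224 feature map split into a 2x2 grid of tiles.
W, H = 224, 224
split = 2                    # tiles per axis (assumed example)
W_t, H_t = W // split, H // split
div = split * split          # Div.# = total number of tiles
assert W_t * H_t * div == W * H   # 112 * 112 * 4 == 224 * 224
```

Each tile now needs only a W' x H' working set on chip, at the cost of reloading the weights once per tile.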
Channel Division (Feature 2. Mixed Workload Division Method)
 Workload division along the channel direction
  – Input channels (Ci) and output channels (Co) are processed in groups with
    the corresponding filter slices
Mixed Division (Feature 2. Mixed Workload Division Method)
 Workload division along both directions
  – The input image, weights, and output image are each partitioned into
    image tiles and channel groups at the same time
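Why mixing the two directions helps can be seen with a toy traffic model. This accounting is my own simplification, not the paper's cost model: image division re-reads the weights once per spatial tile, while channel division re-reads the input image once per output-channel group.

```python
# Toy off-chip traffic model (assumption, not the paper's exact accounting)
# for a conv layer with an H x W feature map, Ci input channels, Co output
# channels, and K x K kernels.
def image_div_traffic(H, W, Ci, Co, K, tiles):
    activations = H * W * (Ci + Co)    # input read once, output written once
    weights = K * K * Ci * Co * tiles  # weights reloaded for every tile
    return activations + weights

def channel_div_traffic(H, W, Ci, Co, K, groups):
    activations = H * W * Ci * groups + H * W * Co  # input re-read per group
    weights = K * K * Ci * Co                       # weights streamed once
    return activations + weights

def mixed_div_traffic(H, W, Ci, Co, K, n):
    # Mixed division can pick the cheaper direction layer by layer.
    return min(image_div_traffic(H, W, Ci, Co, K, n),
               channel_div_traffic(H, W, Ci, Co, K, n))

# Early VGG-16-like layer (big image, few channels): image division wins.
early = (224, 224, 3, 64, 3)
assert image_div_traffic(*early, tiles=4) < channel_div_traffic(*early, groups=4)

# Late layer (small image, many channels): channel division wins.
late = (14, 14, 512, 512, 3)
assert channel_div_traffic(*late, groups=4) < image_div_traffic(*late, tiles=4)
```

Under this model, mixed division always sits at or below both single-direction curves, matching the trend in the off-chip access analysis that follows.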
Variation Through Layers (Feature 2. Mixed Workload Division Method)
 VGG-16 convolution layer example
  – Early layers: large image size
  – Late layers: large number of channels
Off-chip Access Analysis (Feature 2. Mixed Workload Division Method)
 Mixed division can take lower points

[Chart: off-chip memory access (MB, 0-50) per convolutional layer (1-13)
for image division, channel division, and mixed division.]
Layer-by-Layer Data Distribution (Feature 3. Dynamic Fixed-point with On-line Adaptation)
 Data distribution in each layer
 Floating-point implementation
  – Large data range, but high cost
 Fixed-point implementation
  – Low cost, but limited data range

[Chart: per-layer data distributions showing density and maximum value.]
Layer-by-Layer Dynamic Fixed-point (Feature 3. Dynamic Fixed-point with On-line Adaptation)
 Each layer has a different word length (WL) and fraction length (FL)
   Floating-point-like characteristic across layers
 WL and FL are fixed within a layer  Enables fixed-point operation
 A word splits at the fraction point into integer bits and fractional bits
  – Word length: WL = 4 ~ 16
  – Fraction length: FL = -15 ~ 16
 Example:
  – Conv. layer 1 – WL: 8, FL: 1
  – Conv. layer 2 – WL: 8, FL: 2
  – Conv. layer 3 – WL: 4, FL: -2
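A minimal sketch of the (WL, FL) number format: a real value v is stored as round(v * 2**FL) in a WL-bit signed integer, so a negative FL scales large values down into range. The rounding and saturation behavior here is my assumption, not necessarily the chip's:

```python
# Convert between real values and a (WL, FL) dynamic fixed-point code.
def to_fixed(v, WL, FL):
    q = round(v * 2 ** FL)
    lo, hi = -(2 ** (WL - 1)), 2 ** (WL - 1) - 1
    return max(lo, min(hi, q))   # saturate on overflow (assumed policy)

def from_fixed(q, FL):
    return q / 2 ** FL

# WL=8, FL=2: 3.3 is stored as round(3.3 * 4) = 13, read back as 3.25.
assert from_fixed(to_fixed(3.3, WL=8, FL=2), FL=2) == 3.25
# WL=8, FL=-2: 300.0 is stored as 75 and read back exactly, thanks to the
# negative fraction length enlarging the representable range.
assert from_fixed(to_fixed(300.0, WL=8, FL=-2), FL=-2) == 300.0
```

Per-layer (WL, FL) pairs let each layer trade precision against range independently, which is the floating-point-like behavior the slide describes.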
Previous Approach (Feature 3. Dynamic Fixed-point with On-line Adaptation)
 Off-line (off-chip) learning-based FL selection
  – FL is trained to fit a given image data set
  – The selected FL is used for every image at run time
  – Candidate FL sets (L: layer), e.g. Set1 – L1: 0, L2: 0, L3: -4, ...;
    Set2 – L1: 1, L2: -1, L3: -5, ...; Set3 – L1: 1, L2: -3, L3: -6, ...
  – The set showing the minimum error on the given image data set is chosen
Proposed On-line Adaptation (Feature 3. Dynamic Fixed-point with On-line Adaptation)
 On-line (on-chip) adaptation-based FL selection
  – FL is dynamically fit to the current input image
  – A threshold on the observed data triggers the FL update
   No off-chip learning, lower required WL
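The slide only names a threshold; the overflow-ratio rule below is my illustration of how such an adaptation could work, not the chip's exact mechanism:

```python
# Hypothetical on-line FL adaptation: count values that overflow the current
# (WL, FL) representation; past a threshold ratio, move the fraction point to
# enlarge the representable range for the next input.
def adapt_FL(values, WL, FL, threshold=0.01):
    max_mag = 2 ** (WL - 1) - 1
    overflows = sum(1 for v in values if abs(round(v * 2 ** FL)) > max_mag)
    if overflows / len(values) > threshold:
        return FL - 1   # fewer fractional bits -> larger integer range
    return FL

assert adapt_FL([100.0] * 10, WL=8, FL=2) == 1  # 100 * 2**2 = 400 > 127
assert adapt_FL([1.0] * 10, WL=8, FL=2) == 2    # no overflow, FL unchanged
```

Because the adjustment uses only statistics of the current image, no off-chip retraining is needed, and the format can track each input's dynamic range.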
Performance Comparisons (Feature 3. Dynamic Fixed-point with On-line Adaptation)
 Image classification results
  – Model: VGG-16, dataset: ImageNet, baseline: 32-bit floating point

[Chart: accuracy (0-0.8) vs. word length (2-14 bits) for fixed-point,
dynamic fixed-point, and the proposed dynamic fixed-point with adaptation.]
LUT-based Reconfigurable Multiplier (Feature 3. Dynamic Fixed-point with On-line Adaptation)

   O(x, y) = Σ_i Σ_j I(x + i − 1, y + j − 1) × K(i, j)

 Input image (I): 224 x 224, kernel (K): 3 x 3
  50,176 multiplications for each kernel weight
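The reuse count above is one line of arithmetic: each kernel weight multiplies one input pixel per output position, so with a same-size 224 x 224 output every weight is reused 50,176 times, which is what makes precomputing its products worthwhile:

```python
# Multiplications per kernel weight for a 224x224 output feature map.
image_w = image_h = 224
mults_per_weight = image_w * image_h
assert mults_per_weight == 50176   # matches the figure quoted on the slide
```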
LUT-based Reconfigurable Multiplier (Feature 3. Dynamic Fixed-point with On-line Adaptation)
Weight Quantization (Feature 4. Quantization Table-based Multiplier)
Quantization Table Construction (Feature 4. Quantization Table-based Multiplier)
 Pre-computation with each quantized weight
  – Weight mapping table: 16 quantized weights w0 ... w15, addressed by a
    4-bit index (0 ... 15)
  – For each input register I0 ... I7, precompute the products with all 16
    quantized weights: the Q-table for I0 holds I0 x w0 ... I0 x w15, and
    likewise for I1 ... I7
  – Multiplication results are thereby also quantized to 16 values
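The construction can be sketched in a few lines of NumPy. The weight levels and input values below are placeholders of mine, not the chip's; the point is the 8 x 16 table of precomputed products:

```python
import numpy as np

# Sketch of quantization-table construction: weights quantized to 16 levels
# (4-bit index); for each of the 8 input registers, precompute its product
# with every level, so later "multiplications" are just table lookups.
levels = np.linspace(-1.0, 1.0, 16)                              # w0 ... w15 (assumed)
inputs = np.array([0.5, -0.25, 0.75, 1.0, -1.0, 0.1, 0.0, 0.3])  # I0 ... I7 (assumed)

qtable = np.outer(inputs, levels)   # qtable[i][w] = I_i * w_w, 8 x 16 entries

def lut_multiply(input_idx, weight_idx):
    # Decode the 4-bit weight index and load the pre-computed result.
    return qtable[input_idx, weight_idx]

assert lut_multiply(2, 15) == inputs[2] * levels[15]   # I2 x w15
```

With 8 inputs and 16 levels, only 128 real multiplications are needed up front; every subsequent use of those inputs is a lookup, which is how the design shifts 99% of multiplications into the table.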
Multiplication with Quantization Table (Feature 4. Quantization Table-based Multiplier)
 Decode the index to load the pre-computed result
Quantization Table-based Multiplier (Feature 4)
 Off-chip memory BW
  – Quantization: 16-bit weight  4-bit index
  – Zero skipping: the weight load is skipped for zero inputs
 Multiplications
  – 99% of multiplications  table look-up
Chip Photograph and Summary (Implementation Results)
Measurement Results (Implementation Results)

[Charts: measured clock frequency (MHz), power (mW), and energy efficiency (TOPS/W).]
Measurement Results (Implementation Results)
 Dynamic fixed-point with on-line adaptation
Demonstration Video (Implementation Results)
Performance Comparison (Implementation Results)
 Comparison with convolution processors
   [1] Y. Chen, et al., ISSCC 2016; [2] B. Moons, et al., SOVC 2016
Performance Comparison (Implementation Results)
 Comparison with the FC layer processor
   [4] S. Han, et al., ISCA 2016
 The proposed work is the only one that supports both CNN and RNN
Conclusion
 An energy-efficient CNN-RNN processor SoC is proposed for general-purpose DNNs
 Key features:
  – Supports both CNN and RNN
  – Mixed workload division method
  – Dynamic fixed-point with on-line adaptation
  – Quantization-table-based multiplier
 DNPU: an 8.1TOPS/W reconfigurable CNN-RNN processor SoC