TRANSCRIPT
Supporting Compressed-Sparse Activations and Weights on SIMD-like Accelerator for Sparse Convolutional Neural Networks
Chien-Yu Lin and Bo-Cheng Lai
Institute of Electronics Engineering, National Chiao Tung University
Convolutional Neural Network
• CNN now dominates visual recognition applications
• Face recognition, object detection, autonomous vehicles…
• Major components: deep convolutional layers
Network Structure of VGG-16
K. Simonyan et al., Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015
http://file.scirp.org/Html/4-7800353_65406.htm
Convolutional Layer
Computations of a conv layer
• A lot of parallel multiplications and additions
CNN Acceleration with SIMD
• MAC units can efficiently perform CNN computations and have thus been adopted by many CNN accelerators [Google TPU, DianNao, Zhang 2015, Cambricon-X]
SIMD-like Architecture
N. P. Jouppi et al., In-Datacenter Performance Analysis of a Tensor Processing Unit, ISCA 2017
T. Chen et al., DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine Learning, ASPLOS 2014
C. Zhang et al., Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks, FPGA 2015
S. Zhang et al., Cambricon-X: An Accelerator for Sparse Neural Networks, MICRO 2016
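As a concrete picture of why MAC-heavy SIMD hardware fits CNNs so well, here is a minimal C sketch of a convolutional layer as nested multiply-accumulate loops; the array names and the no-padding, stride-1 bounds are illustrative assumptions, not from the talk.

/* Minimal sketch of a conv layer as multiply-accumulate (MAC) loops.
 * Layouts: act[C][H][W], wei[K][C][R][S], out[K][Ho][Wo].           */
void conv_layer(const float *act, const float *wei, float *out,
                int C, int H, int W, int K, int R, int S)
{
    int Ho = H - R + 1, Wo = W - S + 1;   /* no padding, stride 1 */
    for (int k = 0; k < K; k++)
        for (int y = 0; y < Ho; y++)
            for (int x = 0; x < Wo; x++) {
                float acc = 0.0f;
                for (int c = 0; c < C; c++)          /* these inner  */
                    for (int r = 0; r < R; r++)      /* MACs are the */
                        for (int s = 0; s < S; s++)  /* SIMD target  */
                            acc += act[(c*H + y+r)*W + (x+s)]
                                 * wei[((k*C + c)*R + r)*S + s];
                out[(k*Ho + y)*Wo + x] = acc;
            }
}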
CNN is Sparse
• About 60% of weights and activations are zeros
• Zeros in activations are dynamically generated after ReLU
• Zeros in weights are obtained with network pruning
• Sparsity is promising for speedup (zero-skipping) and energy reduction (smaller memory footprint)
A. Parashar et al., SCNN: An Accelerator for Compressed-Sparse Convolutional Neural Networks, ISCA 2017
S. Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Sparse CNN on SIMD?
[Figure: SIMD-like architecture operating on compressed-sparse data]
Wrong Alignment! With both operands compressed, the i-th stored weight and the i-th stored activation generally come from different original positions, so the SIMD lanes multiply the wrong pairs.
Sparse CNN on SIMD!
[Figure: SIMD-like architecture]
A Simple Sparse Layer
[Figure: a sparse layer computing output o1. Non-zero weights: w3, w5, w6. Non-zero activations: a2, a3, a5, a7.]
Compressed-Sparse Data: Only Keep Non-Zeros
[Figure: the same layer with only non-zeros kept.]
Stored Weights: w3 w5 w6
Stored Activations: a2 a3 a5 a7
Plus Index: Bit-Vector Recording Zero/Non-Zero
Act Index:    0 1 1 0 1 0 1 0
Weight Index: 0 0 1 0 1 1 0 0
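A minimal C sketch of this storage scheme, using the slide's eight-element example; the struct layout and field names are my own illustration, not the paper's exact format.

#include <stdint.h>

/* Compressed-sparse storage: packed non-zero values plus a bit-vector
 * index with one bit per original position (1 = non-zero).           */
typedef struct {
    uint8_t index[8];   /* e.g. weight index {0,0,1,0,1,1,0,0}        */
    float   value[8];   /* packed non-zeros, e.g. {w3, w5, w6}        */
    int     nnz;        /* how many packed values are valid           */
} csvec8;

/* Expand back to dense form: scan the index bits and pop the next
 * packed value whenever a bit is set.                                */
void cs_decode(const csvec8 *v, float dense[8])
{
    int p = 0;
    for (int i = 0; i < 8; i++)
        dense[i] = v->index[i] ? v->value[p++] : 0.0f;
}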
Target of DIM: Find Out Effectual Pairs
[Figure: an effectual pair is a position where both the activation and the weight are non-zero; with Act Index 0 1 1 0 1 0 1 0 and Weight Index 0 0 1 0 1 1 0 0, the effectual pairs are (a3, w3) and (a5, w5).]
Dual Indexing Module: Step 1
Bitwise AND the two index vectors to mark the positions where both operands are non-zero:
Activation Index:   0 1 1 0 1 0 1 0
Weight Index:       0 0 1 0 1 1 0 0
Co-activated Index: 0 0 1 0 1 0 0 0
Dual Indexing Module: Step 2
Accumulate (prefix-sum) each index vector; at every position the running sum is that element's address in the corresponding compressed array:
Activation Index: 0 1 1 0 1 0 1 0 accumulated to 0 1 2 2 3 3 4 4
Weight Index:     0 0 1 0 1 1 0 0 accumulated to 0 0 1 1 2 3 3 3
Dual Indexing Module: Step 3
Mask the accumulated sums with the co-activated index (0 0 1 0 1 0 0 0) to keep only the addresses of effectual elements:
Activation addresses: 0 1 2 2 3 3 4 4 masked to 0 0 2 0 3 0 0 0
Weight addresses:     0 0 1 1 2 3 3 3 masked to 0 0 1 0 2 0 0 0
Dual Indexing Module: Step 4
Use the surviving addresses to MUX the effectual operands out of the compressed arrays:
Stored activations a2 a3 a5 a7: addresses 2 and 3 select a3 and a5 (Effectual Activations)
Stored weights w3 w5 w6: addresses 1 and 2 select w3 and w5 (Effectual Weights)
Alignment Issue Solved!
[Figure: the complete DIM datapath (AND, accumulate, mask, MUX) delivering the aligned effectual pairs (a3, w3) and (a5, w5) to the SIMD lanes.]
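Putting the four steps together, here is a hedged software model of the DIM for one eight-element window. In hardware the accumulations are parallel adders and the selection is done by MUXes; this sequential C loop only mirrors the logic of the slides.

#include <stdint.h>

/* Software model of the Dual Indexing Module (DIM).
 * act_idx/wei_idx: index bit-vectors; act_val/wei_val: packed non-zeros.
 * Writes aligned effectual pairs and returns how many were found.    */
int dim(const uint8_t act_idx[8], const float act_val[],
        const uint8_t wei_idx[8], const float wei_val[],
        float eff_act[8], float eff_wei[8])
{
    int n = 0;                   /* number of effectual pairs          */
    int acc_a = 0, acc_w = 0;    /* Step 2: running (prefix) sums      */
    for (int i = 0; i < 8; i++) {
        acc_a += act_idx[i];     /* address of act i in packed array   */
        acc_w += wei_idx[i];     /* address of weight i in packed array*/
        if (act_idx[i] & wei_idx[i]) {       /* Step 1: co-activated   */
            /* Steps 3-4: the masked addresses select (MUX) from the
             * packed arrays; running sums are 1-based, hence the -1.  */
            eff_act[n] = act_val[acc_a - 1];
            eff_wei[n] = wei_val[acc_w - 1];
            n++;
        }
    }
    return n;
}

Fed with the slide's data, act index {0,1,1,0,1,0,1,0} over packed {a2, a3, a5, a7} and weight index {0,0,1,0,1,1,0,0} over packed {w3, w5, w6}, this returns the two aligned pairs (a3, w3) and (a5, w5), ready for the MAC lanes.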
Accelerator Design
• Extended from Cambricon-X [MICRO 2016]
S. Zhang et al., Cambricon-X: An Accelerator for Sparse Neural Networks, MICRO 2016
[Block diagram: off-chip memory (DRAM) connected through DMA to the accelerator; an Encoder, a Controller, input/output activation buffers (AB-In, AB-Out), and an array of PEs.]
Accelerator Design
• Plug DIM into each PE
[Block diagram: each PE receives the non-zero activations and their index; inside the PE, the DIM sits between the weight buffer (WB) and the MAC units.]
Accelerator Design
• Encode output activations on-the-fly
[Figure: the Encoder compares each element of the uncompressed activation vector (e.g. 0 a2 0 a4 a5 0 a7 0) with zero, producing the index bits 0 1 0 1 1 0 1 0 and writing only the non-zero activations a2 a4 a5 a7 to AB-Out.]
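A small C sketch of that on-the-fly encoding; the zero test and output layout follow the slide, but the function itself and its signature are illustrative.

#include <stdint.h>

/* On-the-fly output encoding: compare each computed activation with
 * zero, record one index bit per position, and pack only the non-zeros
 * into the output buffer (AB-Out). Returns the non-zero count.       */
int encode_acts(const float *uncompressed, int len,
                uint8_t *index, float *packed)
{
    int nnz = 0;
    for (int i = 0; i < len; i++) {
        index[i] = (uncompressed[i] != 0.0f);   /* the "= 0?" compare */
        if (index[i])
            packed[nnz++] = uncompressed[i];
    }
    return nnz;  /* e.g. {0,a2,0,a4,a5,0,a7,0} packs to {a2,a4,a5,a7} */
}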
Evaluation Methodology
• Logic: synthesis with TSMC 40nm
• SRAM and DRAM: CACTI
• Benchmark: open Sparse-AlexNet + ImageNet data
• Experiments: in-house performance simulator
M. Naveen et al., CACTI 6.0: A Tool to Model Large Caches, HP Laboratories, 2009
Sparse AlexNet, https://github.com/songhan/Deep-Compression-AlexNet
J. Deng et al., ImageNet: A Large-Scale Hierarchical Image Database, CVPR 2009
Accelerator Variants
Acc    Act     Weights  Index  Encoder  Area (mm^2)  Power (mW)
DAW    Dense   Dense    N/A    N/A      2.05         395
SpA    Sparse  Dense    IM     ✔        2.15         428
SpW    Dense   Sparse   IM     N/A      2.23         441
SpAW   Sparse  Sparse   DIM    ✔        2.34         472
• Overheads of SpAW relative to DAW: 14.4% in area and 19.5% in power
DRAM Access Volume
• 47.3% less DRAM access volume compared to DAW
[Bar chart: normalized DRAM access volume (split into DRAM-Act and DRAM-Wei) for DAW, SpA, SpW, and SpAW.]
Energy Consumption
• 46% energy reduction compared to DAW
[Bar chart: normalized energy consumption (split into FU, ABs, WBs, DRAM-Act, DRAM-Wei) for DAW, SpA, SpW, and SpAW.]
Energy-Delay Product
• 55.4% EDP reduction compared to DAW
[Bar chart: normalized EDP for DAW, SpA, SpW, and SpAW.]
Summary
• SIMD-like accelerators have an alignment issue when performing sparse CNN
• We propose a novel Dual Indexing Module (DIM) to handle the alignment issue efficiently
• By keeping data in a compressed-sparse format, a CNN accelerator with DIM reduces DRAM access volume, energy consumption, and EDP by 47.3%, 46%, and 55.4%, respectively
Thank You!
Additional Materials
Design Parameters of SpAW
Related Work - Cnvlutin
• Decouple neuron lanes to do zero-skipping in neurons
[Figure: CNV unit; Zero-Free Neuron Array Format (ZFNAF).]
J. Albericio et al., CNVLUTIN: Ineffectual-neuron-free DNN computing, ISCA 2016
Related Work - Cambricon-X
• Utilizing weight sparsity with step indexing (a compressed-sparse format)
[Figure: Step Indexing Module.]
S. Zhang et al., Cambricon-X: An Accelerator for Sparse Neural Networks, MICRO 2016