supporting compressed -sparse activations and weights on … · diannao, zhang 2015, cambricon-x]...

35
Supporting Compressed-Sparse Activations and Weights on SIMD-like Accelerator for Sparse Convolutional Neural Networks Chien-Yu Lin and Bo-Cheng Lai Institute of Electronics Engineering National Chiao Tung University

Upload: others

Post on 23-Sep-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Supporting Compressed-Sparse Activations and Weights on SIMD-like Accelerator for Sparse Convolutional

Neural Networks

Chien-Yu Lin and Bo-Cheng Lai

Institute of Electronics EngineeringNational Chiao Tung University

Page 2: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Convolutional Neural Network• CNN now dominants visual recognition applications• Face recognition, object detection, autonomous vehicles…

• Major components: deep convolutional layers

2

Network Structure of VGG-16k. Simonyan etal., VeryDeepConvolutionalNetworksforLarge-ScaleImageRecognition,ICLR2015http://file.scirp.org/Html/4-7800353_65406.htm

Page 3: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Convolutional Layer

3

Computations of a conv layer

• A lot of parallel multiplications and additions

Page 4: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

CNN Acceleration with SIMD

4

• MAC unit can efficiently perform CNN and thus, adopted by many CNN accelerators [Google TPU, DianNao, Zhang 2015, Cambricon-X]

SIMD-like ArchitectureN.P.Jouppi etal.,In-DatacenterPerformanceAnalysisofaTensorProcessingUnit,ISCA2017T. Chen et al., DianNao:asmall-footprinthigh-throughput acceleratorforubiquitousmachine learning,ASPLOS2014C. Zhang et al., Optimizingfpga-based acceleratordesignfordeepconvolutionalneuralnetworks,FPGA2015S. Zhang et al., Cambricon-X: Anacceleratorforsparseneuralnetworks,MICRO2016

Page 5: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

CNN Acceleration with SIMD

5SIMD-like Architecture

Page 6: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

CNN Acceleration with SIMD

6SIMD-like Architecture

Page 7: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

CNN is Sparse• About 60% of weights and activations are ZEROs• Zeros in activations are dynamically generated after ReLU• Zeros in weights are obtained with Network Pruning

• Sparsity is promising for speedup (Zero-skipping) and energy reduction (smaller memory footprint)

7

A. Parashar et al., SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks, ISCA 2017S. Han et al., Learningbothweightsandconnections forefficient neuralnetwork,NIPS2015

Page 8: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Sparse CNN on SIMD?

8SIMD-like Architecture

Page 9: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Sparse CNN on SIMD?

9SIMD-like Architecture

Page 10: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Sparse CNN on SIMD?

10SIMD-like Architecture

Wrong Alignment!

Page 11: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Sparse CNN on SIMD!

11SIMD-like Architecture

Page 12: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

A Simple Sparse Layer

12

w3

w5

w6

Weight Output

o1

Act

a2

a5

a3

a7

Page 13: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

A Simple Sparse Layer

13

w3

w5

w6

Weight Output

o1

Act

a2

a5

a3

a7

Page 14: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Compressed-Sparse Data: Only Keep Non-Zeros

14

w3

w5

w6

Weight Output

o1

Act

a2

a5

a3

a7

w3 w5 w6Stored

Weights

a2 a3 a5Stored

Activationsa7

Page 15: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

w3

w5

w6

Weight Output

o1

Act

a2

a5

a3

a7

w3 w5 w6Stored

Weights

a2 a3 a5Stored

Activationsa7

0

1

1

0

1

0

1

0

ActIndex

0

0

1

0

1

1

0

0

WeightIndex

Plus Index: Bit-Vector Recording Zero/Non-Zero

15

Page 16: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

w3

w5

w6

Weight Output

o1

Act

a2

a5

a3

a7

Effectual Pairs

w3 w5 w6Stored

Weights

a2 a3 a5Stored

Activationsa7

0

1

1

0

1

0

1

0

ActIndex

0

0

1

0

1

1

0

0

WeightIndex

Target of DIM: Find Out Effectual Pairs

16

Page 17: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Dual Indexing Module: Step1

17

Activation Index0 1 1 0 1 0 1 00 1 1 0 1 0 1 0

AND

0 0 1 0 1 1 0 0

Co-activated Index0 0 1 0 1 0 0 0

1

0 0 1 0 1 1 0 0

Weight Index

w3

w5

w6

Weight Output

o1

Act

a2

a5

a3

a7

Effectual Pairs

w3 w5 w6Stored

Weights

a2 a3 a5Stored

Activationsa7

0

1

1

0

1

0

1

0

ActIndex

0

0

1

0

1

1

0

0

WeightIndex

Page 18: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Dual Indexing Module: Step2

18

Activation Index0 1 1 0 1 0 1 00 1 1 0 1 0 1 0

+ + + + + + + +

0 1 2 2 3 3 4 4

AND

0 0 1 0 1 1 0 0

Co-activated Index0 0 1 0 1 0 0 0

12

0 0 1 0 1 1 0 0

+ + + + + + + +

0 0 1 1 2 3 3 3

Weight Index

2

Page 19: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Dual Indexing Module: Step3

19

Activation Index0 1 1 0 1 0 1 00 1 1 0 1 0 1 0

+ + + + + + + +

0 1 2 2 3 3 4 4

AND

0 0 1 0 1 0 0 0

0 0 2 0 3 0 0 0

AND

0 0 1 0 1 1 0 0

Co-activated Index0 0 1 0 1 0 0 0

12

3

0 0 1 0 1 1 0 0

+ + + + + + + +

0 0 1 1 2 3 3 3

AND

0 0 1 0 1 0 0 0

0 0 1 0 2 0 0 0

Weight Index

2

3

Page 20: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Dual Indexing Module: Step4

20

Activation Index0 1 1 0 1 0 1 00 1 1 0 1 0 1 0

+ + + + + + + +

0 1 2 2 3 3 4 4

AND

0 0 1 0 1 0 0 0

0 0 2 0 3 0 0 0

AND

0 0 1 0 1 1 0 0

w3 w5 w6Stored

WeightsStored

Activationsa2 a3 a5 a7

Effectual Activations

Co-activated Index0 0 1 0 1 0 0 0

12

3

0 0 1 0 1 1 0 0

+ + + + + + + +

0 0 1 1 2 3 3 3

AND

0 0 1 0 1 0 0 0

0 0 1 0 2 0 0 0

Weight Index

2

3

4 4

a5a3 w5w3

MUX MUX

Effectual Weights

Page 21: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Alignment Issue Solved!

21

Activation Index0 1 1 0 1 0 1 00 1 1 0 1 0 1 0

+ + + + + + + +

0 1 2 2 3 3 4 4

AND

0 0 1 0 1 0 0 0

0 0 2 0 3 0 0 0

AND

0 0 1 0 1 1 0 0

w3 w5 w6Stored

WeightsStored

Activationsa2 a3 a5 a7

Effectual Activations

Co-activated Index0 0 1 0 1 0 0 0

12

3

0 0 1 0 1 1 0 0

+ + + + + + + +

0 0 1 1 2 3 3 3

AND

0 0 1 0 1 0 0 0

0 0 1 0 2 0 0 0

Weight Index

2

3

4 4

a5a3 w5w3

MUX MUX

Effectual Weights

Page 22: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Accelerator Design• Extended from Cambricon-X [MICRO 2016]

22S. Zhang et al., Cambricon-X: An Accelerator for Sparse Neural Networks, MICRO 2016

Encoder

DMA

Off-

chip

Mem

ory

(DR

AM

)

Controller

AB-Out

AB-In

PE

PE

PE

Page 23: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Accelerator Design• Plug DIM into each PE

23

Encoder

DMA

Off-

chip

Mem

ory

(DR

AM

)

Controller

AB-Out

AB-In

PE

PE

PE

PENon-zero A

ct &

Index

WB DIM

MAC

Page 24: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Accelerator Design• Encode output activations on-the-fly

24

Encoder

DMA

Off-

chip

Mem

ory

(DR

AM

)

Controller

AB-Out

AB-In

PE

PE

PE

0 a2 0 a4 a5 0 a7 0

0 1 0 1 1 0 1 0

= 0?

Uncompressed Act

Index

Non-zero Act

AB

-Out

a2 a4 a5 a7

Encoder

Page 25: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Evaluation Methodology• Logic: Synthesis with TSMC 40nm• SRAM and DRAM: CACTI• Benchmark: Open Sparse-AlexNet + ImageNet Data• Experiments: In-house performance simulator

25

M.Naveenetal.,CACTI6.0:ATooltoModelLargeCaches,HPLaboratories,2009Sparse AlexNet, https://github.com/songhan/D eep-Compression-Al exNetJ. Deng et al., Imagenet:Alarge-scalehierarchicalimagedatabase,CVPR,2009

Page 26: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Accelerator Variants

26

Acc Act Weights Index Encoder Area(mm2) Power(mW)

DAW Dense Dense N/A N/A 2.05 395

SpA Sparse Dense IM ✔ 2.15 428

SpW Dense Sparse IM N/A 2.23 441

SpAW Sparse Sparse DIM ✔ 2.34 472

• Overheads: 14.4% in Area and 19.5% in Power

Page 27: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

DRAM Access Volume• 47.3% less in DRAM access volume compared to DAW

27

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1.1

DAW SpA SpW SpAW

Normalize

dDR

AMAccessVo

lume

Accelerator

DRAM-Act DRAM-Wei

Page 28: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Energy Consumption• 46% energy reduction compared to DAW

28

00.10.20.30.40.50.60.70.80.91

1.1

DAW SpA SpW SpAWNormalize

dEnergyCon

sumption

Accelerator

FU ABs WBs DRAM-Act DRAM-Wei

Page 29: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Energy-Delay-Product• 55.4% EDP reduction compared to DAW

29

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1.1

DAW SpA SpW SpAW

Normalize

dED

P

Accelerator

Page 30: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Summary• SIMD-like accelerator has alignment issue whileperforming sparse CNN• We propose a novel Dual Indexing Module (DIM)to handle the alignment issue efficiently• By keeping data in a compressed-sparse format,a CNN accelerator with DIM can reduce DRAMaccess volume, energy consumption and EDP for47.3%, 46% and 55.4%

30

Page 31: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Thank You!

Page 32: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Additional Materials

Page 33: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Design Parameters of SpAW

33

Page 34: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Related Work - Cnvlutin• Decouple neuron lanes to do zero-skipping in neurons

34

CNVUnitZero-FreeNeuronArray

Format(ZFNAF)

J. Albericio et al., CNVLUTIN: Ineffectual-neuron-free DNN computing, ISCA 2016

Page 35: Supporting Compressed -Sparse Activations and Weights on … · DianNao, Zhang 2015, Cambricon-X] SIMD-like Architecture N. P. Jouppiet al., In-Datacenter Performance Analysis of

Related Work - Cambricon-X• Utilizing weight sparsity with step indexing (a compressed-sparse format)

35

StepIndexingModule

S. Zhang et al., Cambricon-X: An Accelerator for Sparse Neural Networks, MICRO 2016