TRANSCRIPT
Supporting Compressed-Sparse Activations and Weights on SIMD-like Accelerator for Sparse Convolutional Neural Networks
Chien-Yu Lin and Bo-Cheng Lai
Institute of Electronics Engineering, National Chiao Tung University
Convolutional Neural Network
• CNN now dominates visual recognition applications
• Face recognition, object detection, autonomous vehicles…
• Major components: deep convolutional layers
Network Structure of VGG-16
K. Simonyan et al., Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015
http://file.scirp.org/Html/4-7800353_65406.htm
Convolutional Layer
Computations of a conv layer
• A lot of parallel multiplications and additions
CNN Acceleration with SIMD
• MAC units can efficiently perform CNN computations and have thus been adopted by many CNN accelerators [Google TPU, DianNao, Zhang 2015, Cambricon-X]
SIMD-like Architecture
N. P. Jouppi et al., In-Datacenter Performance Analysis of a Tensor Processing Unit, ISCA 2017
T. Chen et al., DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine Learning, ASPLOS 2014
C. Zhang et al., Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks, FPGA 2015
S. Zhang et al., Cambricon-X: An Accelerator for Sparse Neural Networks, MICRO 2016
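As a concrete picture of why MAC-heavy SIMD hardware fits CNNs so well, here is a minimal C sketch of a convolutional layer as nested multiply-accumulate loops; the array names and the no-padding, stride-1 bounds are illustrative assumptions, not from the talk.

/* Minimal sketch of a conv layer as multiply-accumulate (MAC) loops.
 * Layouts: act[C][H][W], wei[K][C][R][S], out[K][Ho][Wo].           */
void conv_layer(const float *act, const float *wei, float *out,
                int C, int H, int W, int K, int R, int S)
{
    int Ho = H - R + 1, Wo = W - S + 1;   /* no padding, stride 1 */
    for (int k = 0; k < K; k++)
        for (int y = 0; y < Ho; y++)
            for (int x = 0; x < Wo; x++) {
                float acc = 0.0f;
                for (int c = 0; c < C; c++)          /* these inner  */
                    for (int r = 0; r < R; r++)      /* MACs are the */
                        for (int s = 0; s < S; s++)  /* SIMD target  */
                            acc += act[(c*H + y+r)*W + (x+s)]
                                 * wei[((k*C + c)*R + r)*S + s];
                out[(k*Ho + y)*Wo + x] = acc;
            }
}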
CNN is Sparse
• About 60% of weights and activations are zeros
• Zeros in activations are dynamically generated after ReLU
• Zeros in weights are obtained with network pruning
• Sparsity is promising for speedup (zero-skipping) and energy reduction (smaller memory footprint)
A. Parashar et al., SCNN: An Accelerator for Compressed-Sparse Convolutional Neural Networks, ISCA 2017
S. Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Sparse CNN on SIMD?
[Figure: SIMD-like architecture operating on compressed-sparse data]
Wrong Alignment! With both operands compressed, the i-th stored weight and the i-th stored activation generally come from different original positions, so the SIMD lanes multiply the wrong pairs.
Sparse CNN on SIMD!
[Figure: SIMD-like architecture]
A Simple Sparse Layer
[Figure: a sparse layer computing output o1. Non-zero weights: w3, w5, w6. Non-zero activations: a2, a3, a5, a7.]
Compressed-Sparse Data: Only Keep Non-Zeros
[Figure: the same layer with only non-zeros kept.]
Stored Weights: w3 w5 w6
Stored Activations: a2 a3 a5 a7
Plus Index: Bit-Vector Recording Zero/Non-Zero
Act Index:    0 1 1 0 1 0 1 0
Weight Index: 0 0 1 0 1 1 0 0
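A minimal C sketch of this storage scheme, using the slide's eight-element example; the struct layout and field names are my own illustration, not the paper's exact format.

#include <stdint.h>

/* Compressed-sparse storage: packed non-zero values plus a bit-vector
 * index with one bit per original position (1 = non-zero).           */
typedef struct {
    uint8_t index[8];   /* e.g. weight index {0,0,1,0,1,1,0,0}        */
    float   value[8];   /* packed non-zeros, e.g. {w3, w5, w6}        */
    int     nnz;        /* how many packed values are valid           */
} csvec8;

/* Expand back to dense form: scan the index bits and pop the next
 * packed value whenever a bit is set.                                */
void cs_decode(const csvec8 *v, float dense[8])
{
    int p = 0;
    for (int i = 0; i < 8; i++)
        dense[i] = v->index[i] ? v->value[p++] : 0.0f;
}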
Target of DIM: Find Out Effectual Pairs
[Figure: an effectual pair is a position where both the activation and the weight are non-zero; with Act Index 0 1 1 0 1 0 1 0 and Weight Index 0 0 1 0 1 1 0 0, the effectual pairs are (a3, w3) and (a5, w5).]
Dual Indexing Module: Step 1
Bitwise AND the two index vectors to mark the positions where both operands are non-zero:
Activation Index:   0 1 1 0 1 0 1 0
Weight Index:       0 0 1 0 1 1 0 0
Co-activated Index: 0 0 1 0 1 0 0 0
Dual Indexing Module: Step 2
Accumulate (prefix-sum) each index vector; at every position the running sum is that element's address in the corresponding compressed array:
Activation Index: 0 1 1 0 1 0 1 0 accumulated to 0 1 2 2 3 3 4 4
Weight Index:     0 0 1 0 1 1 0 0 accumulated to 0 0 1 1 2 3 3 3
Dual Indexing Module: Step 3
Mask the accumulated sums with the co-activated index (0 0 1 0 1 0 0 0) to keep only the addresses of effectual elements:
Activation addresses: 0 1 2 2 3 3 4 4 masked to 0 0 2 0 3 0 0 0
Weight addresses:     0 0 1 1 2 3 3 3 masked to 0 0 1 0 2 0 0 0
Dual Indexing Module: Step 4
Use the surviving addresses to MUX the effectual operands out of the compressed arrays:
Stored activations a2 a3 a5 a7: addresses 2 and 3 select a3 and a5 (Effectual Activations)
Stored weights w3 w5 w6: addresses 1 and 2 select w3 and w5 (Effectual Weights)
Alignment Issue Solved!
[Figure: the complete DIM datapath (AND, accumulate, mask, MUX) delivering the aligned effectual pairs (a3, w3) and (a5, w5) to the SIMD lanes.]
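Putting the four steps together, here is a hedged software model of the DIM for one eight-element window. In hardware the accumulations are parallel adders and the selection is done by MUXes; this sequential C loop only mirrors the logic of the slides.

#include <stdint.h>

/* Software model of the Dual Indexing Module (DIM).
 * act_idx/wei_idx: index bit-vectors; act_val/wei_val: packed non-zeros.
 * Writes aligned effectual pairs and returns how many were found.    */
int dim(const uint8_t act_idx[8], const float act_val[],
        const uint8_t wei_idx[8], const float wei_val[],
        float eff_act[8], float eff_wei[8])
{
    int n = 0;                   /* number of effectual pairs          */
    int acc_a = 0, acc_w = 0;    /* Step 2: running (prefix) sums      */
    for (int i = 0; i < 8; i++) {
        acc_a += act_idx[i];     /* address of act i in packed array   */
        acc_w += wei_idx[i];     /* address of weight i in packed array*/
        if (act_idx[i] & wei_idx[i]) {       /* Step 1: co-activated   */
            /* Steps 3-4: the masked addresses select (MUX) from the
             * packed arrays; running sums are 1-based, hence the -1.  */
            eff_act[n] = act_val[acc_a - 1];
            eff_wei[n] = wei_val[acc_w - 1];
            n++;
        }
    }
    return n;
}

Fed with the slide's data, act index {0,1,1,0,1,0,1,0} over packed {a2, a3, a5, a7} and weight index {0,0,1,0,1,1,0,0} over packed {w3, w5, w6}, this returns the two aligned pairs (a3, w3) and (a5, w5), ready for the MAC lanes.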
Accelerator Design
• Extended from Cambricon-X [MICRO 2016]
S. Zhang et al., Cambricon-X: An Accelerator for Sparse Neural Networks, MICRO 2016
[Block diagram: off-chip memory (DRAM) connected through DMA to the accelerator; an Encoder, a Controller, input/output activation buffers (AB-In, AB-Out), and an array of PEs.]
Accelerator Design
• Plug DIM into each PE
[Block diagram: each PE receives the non-zero activations and their index; inside the PE, the DIM sits between the weight buffer (WB) and the MAC units.]
Accelerator Design
• Encode output activations on-the-fly
[Figure: the Encoder compares each element of the uncompressed activation vector (e.g. 0 a2 0 a4 a5 0 a7 0) with zero, producing the index bits 0 1 0 1 1 0 1 0 and writing only the non-zero activations a2 a4 a5 a7 to AB-Out.]
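A small C sketch of that on-the-fly encoding; the zero test and output layout follow the slide, but the function itself and its signature are illustrative.

#include <stdint.h>

/* On-the-fly output encoding: compare each computed activation with
 * zero, record one index bit per position, and pack only the non-zeros
 * into the output buffer (AB-Out). Returns the non-zero count.       */
int encode_acts(const float *uncompressed, int len,
                uint8_t *index, float *packed)
{
    int nnz = 0;
    for (int i = 0; i < len; i++) {
        index[i] = (uncompressed[i] != 0.0f);   /* the "= 0?" compare */
        if (index[i])
            packed[nnz++] = uncompressed[i];
    }
    return nnz;  /* e.g. {0,a2,0,a4,a5,0,a7,0} packs to {a2,a4,a5,a7} */
}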
Evaluation Methodology
• Logic: synthesis with TSMC 40nm
• SRAM and DRAM: CACTI
• Benchmark: open Sparse-AlexNet + ImageNet data
• Experiments: in-house performance simulator
M. Naveen et al., CACTI 6.0: A Tool to Model Large Caches, HP Laboratories, 2009
Sparse AlexNet, https://github.com/songhan/Deep-Compression-AlexNet
J. Deng et al., ImageNet: A Large-Scale Hierarchical Image Database, CVPR 2009
Accelerator Variants
Acc    Act     Weights  Index  Encoder  Area (mm^2)  Power (mW)
DAW    Dense   Dense    N/A    N/A      2.05         395
SpA    Sparse  Dense    IM     ✔        2.15         428
SpW    Dense   Sparse   IM     N/A      2.23         441
SpAW   Sparse  Sparse   DIM    ✔        2.34         472
• Overheads of SpAW relative to DAW: 14.4% in area and 19.5% in power
DRAM Access Volume
• 47.3% less DRAM access volume compared to DAW
[Bar chart: normalized DRAM access volume (split into DRAM-Act and DRAM-Wei) for DAW, SpA, SpW, and SpAW.]
Energy Consumption
• 46% energy reduction compared to DAW
[Bar chart: normalized energy consumption (split into FU, ABs, WBs, DRAM-Act, DRAM-Wei) for DAW, SpA, SpW, and SpAW.]
Energy-Delay Product
• 55.4% EDP reduction compared to DAW
[Bar chart: normalized EDP for DAW, SpA, SpW, and SpAW.]
Summary
• SIMD-like accelerators have an alignment issue when performing sparse CNN
• We propose a novel Dual Indexing Module (DIM) to handle the alignment issue efficiently
• By keeping data in a compressed-sparse format, a CNN accelerator with DIM reduces DRAM access volume, energy consumption, and EDP by 47.3%, 46%, and 55.4%, respectively
Thank You!
Additional Materials
Design Parameters of SpAW
Related Work - Cnvlutin
• Decouple neuron lanes to do zero-skipping in neurons
[Figure: CNV unit; Zero-Free Neuron Array Format (ZFNAF).]
J. Albericio et al., CNVLUTIN: Ineffectual-neuron-free DNN computing, ISCA 2016
Related Work - Cambricon-X
• Utilizing weight sparsity with step indexing (a compressed-sparse format)
[Figure: Step Indexing Module.]
S. Zhang et al., Cambricon-X: An Accelerator for Sparse Neural Networks, MICRO 2016