bit fusion - welcome to iscaconf.org · oc fully-connected layer loop: for j in (1 oc) loop: for k...
TRANSCRIPT
![Page 1: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/1.jpg)
Bit Fusion: Bit-Level Dynamically Composable Architecture for Deep Neural Networks
Hardik Sharma, Jongse Park, Naveen Suda†, Liangzhen Lai†, Benson Chau, Vikas Chandra†, Hadi Esmaeilzadeh‡
Alternative Computing Technologies (ACT) Lab
Georgia Institute of Technology
†Arm, Inc.
‡University of California, San Diego
![Page 2: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/2.jpg)
DNNs Tolerate Low-Bitwidth Operations
More than 99.4% of multiply-adds require fewer than 8 bits
[Figure: per-benchmark bar chart with a 0% to 100% axis. Benchmarks: AlexNet, CIFAR10, LSTM, LeNet-5, RESNET-18, RNN, SVHN, VGG-7, Avg. Legend: 1bit/1bit, 2bit/2bit, 4bit/4bit, 8bit/1bit, 8bit/8bit.]
![Page 3: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/3.jpg)
Bitwidth Flexibility is Necessary for Accuracy
A fixed-bitwidth accelerator would either achieve limited benefits (8-bit) or compromise on accuracy (<8-bit)
AlexNet, ImageNet dataset (Mishra et al., WRPN, arXiv 2017): Conv. 8b/8b, Conv. 4b/4b (×4), FC 4b/4b (×2), FC 8b/8b
LeNet, MNIST dataset (Li et al., TWN, arXiv 2016): Conv. 2b/2b (×2), FC 2b/2b (×2)
![Page 4: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/4.jpg)
Our Approach: Bit-level Composability
BitBricks (BBs) are bit-level composable compute units
[Figure: a BitBrick (BB) with inputs sx, x1, x0 and sy, y1, y0, a sign-mode control, and width labels 3, 3, and 6; a Fusion Unit built from 16 BitBricks with adders (+), a WBUF, and Psum Forward and Input Forward paths.]
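A minimal sketch of what one BitBrick might compute, assuming each brick multiplies two 2-bit fields and that the sign-mode flags (sx, sy) choose between unsigned and two's-complement interpretation. The function name `bitbrick` and the exact flag semantics are assumptions for illustration, not taken from the slides.

```python
def bitbrick(x: int, y: int, sx: bool, sy: bool) -> int:
    """Multiply two 2-bit fields.

    sx / sy are assumed sign-mode flags: when set, the matching field is
    read as 2-bit two's complement (-2..1) instead of unsigned (0..3).
    """
    if not (0 <= x <= 3 and 0 <= y <= 3):
        raise ValueError("operands must be 2-bit fields")
    xv = x - 4 if (sx and x >= 2) else x
    yv = y - 4 if (sy and y >= 2) else y
    return xv * yv


# Enumerate every operand and sign-mode combination.
products = [bitbrick(x, y, sx, sy)
            for x in range(4) for y in range(4)
            for sx in (False, True) for sy in (False, True)]
print(min(products), max(products))
```

Under these assumptions every product lies between -6 and 9, which fits comfortably in the 6-bit signed result suggested by the width labels in the figure.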
![Page 5: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/5.jpg)
Compute units (BitBricks) logically fuse at runtime to form Fused-PEs (F-PEs) that dynamically match the bitwidth of the DNN layers
[Figure: four configurations of a Fusion Unit, each with a WBUF and Input forward and Psum forward paths.
(a) Fusion Unit with 16 BitBricks
(b) 16x Parallelism, Binary (1-bit) or Ternary (2-bit): 16 F-PEs
(c) 4x Parallelism, Mixed-Bitwidth (2-bit weights, 8-bit inputs): 4 F-PEs
(d) No Parallelism, 8-bits: 1 F-PE]
![Page 6: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/6.jpg)
Config #1: Binary/Ternary Mode
Each BitBrick performs a binary/ternary multiplication
16x parallelism
[Figure: Fusion Unit of 16 BitBricks, each acting as one F-PE. Labels: Input (2-bit), Weight (2-bit).]
![Page 7: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/7.jpg)
Config #2: 4-bit Mode
Four BitBricks fuse to form a Fused-PE (F-PE)
4x Parallelism
[Figure: Fusion Unit of 16 BitBricks grouped into F-PEs. Labels: Input (4-bit) and Weight (4-bit), each shown as 2-bit pieces; Partial Products.]
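The fusion of four BitBricks into one 4-bit multiplier can be sketched as a shift-add over 2-bit partial products. The code below is an illustrative model, not the hardware design: it assumes even bitwidths, two's-complement operands, and that only the top 2-bit slice of each operand is treated as signed. The helper names `split_2bit`, `brick`, and `fused_multiply` are hypothetical.

```python
def split_2bit(value: int, bits: int, signed: bool):
    """Split a `bits`-wide integer into 2-bit slices, least significant first.

    Returns (slice, is_signed) pairs; only the top slice carries the sign.
    """
    raw = value & ((1 << bits) - 1)  # two's-complement encoding
    count = bits // 2
    return [((raw >> (2 * i)) & 0b11, signed and i == count - 1)
            for i in range(count)]


def brick(x: int, y: int, sx: bool, sy: bool) -> int:
    """One 2-bit by 2-bit multiply with per-operand sign mode."""
    xv = x - 4 if (sx and x >= 2) else x
    yv = y - 4 if (sy and y >= 2) else y
    return xv * yv


def fused_multiply(a: int, b: int, a_bits: int, b_bits: int, signed: bool = True):
    """Return (product, bricks_used) by summing shifted partial products."""
    total = 0
    bricks_used = 0
    for i, (xa, sa) in enumerate(split_2bit(a, a_bits, signed)):
        for j, (xb, sb) in enumerate(split_2bit(b, b_bits, signed)):
            total += brick(xa, xb, sa, sb) << (2 * (i + j))
            bricks_used += 1
    return total, bricks_used


print(fused_multiply(-3, 5, 4, 4))
```

With 4-bit operands the loop visits four slice pairs, matching the four BitBricks per F-PE on this slide; the same function with an 8-bit and a 4-bit operand visits eight.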
![Page 8: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/8.jpg)
Config #3: 8-bit, 4-bit (Mixed-Mode)
Eight BitBricks fuse to form a Fused-PE (F-PE)
2x Parallelism
[Figure: Fusion Unit of 16 BitBricks forming F-PEs. Labels: Input (8-bit) shown as four 2-bit pieces, Weight (4-bit) shown as two 2-bit pieces, Partial Products.]
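The parallelism figures quoted across the three configurations follow from simple arithmetic, sketched here under the assumption that each 2-bit slice pair occupies one BitBrick and that a Fusion Unit holds 16 of them. The function names are illustrative.

```python
import math

BRICKS_PER_FUSION_UNIT = 16  # from the "Fusion Unit with 16 BitBricks" panel


def bricks_per_fpe(input_bits: int, weight_bits: int) -> int:
    """BitBricks needed for one multiply: one per pair of 2-bit slices."""
    return math.ceil(input_bits / 2) * math.ceil(weight_bits / 2)


def parallelism(input_bits: int, weight_bits: int) -> int:
    """Number of F-PEs a single Fusion Unit can form at these bitwidths."""
    return BRICKS_PER_FUSION_UNIT // bricks_per_fpe(input_bits, weight_bits)


for in_bits, w_bits in [(1, 1), (2, 2), (4, 4), (8, 2), (8, 4), (8, 8)]:
    print(f"{in_bits}-bit x {w_bits}-bit: "
          f"{bricks_per_fpe(in_bits, w_bits)} BitBricks per F-PE, "
          f"{parallelism(in_bits, w_bits)}x parallelism")
```

This reproduces the 16x, 4x, 2x, and 1x (no parallelism) cases from the slides, including the 4x mixed-bitwidth case with 8-bit inputs and 2-bit weights.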
![Page 9: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/9.jpg)
Spatial Fusion vs. Temporal Design
Temporal Design (Bit Serial): Combine results over time
[Figure: four Out blocks, each with a shifter (<<), fed by operand pairs a0 b0 to a3 b3, c0 d0 to c3 d3, e0 f0 to e3 f3, and g0 h0 to g3 h3. Axis: inputs over time (1, 2, 3).]
Spatial Fusion (Bit Parallel): Combine results over space
[Figure: one Out block with eight shifters (<<), fed by a0 b0, c0 d0, e0 f0, g0 h0, followed by the index-1, index-2, and index-3 pairs. Axis: inputs over time (1, 2, 3).]
![Page 10: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/10.jpg)
Spatial Fusion Surpasses Temporal Design

| Area (um^2) | BitBricks | Shift-Add | Register | Total Area |
| --- | --- | --- | --- | --- |
| Temporal | 463 | 2989 | 1454 | 4905 |
| Fusion Unit | 369 | 934 | 91 | 1394 |

| Power (nW) | BitBricks | Shift-Add | Register | Total Power |
| --- | --- | --- | --- | --- |
| Temporal | 60 | 550 | 1103 | 1712 |
| Fusion Unit | 46 | 424 | 69 | 538 |

Synthesized using a commercial 45 nm technology
3.5x lower area, 3.2x lower power
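The headline ratios can be checked directly from the totals on this slide. The short calculation below uses only the Total Area and Total Power figures quoted above; the variable names are illustrative.

```python
# Totals from the slide: area in um^2, power in nW.
area = {"Temporal": 4905, "Fusion Unit": 1394}
power = {"Temporal": 1712, "Fusion Unit": 538}

area_ratio = area["Temporal"] / area["Fusion Unit"]
power_ratio = power["Temporal"] / power["Fusion Unit"]

print(f"{area_ratio:.1f}x lower area")
print(f"{power_ratio:.1f}x lower power")
```

Both ratios round to the 3.5x and 3.2x quoted on the slide.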
![Page 11: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/11.jpg)
Bit Fusion Systolic Array Architecture
[Figure: a Control block and a 2 × 2 array of Fusion Units, each drawn as 16 BitBricks (BB) with adders (+). Other labels: four WBUFs, two IBUF (Shared) blocks, and two OBUFs, each OBUF followed by an adder (+), a Pooling Unit, and an Activation Unit.]
![Page 12: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/12.jpg)
Programmability: BitFusion ISA
Requirements
Amortize cost of bit-level fusion
Enable flexible Data-Path
Concise
![Page 13: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/13.jpg)
ISA: Amortize the Cost of Bit-Level Fusion
Use a block-structured ISA for groups of operations (layers)
[Figure: a sequence of instruction blocks. Layer labels: Conv 1, Convolution 8-bit/8-bit, Convolution 4-bit/1-bit, Convolution 4-bit/8-bit.]
Block begin: 8-bit/8-bit
Block end: next block
Block begin: 4-bit/1-bit
Block end: next block
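One way to picture the block structure is as a list of blocks, each of which fixes the operand bitwidths once and then names the block that follows. The sketch below is a hypothetical model for illustration only: the `Block` fields, the string instructions, and the `run` function are assumptions and do not reflect the actual Bit Fusion instruction encoding.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Block:
    bitwidths: Tuple[int, int]                     # pair as written on the slide, e.g. 8-bit/8-bit
    body: List[str] = field(default_factory=list)  # placeholder per-layer instructions
    next_block: Optional[int] = None               # index of the next block, None to stop


def run(program: List[Block]) -> List[Tuple[int, int]]:
    """Walk the blocks and record each fusion configuration that is applied."""
    configs = []
    index: Optional[int] = 0 if program else None
    while index is not None:
        block = program[index]
        configs.append(block.bitwidths)  # one reconfiguration per block
        for _instruction in block.body:
            pass  # every instruction in the body reuses the block's configuration
        index = block.next_block
    return configs


program = [
    Block((8, 8), ["convolution"], next_block=1),
    Block((4, 1), ["convolution"]),
]
print(run(program))
```

The point of the model is that the configuration cost is paid once per block (layer) rather than once per instruction.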
![Page 14: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/14.jpg)
ISA: Concise Expression for DNNs
Use loop instructions, as DNNs consist of a large number of repeated operations
[Figure: Fully-Connected Layer drawn as matrices with dimensions labeled IC, OC, and B.]
loop: for j in (1 OC)
loop: for k in (1 IC)
loop: for i in (1 B)
![Page 15: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/15.jpg)
ISA: Concise Expression for DNNs
DNNs have a regular memory access pattern
Use loop indices to generate memory accesses
[Figure: Fully-Connected Layer drawn as matrices with dimensions labeled IC, OC, and B.]
loop: for j in (1 OC)
loop: for k in (1 IC)
loop: for i in (1 B)
input: k × 1 + j × 0 + i × IC
weight: k × 1 + j × IC + i × 0
output: k × 0 + j × 1 + i × OC
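The address expressions on this slide can be exercised with a small fully-connected layer over flat buffers. This is a sketch under stated assumptions: indices are zero-based here (the slide writes loop bounds as 1 to N), the loop nesting order is a guess, and the function name `fc_layer` is illustrative.

```python
def fc_layer(inputs, weights, B, IC, OC):
    """Fully-connected layer over flat lists, using per-loop address strides.

    inputs  holds B x IC values, weights holds OC x IC values,
    and the result holds B x OC values, all row-major.
    """
    outputs = [0] * (B * OC)
    for i in range(B):            # batch
        for j in range(OC):       # output channel
            for k in range(IC):   # input channel
                in_addr = k * 1 + j * 0 + i * IC
                w_addr = k * 1 + j * IC + i * 0
                out_addr = k * 0 + j * 1 + i * OC
                outputs[out_addr] += inputs[in_addr] * weights[w_addr]
    return outputs


inputs = [1, 2, 3,
          4, 5, 6]      # B = 2, IC = 3
weights = [1, 0, 1,
           0, 1, 0]     # OC = 2, IC = 3
print(fc_layer(inputs, weights, B=2, IC=3, OC=2))  # [4, 2, 10, 5]
```

Each address is a linear combination of the loop indices with a fixed stride per loop, so a program only has to supply three strides per buffer rather than an explicit address for every access.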
![Page 16: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/16.jpg)
ISA: Flexible Storage
The ISA changes the semantics of off-chip and on-chip memory accesses according to the bitwidth of the operands
2-bit mode: 16x parallelism. Need: 32-bit inputs, 32-bit weights
8-bit mode: 1x parallelism. Need: 8-bit input, 8-bit weight
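The access widths on this slide are the operand bitwidth multiplied by the parallelism, and the narrow operands can be pictured as fields packed into one wider word. The sketch below illustrates that arithmetic; the little-endian field order and the helper names are assumptions, not the hardware layout.

```python
def bits_per_access(bitwidth: int, parallelism: int) -> int:
    """Bits fetched per access when `parallelism` operands are consumed at once."""
    return bitwidth * parallelism


def pack(values, bitwidth: int) -> int:
    """Pack small unsigned fields into one word, lowest field first (assumed order)."""
    mask = (1 << bitwidth) - 1
    word = 0
    for position, value in enumerate(values):
        word |= (value & mask) << (position * bitwidth)
    return word


def unpack(word: int, bitwidth: int, count: int):
    """Recover `count` fields of `bitwidth` bits from a packed word."""
    mask = (1 << bitwidth) - 1
    return [(word >> (position * bitwidth)) & mask for position in range(count)]


print(bits_per_access(2, 16), bits_per_access(8, 1))  # 32 8

values = [i % 4 for i in range(16)]   # sixteen 2-bit operands
word = pack(values, 2)
print(unpack(word, 2, 16) == values)  # True
```

Sixteen 2-bit operands fill exactly one 32-bit word, which is why the 2-bit mode needs 32-bit inputs and 32-bit weights per access.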
![Page 17: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/17.jpg)
ISA: Flexible Storage (Software View)
Software views the buffers as having a flexible aspect ratio
[Figure: the WBUF drawn at three widths: 32-bit, 16-bit with two Registers, and 8-bit with one Register.]
![Page 18: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/18.jpg)
Benchmarked Platforms
GPU: Nvidia Titan-X (High Performance), Nvidia Tegra TX2 (Low Power)
ASIC: Stripes (MICRO'16, Bit-Serial), Eyeriss (ISCA'16, Optimized Dataflow)
![Page 19: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/19.jpg)
Benchmarked DNN Models

| DNN | Type | Multiply-Adds | Bit-Flexible Model Weights | Original Model Weights |
| --- | --- | --- | --- | --- |
| AlexNet | CNN | 2,678 MOps | 116.3 MBytes | 898.6 MBytes |
| CIFAR10 | CNN | 617 MOps | 3.3 MBytes | 53.5 MBytes |
| LSTM | RNN | 13 MOps | 6.2 MBytes | 49.4 MBytes |
| LeNet-5 | CNN | 16 MOps | 0.5 MBytes | 8.2 MBytes |
| RESNET-18 | CNN | 4,269 MOps | 13 MBytes | 103.7 MBytes |
| RNN | RNN | 17 MOps | 8.0 MBytes | 64.0 MBytes |
| SVHN | CNN | 158 MOps | 0.8 MBytes | 24.4 MBytes |
| VGG-7 | CNN | 317 MOps | 2.7 MBytes | 43.3 MBytes |
![Page 20: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/20.jpg)
Comparison with Eyeriss
3.9× speedup and 5.1× energy reduction over Eyeriss
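[Figure: bar chart of improvement over Eyeriss (axis 0x to 12x) for AlexNet, Cifar-10, LSTM, LeNet-5, ResNet-18, RNN, SVHN, VGG-7, and geomean. Series: Performance, Energy Reduction.]
Bar labels as extracted (order not matched to bars): 5.1, 9.9, 10.0, 5.1, 1.9, 4.3, 4.8, 14.0, 1.5, 3.9, 7.7, 8.6, 2.7, 1.9, 2.7, 2.4, 13.0, 1.9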
![Page 21: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/21.jpg)
Comparison with Stripes
2.6× speedup and 3.9× energy reduction over Stripes
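[Figure: bar chart of improvement over Stripes (axis 0× to 8×) for AlexNet, Cifar-10, LSTM, LeNet-5, ResNet-18, RNN, SVHN, VGG-7, and geomean. Series: Performance, Energy Reduction.]
Bar labels as extracted (order not matched to bars): 3.9x, 4.4x, 2.7x, 3.0x, 4.4x, 7.8x, 3.1x, 6.0x, 2.7x, 2.6x, 2.9x, 1.8x, 2.0x, 2.6x, 5.2x, 2.1x, 4.0x, 1.8x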
![Page 22: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/22.jpg)
Comparison with GPUs
Bit Fusion provides almost the same performance as Titan Xp (250 W) with only 895 mW
[Figure: bar chart of speedup over TX2 (axis ticks 0× to 30×) for AlexNet, Cifar-10, LSTM, LeNet-5, ResNet-18, RNN, SVHN, VGG-7, and geomean. Series: TitanX-INT8, Bit Fusion.]
Bar labels as extracted (order not matched to bars): 16x, 48x, 14x, 39x, 5x, 11x, 38x, 34x, 3x, 19x, 30x, 21x, 7x, 31x, 27x, 7x, 29x, 23x
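To put the two power figures on this slide side by side, the ratio can be computed directly. This is only arithmetic on the quoted numbers; it assumes the 250 W and 895 mW figures are directly comparable, which the slide does not spell out.

```python
titan_xp_watts = 250.0     # from the slide
bit_fusion_watts = 895e-3  # 895 mW, from the slide

power_ratio = titan_xp_watts / bit_fusion_watts
print(f"Titan Xp draws about {power_ratio:.0f}x the power of Bit Fusion")
```

The ratio comes to roughly 279x for near-equal performance, under the comparability assumption above.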
![Page 23: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/23.jpg)
Conclusion
Emerging research shows we can reduce bitwidths for DNNs without losing accuracy
Bit Fusion defines a new dimension of bit-level dynamic composability to leverage this opportunity
The BitFusion ISA exposes this capability to the software stack
![Page 24: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC](https://reader036.vdocuments.mx/reader036/viewer/2022062605/5fdc9ea5af9ea2611f3a4c4e/html5/thumbnails/24.jpg)