bit fusion - welcome to iscaconf.org · oc fully-connected layer loop: for j in (1 oc) loop: for k...

Bit Fusion: Bit-Level Dynamically Composable Architecture for Deep Neural Networks

Hardik Sharma, Jongse Park, Naveen Suda†, Liangzhen Lai†, Benson Chau, Vikas Chandra†, Hadi Esmaeilzadeh‡

Alternative Computing Technologies (ACT) Lab

Georgia Institute of Technology

†Arm, Inc.

‡University of California, San Diego

DNNs Tolerate Low-Bitwidth Operations

More than 99.4% of multiply-adds require fewer than 8 bits

[Figure: share of multiply-adds (0% to 100%) at each bitwidth (1bit/1bit, 2bit/2bit, 4bit/4bit, 8bit/1bit, 8bit/8bit) for AlexNet, CIFAR10, LSTM, LeNet-5, RESNET-18, RNN, SVHN, VGG-7, and their average]

Bitwidth Flexibility is Necessary for Accuracy

A fixed-bitwidth accelerator would either achieve limited benefits (8-bit) or compromise on accuracy (<8-bit)

[Figure: per-layer bitwidths of two quantized networks]

AlexNet, ImageNet dataset (Mishra et al., WRPN, arXiv 2017): Conv 8b/8b, Conv 4b/4b, Conv 4b/4b, Conv 4b/4b, Conv 4b/4b, FC 4b/4b, FC 4b/4b, FC 8b/8b

LeNet, MNIST dataset (Li et al., TWN, arXiv 2016): Conv 2b/2b, Conv 2b/2b, FC 2b/2b, FC 2b/2b

Our Approach: Bit-level Composability

BitBricks (BBs) are bit-level composable compute units

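[Figure: a BitBrick (BB) with inputs sx, x1, x0 and sy, y1, y0 (3 bits each, where sx and sy are the sign-mode bits) and a 6-bit output; a Fusion Unit built from 16 BitBricks and adders, with a weight buffer (WBUF), input forwarding, and partial-sum (Psum) forwarding]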

Compute units (BitBricks) logically fuse at runtime to form Fused-PEs (F-PEs) that dynamically match the bitwidth of the DNN layers

[Figure: four configurations of a Fusion Unit, each with a WBUF, input forwarding, and Psum forwarding]

(a) Fusion Unit with 16 BitBricks

(b) 16x parallelism, binary (1-bit) or ternary (2-bit)

(c) 4x parallelism, mixed bitwidth (2-bit weights, 8-bit inputs)

(d) No parallelism, 8-bit
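The parallelism figures quoted for the Fusion Unit configurations follow from how many BitBricks one F-PE consumes. The sketch below is a minimal illustration, assuming each operand is cut into 2-bit slices and that an F-PE needs one BitBrick per pair of slices; the function name `fused_pes_per_unit` is hypothetical and not part of the Bit Fusion design.

```python
import math

BITBRICKS_PER_FUSION_UNIT = 16  # from the slides: a Fusion Unit holds 16 BitBricks
BITBRICK_OPERAND_BITS = 2       # assumption: each BitBrick multiplies 2-bit operands


def fused_pes_per_unit(input_bits: int, weight_bits: int) -> int:
    """Number of F-PEs one Fusion Unit can form (bitwidths of 1 to 8 bits)."""
    # A 1-bit (binary) operand still occupies one 2-bit BitBrick port.
    input_slices = math.ceil(input_bits / BITBRICK_OPERAND_BITS)
    weight_slices = math.ceil(weight_bits / BITBRICK_OPERAND_BITS)
    bricks_per_fpe = input_slices * weight_slices
    return BITBRICKS_PER_FUSION_UNIT // bricks_per_fpe


for i_bits, w_bits in [(1, 1), (2, 2), (8, 2), (4, 4), (8, 4), (8, 8)]:
    n = fused_pes_per_unit(i_bits, w_bits)
    print(f"{i_bits}-bit inputs, {w_bits}-bit weights: {n}x parallelism")
```

Under these assumptions the function reproduces the numbers on the slides: 16x for binary/ternary, 4x for 2-bit weights with 8-bit inputs, and no parallelism (1x) for 8-bit/8-bit.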

Config #1 : Binary/Ternary Mode

Each BitBrick performs a binary/ternary multiplication

16x parallelism

[Figure: Fusion Unit of 16 BitBricks in which every BitBrick is its own F-PE, each multiplying a 2-bit input by a 2-bit weight]

Config #2: 4-bit Mode

Four BitBricks fuse to form a Fused-PE (F-PE)

4x parallelism

[Figure: Fusion Unit of 16 BitBricks grouped into F-PEs of four BitBricks; each F-PE splits a 4-bit input and a 4-bit weight into 2-bit halves and combines the resulting 2-bit × 2-bit partial products]
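The fusion of BitBricks into an F-PE can be read as ordinary long multiplication in base 4: each operand is split into 2-bit slices, every pair of slices goes to one BitBrick, and the partial products are shifted and added. The sketch below is one way to express that arithmetic; it assumes two's-complement operands in which only the most significant slice is treated as signed, and the helper names `split_into_slices` and `fused_multiply` are hypothetical.

```python
def split_into_slices(value: int, bits: int) -> list[tuple[int, int]]:
    """Split a signed two's-complement `bits`-wide value into 2-bit slices.

    Returns (slice_value, shift) pairs. Only the most significant slice is
    interpreted as signed (-2..1); the others are unsigned (0..3).
    """
    assert bits % 2 == 0 and -(1 << (bits - 1)) <= value < (1 << (bits - 1))
    raw = value & ((1 << bits) - 1)  # two's-complement bit pattern
    slices = []
    for shift in range(0, bits, 2):
        s = (raw >> shift) & 0b11
        if shift == bits - 2 and s >= 2:  # top slice carries the sign
            s -= 4
        slices.append((s, shift))
    return slices


def fused_multiply(x: int, w: int, x_bits: int, w_bits: int) -> tuple[int, int]:
    """Multiply via 2-bit x 2-bit partial products; returns (product, bricks_used)."""
    product = 0
    bricks = 0
    for xs, x_shift in split_into_slices(x, x_bits):
        for ws, w_shift in split_into_slices(w, w_bits):
            # One BitBrick multiplies the slices; the shift-add tree places the result.
            product += (xs * ws) << (x_shift + w_shift)
            bricks += 1
    return product, bricks


print(fused_multiply(-3, 5, 4, 4))    # (-15, 4): 4-bit mode uses four BitBricks
print(fused_multiply(-100, 7, 8, 4))  # (-700, 8): 8-bit/4-bit mode uses eight
```

The brick counts match the configurations on the slides: four BitBricks per F-PE for 4-bit/4-bit and eight for the 8-bit/4-bit mixed mode.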

Config #3 : 8-bit, 4-bit (Mixed-Mode)

Eight BitBricks fuse to form a Fused-PE (F-PE)

2x parallelism

[Figure: Fusion Unit of 16 BitBricks grouped into F-PEs of eight BitBricks; each F-PE splits an 8-bit input into four 2-bit slices and a 4-bit weight into two 2-bit slices and combines the partial products]

Spatial Fusion vs. Temporal Design

Temporal Design (Bit Serial): Combine results over time

[Figure: temporal design with four units, each with its own shifter (<<) and output, fed operand pairs (a0 b0 to a3 b3, c0 d0 to c3 d3, e0 f0 to e3 f3, g0 h0 to g3 h3) as inputs over time]

Spatial Fusion (Bit Parallel): Combine results over space

[Figure: spatial design with a single output; operand pairs a0 b0 through g3 h3 arrive as inputs over time and feed a shared shift (<<) and add network]
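The difference between combining results over time and over space can be illustrated with a small model. The sketch below is an assumption-laden toy, not a model of any specific accelerator: it uses unsigned operands, a 2-bit multiplier as the basic unit, and hypothetical function names. The temporal version reuses one multiplier and accumulates into a register over several cycles; the spatial version produces all partial products side by side and sums them in one pass.

```python
def slices(value: int, bits: int) -> list[int]:
    """Unsigned 2-bit slices of `value`, least significant first."""
    return [(value >> s) & 0b11 for s in range(0, bits, 2)]


def temporal_multiply(x: int, w: int, bits: int) -> tuple[int, int]:
    """One 2-bit multiplier reused over time; returns (product, cycles)."""
    acc = 0     # accumulator register that holds the running result
    cycles = 0
    for i, xs in enumerate(slices(x, bits)):
        for j, ws in enumerate(slices(w, bits)):
            acc += (xs * ws) << (2 * (i + j))  # shift-add into the register
            cycles += 1
    return acc, cycles


def spatial_multiply(x: int, w: int, bits: int) -> tuple[int, int]:
    """All 2-bit multipliers work side by side; returns (product, cycles)."""
    partials = [
        (xs * ws) << (2 * (i + j))
        for i, xs in enumerate(slices(x, bits))
        for j, ws in enumerate(slices(w, bits))
    ]
    return sum(partials), 1  # a single pass through the shift-add tree


print(temporal_multiply(11, 13, 4))  # (143, 4)
print(spatial_multiply(11, 13, 4))   # (143, 1)
```

Both functions return the same product; in this toy model the temporal one needs the accumulator register and more cycles, while the spatial one needs more multipliers at once.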

Spatial Fusion Surpasses Temporal Design

Area (um^2)    BitBricks   Shift-Add   Register   Total Area
Temporal       463         2989        1454       4905
Fusion Unit    369         934         91         1394

3.5x lower area

Power (nW)     BitBricks   Shift-Add   Register   Total Power
Temporal       60          550         1103       1712
Fusion Unit    46          424         69         538

3.2x lower power

Synthesized using a commercial 45 nm technology
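The two headline ratios can be checked directly against the totals in the synthesis results. The symbols $A$ and $P$ below are introduced here only for this check and do not appear on the slides.

```latex
\[
\frac{A_{\text{temporal}}}{A_{\text{fusion}}} = \frac{4905}{1394} \approx 3.5,
\qquad
\frac{P_{\text{temporal}}}{P_{\text{fusion}}} = \frac{1712}{538} \approx 3.2
\]
```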

Bit Fusion Systolic Array Architecture

[Figure: 2 × 2 array of Fusion Units (16 BitBricks each) with a control block, two shared input buffers (IBUF), one weight buffer (WBUF) per Fusion Unit, and two output paths, each with an adder, an output buffer (OBUF), a pooling unit, and an activation unit]

Programmability: BitFusion ISA

Requirements

Amortize cost of bit-level fusion

Enable flexible Data-Path

Concise


ISA: Amortize the Cost of Bit-Level Fusion

Use a block-structured ISA for groups of operations (layers)

Convolution 8-bit/8-bit

Conv 1

Block begin: 8-bit/8-bit

Block end: next block

Block begin: 4-bit/1-bit

Block end: next block
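A block-structured instruction stream can be pictured as a list of blocks, each carrying one bitwidth setting that applies to every operation inside it. The sketch below is a hypothetical illustration of that idea, not the BitFusion ISA encoding: the `Block` class, its field names, the reading of "8-bit/8-bit" as input/weight bitwidths, and the operation counts are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical block: one layer's operations at a single bitwidth setting."""
    name: str
    input_bits: int    # assumption: first number in "8-bit/8-bit"
    weight_bits: int   # assumption: second number in "8-bit/8-bit"
    num_operations: int


def run(blocks: list[Block]) -> tuple[int, int]:
    """Walk the blocks in order; return (fusion configurations, operations run)."""
    configurations = 0
    operations = 0
    for block in blocks:
        # Block begin: the fusion configuration is set once for the whole block.
        configurations += 1
        # Every operation in the block reuses that configuration.
        operations += block.num_operations
        # Block end: control passes to the next block.
    return configurations, operations


program = [
    Block("conv1", input_bits=8, weight_bits=8, num_operations=1000),
    Block("conv2", input_bits=4, weight_bits=1, num_operations=1000),
]
print(run(program))  # (2, 2000)
```

In this toy program the configuration changes twice for two thousand operations, which is the amortization the block structure is meant to provide.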

ISA: Concise Expression for DNNs

Use loop instructions as DNNs consist of large number of repeated operations

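[Figure: fully-connected layer as a matrix multiplication of a B × IC input by an IC × OC weight matrix, producing a B × OC output]

Fully-Connected Layer

loop: for j in (1 OC)

loop: for k in (1 IC)

loop: for i in (1 B)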

ISA: Concise Expression for DNNs

DNNs have regular memory access patterns

Use loop indices to generate memory accesses

Fully-Connected Layer

loop: for j in (1 OC)

loop: for k in (1 IC)

input: k × 1 + j × 0 + i × IC

weight: k × 1 + j × IC + i × 0

output: k × 0 + j × 1 + i × OC

loop: for i in (1 B)
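The address expressions for the fully-connected layer can be exercised in software. The sketch below is a minimal illustration that assumes flat, row-major arrays and 0-based loop indices (the slide writes the loop ranges as `(1 OC)`, `(1 IC)`, `(1 B)`); the function name and the example data are made up for this purpose.

```python
def fully_connected(inputs, weights, B, IC, OC):
    """Flat-array FC layer whose addresses come only from loop indices."""
    outputs = [0] * (B * OC)
    for i in range(B):           # loop over the batch
        for j in range(OC):      # loop over output channels
            for k in range(IC):  # loop over input channels
                in_addr = k * 1 + j * 0 + i * IC
                w_addr = k * 1 + j * IC + i * 0
                out_addr = k * 0 + j * 1 + i * OC
                outputs[out_addr] += inputs[in_addr] * weights[w_addr]
    return outputs


# Batch of 2 inputs with 3 channels each, 2 output channels.
inputs = [1, 2, 3,
          4, 5, 6]
weights = [1, 0, 1,   # weights of output channel 0
           0, 1, 0]   # weights of output channel 1
print(fully_connected(inputs, weights, B=2, IC=3, OC=2))  # [4, 2, 10, 5]
```

Each address is a sum of loop index times a fixed stride, so three stride values per operand are enough to describe the whole access pattern of the layer.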


ISA: Flexible Storage

ISA changes the semantics of off-chip and on-chip memory accesses according to bitwidth of operands

2-bit mode

16x parallelism

Need: 32-bit inputs, 32-bit weights

8-bit mode

1x parallelism

Need: 8-bit input, 8-bit weight
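The storage requirement follows from the parallelism: 16 operands of 2 bits fill one 32-bit word, while 8-bit mode needs a single 8-bit operand per access. The sketch below illustrates that packing; the `pack` and `unpack` helpers and the choice to place the first operand in the lowest bits are assumptions made for illustration, not the layout used by Bit Fusion.

```python
def pack(values: list[int], bits: int) -> int:
    """Pack unsigned `bits`-wide values into one word, first value lowest."""
    word = 0
    for index, value in enumerate(values):
        assert 0 <= value < (1 << bits)
        word |= value << (index * bits)
    return word


def unpack(word: int, bits: int, count: int) -> list[int]:
    """Inverse of pack: read `count` fields of `bits` bits each."""
    mask = (1 << bits) - 1
    return [(word >> (index * bits)) & mask for index in range(count)]


# 2-bit mode: 16 operands fill one 32-bit word.
two_bit_operands = [i % 4 for i in range(16)]
word = pack(two_bit_operands, bits=2)
assert word < (1 << 32)
assert unpack(word, bits=2, count=16) == two_bit_operands

# 8-bit mode: one access carries a single 8-bit operand.
assert unpack(pack([200], bits=8), bits=8, count=1) == [200]
```

The same stored bits are read as many narrow operands or as fewer wide ones, depending on the bitwidth in effect.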

ISA: Flexible Storage (Software View)

Software views the buffers as having a flexible aspect ratio

[Figure: the WBUF drawn at three aspect ratios, supplying a 32-bit register, a 16-bit register, or an 8-bit register]

Benchmarked Platforms

Class   Platform             Characteristic
GPU     Nvidia Titan-X       High performance
GPU     Nvidia Tegra TX2     Low power
ASIC    Stripes (Micro’16)   Bit-serial
ASIC    Eyeriss (ISCA’16)    Optimized dataflow

Benchmarked DNN Models

DNN         Type   Multiply-Adds   Bit-Flexible Model Weights   Original Model Weights
AlexNet     CNN    2,678 MOps      116.3 MBytes                 898.6 MBytes
CIFAR10     CNN    617 MOps        3.3 MBytes                   53.5 MBytes
LSTM        RNN    13 MOps         6.2 MBytes                   49.4 MBytes
LeNet-5     CNN    16 MOps         0.5 MBytes                   8.2 MBytes
RESNET-18   CNN    4,269 MOps      13 MBytes                    103.7 MBytes
RNN         RNN    17 MOps         8.0 MBytes                   64.0 MBytes
SVHN        CNN    158 MOps        0.8 MBytes                   24.4 MBytes
VGG-7       CNN    317 MOps        2.7 MBytes                   43.3 MBytes

Comparison with Eyeriss

3.9× speedup and 5.1× energy reduction over Eyeriss

[Figure: performance and energy-reduction improvement over Eyeriss (y-axis 0x to 12x) for AlexNet, Cifar-10, LSTM, LeNet-5, ResNet-18, RNN, SVHN, VGG-7, and the geomean]

Comparison with Stripes

2.6× speedup and 3.9× energy reduction over Stripes

[Figure: performance and energy-reduction improvement over Stripes (y-axis 0× to 8×) for AlexNet, Cifar-10, LSTM, LeNet-5, ResNet-18, RNN, SVHN, VGG-7, and the geomean]

Comparison with GPUs

Bit Fusion provides almost the same performance as Titan Xp (250 W) with only 895 mW

[Figure: speedup over TX2 (y-axis 0× to 30×) of TitanX-INT8 and Bit Fusion for AlexNet, Cifar-10, LSTM, LeNet-5, ResNet-18, RNN, SVHN, VGG-7, and the geomean]

Conclusion

Emerging research shows we can reduce bitwidths for DNNs without losing accuracy

Bit Fusion defines a new dimension of bit-level dynamic composability to leverage this opportunity

The BitFusion ISA exposes this capability to the software stack
