
Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks

Charles Eckert Xiaowei Wang Jingcheng Wang Arun Subramaniyan

Ravi Iyer Dennis Sylvester David Blaauw Reetuparna Das

M-Bits Research Group

Can we transform CPU into a neural accelerator?

CPU with Neural Cache, compared against GPU:

++ Parallelism

-- Data Movement

Transforming caches into massively parallel vector ALUs

18-core Xeon processor, 45 MB LLC

[Figure: LLC organization, built up over several slides. The LLC has 18 slices of 2.5 MB each. A slice holds Way 1 to Way 20 plus the CBOX and the TMU. The ways are made of 32 kB data banks, and the banks of 8 kB SRAM arrays. An 8 kB SRAM array has row decoders driving wordlines (WL) 0 to 255 and bitline pairs (BL/BLB) 0 to 255, with data laid out as Bit-Slice 3 to Bit-Slice 0.]

18 LLC slices, 360 ways, 5760 arrays, 1,474,560 ALUs

[Figure: bitline ALU. Operands in Array A and Array B share the same bitlines. Sense amplifiers (SA) on BL and BLB, referenced to Vref, produce A&B and ~A & ~B, from which A^B is formed. With the latched carry (Cin, Cout, enable C_EN) the logic gives S = A^B^C, and the driver (DR) writes A + B back into the array.]

Passive Last Level Cache transformed into ∼ 1 million bit-serial active ALUs

✓ Multiply ✓ Divide ✓ Add

Bit-serial operation @2.5 GHz


Configurable Precision
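The counts on the hierarchy slide (360 ways, 5760 arrays, 1,474,560 ALUs) can be checked with a few lines of arithmetic. This is a minimal sketch, assuming 1 MB = 2^20 bytes, 256 bitlines per 8 kB array (the figure labels bitlines 0 to 255), and one bit-serial ALU per bitline. The constant and function names are illustrative only.

```python
# Hierarchy figures quoted on the slides.
LLC_SLICES = 18
WAYS_PER_SLICE = 20
SLICE_BYTES = int(2.5 * 1024 * 1024)   # 2.5 MB LLC slice
ARRAY_BYTES = 8 * 1024                 # 8 kB SRAM array
BITLINES_PER_ARRAY = 256               # assumption: one bit-serial ALU per bitline


def llc_compute_resources():
    """Return (ways, arrays, ALUs) for the whole last-level cache."""
    ways = LLC_SLICES * WAYS_PER_SLICE
    arrays_per_slice = SLICE_BYTES // ARRAY_BYTES   # 320 arrays in a slice
    arrays = LLC_SLICES * arrays_per_slice
    alus = arrays * BITLINES_PER_ARRAY
    return ways, arrays, alus


print(llc_compute_resources())
```

Under these assumptions the function reproduces the three numbers on the slide, which also implies 16 arrays per way (5760 / 360).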

Why bit-serial?

Bit-parallel arithmetic

[Figure: Array A and Array B each store Word 3 to Word 0 side by side along a wordline. Computing A + B activates WL1 and WL2 together; the sum bits (S) appear on neighbouring bitlines and the carry (C) has to be handed from one bitline to the next.]

Carry propagation across bitlines

! High complexity

! Loss of throughput and efficiency

Bit-serial arithmetic

[Figure: Transposed data. Word 3 to Word 0 of Array A and Array B are each stored along their own bitline. Every bitline has a Sum output and a Carry latch. The carries start at 0 0 0 0, and Cycle 1 to Cycle 4 process one bit position per cycle on all bitlines at once, activating WL1 and WL2 in each cycle.]

✓ Low area complexity

✓ High throughput

✓ Configurable & High precision
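The bit-serial scheme above can be imitated in software. The sketch below is not the hardware design. It assumes each row of the transposed array is a Python integer whose bit i belongs to bitline (lane) i, so one bitwise operation stands for all bitlines working in parallel. The helper names are hypothetical.

```python
def transpose_to_rows(words, n_bits):
    """Row k holds bit k of every word; bit i of a row belongs to lane i."""
    rows = []
    for k in range(n_bits):
        row = 0
        for lane, w in enumerate(words):
            row |= ((w >> k) & 1) << lane
        rows.append(row)
    return rows


def rows_to_words(rows, n_lanes):
    """Inverse of transpose_to_rows: rebuild one integer per lane."""
    return [sum(((row >> lane) & 1) << k for k, row in enumerate(rows))
            for lane in range(n_lanes)]


def bit_serial_add(a_rows, b_rows):
    """Add every lane at once, one bit position per cycle, LSB row first."""
    carry = 0                                  # per-lane carry latches start at 0
    out_rows = []
    for a, b in zip(a_rows, b_rows):
        out_rows.append(a ^ b ^ carry)         # sum bit for every lane
        carry = (a & b) | ((a ^ b) & carry)    # carry never leaves its lane
    out_rows.append(carry)                     # last cycle stores the carry-out row
    return out_rows


a = transpose_to_rows([3, 1, 2, 0], n_bits=2)
b = transpose_to_rows([1, 3, 2, 3], n_bits=2)
print(rows_to_words(bit_serial_add(a, b), n_lanes=4))
```

The loop runs once per bit and then writes the carry row, which is consistent with the N+1 cycle count the slides give for ADD. The carry is only ever combined with bits of the same lane, which is the point of the transposed layout.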

Outline

• Motivation

• Bit-Serial Arithmetic

• Transpose

• Mapping of Convolution to Array

• Methodology

• Results

In-SRAM Arithmetic


Logical Operations In-SRAM

[Figure: a conventional SRAM array with one row decoder, wordlines, bitline pairs BLB0/BL0 to BLBn/BLn and differential sense amplifiers (SA), next to the modified array.]

Changes

• Additional row decoder

• Reconfigurable sense amplifiers: the differential sense amplifiers can operate as single-ended sense amplifiers, each referenced to Vref

[Figure: the two row decoders activate the wordlines of operands A and B at the same time. On each bitline pair, the single-ended sense amplifier on BL (against Vref) reads A AND B, and the one on BLB reads A NOR B. The slide animates this with example bit values.]
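A behavioural model of this sensing step may help. The sketch below assumes the usual explanation for precharged bitlines: BL stays high only if every activated cell on it stores 1, and BLB stays high only if every activated cell stores 0. It models logic values only, not voltages or timing. The function names are hypothetical.

```python
def sense_two_rows(row_a, row_b, n_bitlines):
    """Activate two wordlines together; return what the BL and BLB amplifiers see."""
    mask = (1 << n_bitlines) - 1
    bl = row_a & row_b                 # BL stays high only where both cells hold 1
    blb = (~row_a & ~row_b) & mask     # BLB stays high only where both cells hold 0
    return bl, blb                     # (A AND B, A NOR B) for every bitline


def xor_from_sense(bl, blb, n_bitlines):
    """A XOR B is true exactly where neither AND nor NOR is true."""
    mask = (1 << n_bitlines) - 1
    return ~(bl | blb) & mask


bl, blb = sense_two_rows(0b0011, 0b0101, n_bitlines=4)
print(format(bl, "04b"), format(blb, "04b"),
      format(xor_from_sense(bl, blb, 4), "04b"))
```

Combining the two sensed values is what lets the bitline logic form A^B without an extra array access.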


Addition In-SRAM

[Figure: operand A (bits A0, A1), operand B (bits B0, B1) and the result P (bits P0, P1, P2) are stored transposed along the same bitlines, across 256 bitlines. Each bitline has a Sum output and a Carry latch.]

Bitline ALU: the sense amplifiers on BL and BLB (referenced to Vref) give A&B and ~A & ~B, from which A^B is formed. S = A^B^C is written back through the driver (DR), and Cout is latched (C_EN) to become Cin of the next cycle.

Addition [Cycle 1]: A0 and B0 are read and the sum bit is written to P0; the carry is latched.

Addition [Cycle 2]: A1 and B1 are read with the latched carry and the sum bit is written to P1.

Addition [Cycle 3]: the remaining carry is written to P2.
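The per-bitline adder can be sketched as a small state machine. This is an illustration under the assumption that the only inputs are the two sensed values (A&B and ~A & ~B) and the latched carry, as the signal names on the slide suggest. The class and method names are not from the source.

```python
class BitlineALU:
    """One-bitline adder slice: sense outputs in, sum bit out, carry latched."""

    def __init__(self):
        self.carry = 0                        # carry latch, cleared before an add

    def step(self, a_bit, b_bit):
        and_ab = a_bit & b_bit                # value sensed on BL
        nor_ab = (1 - a_bit) & (1 - b_bit)    # value sensed on BLB
        xor_ab = 1 - (and_ab | nor_ab)        # A^B derived from the two
        s = xor_ab ^ self.carry               # S = A^B^C
        self.carry = and_ab | (xor_ab & self.carry)   # Cout becomes next Cin
        return s


def add_on_bitline(a_bits, b_bits):
    """Operands are LSB-first bit lists; the result has one extra carry bit."""
    alu = BitlineALU()
    p = [alu.step(a, b) for a, b in zip(a_bits, b_bits)]
    p.append(alu.carry)                       # final cycle writes the carry row
    return p


print(add_on_bitline([1, 1], [1, 0]))   # 3 + 1, LSB first
```

For two 2-bit operands the sketch takes two `step` calls plus the carry write, matching the three cycles shown on the slides.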

Multiplication In-SRAM

[Figure: operand A (bits A0, A1), operand B (bits B0, B1) and the product P (bits P0 to P3) are stored transposed along the same bitlines. Each bitline has a Carry latch, a Sum output and a Tag latch.]

          A1    A0
     X    B1    B0
     ---------------
          A1B0  A0B0
    A1B1  A0B1
     ---------------
     P2    P1    P0

The slides step through the product one partial result at a time, starting at Multiplication [Cycle 1]:

P0 <- A0B0

P1 <- A1B0

P1 <- A1B0 + A0B1, computed as P1 <- P1 + A0B1:
    If(B1), P1 <- P1 + A0
    Else, P1 <- P1

P2 <- A1B1

P3 <- Cin
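The predicated update "If(B1), P1 <- P1 + A0; Else, P1 <- P1" generalises to a shift-and-add loop. The sketch below is one way to write it, assuming rows are Python integers with one bit per lane and that the Tag latch holds the current multiplier bit of each lane. It makes no claim about the exact cycle schedule of the hardware, and the helper names are hypothetical.

```python
def to_rows(words, n_bits):
    """Transpose: row k collects bit k of every word, lane i in bit i."""
    return [sum(((w >> k) & 1) << i for i, w in enumerate(words))
            for k in range(n_bits)]


def from_rows(rows, n_lanes):
    return [sum(((row >> lane) & 1) << k for k, row in enumerate(rows))
            for lane in range(n_lanes)]


def bit_serial_multiply(a_rows, b_rows):
    """Shift-and-add product of transposed operands; rows are LSB first."""
    n = len(a_rows)
    p_rows = [0] * (2 * n)                    # product rows P0 .. P(2N-1)
    for j, tag in enumerate(b_rows):          # Tag <- bit j of every multiplier
        carry = 0
        for i, a in enumerate(a_rows):
            addend = a & tag                  # lanes whose tag is 0 add nothing
            p = p_rows[i + j]
            p_rows[i + j] = p ^ addend ^ carry
            carry = (p & addend) | ((p ^ addend) & carry)
        p_rows[j + n] = carry                 # carry-out lands in a still-empty row
    return p_rows


a = to_rows([3, 2, 1, 3], n_bits=2)
b = to_rows([3, 1, 0, 2], n_bits=2)
print(from_rows(bit_serial_multiply(a, b), n_lanes=4))
```

Masking the addend with the tag is the software analogue of the predicated write: every lane executes the same sequence, and lanes with a zero multiplier bit simply leave their partial product unchanged.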

Supported Arithmetic

Operation     Cycles (N-bit operands)
ADD           N + 1
SUB           2N + 1
MUL           N^2 + 5N - 2
DIV           1.5N^2 + 5.5N
Comparison    2N + 1

Synthesized array: 7.5% area overhead. Processor chip: 2% area overhead.
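To get a feel for the cycle counts in the table, they can be evaluated for a given precision. The sketch below encodes the formulas as listed and, as an assumption, converts them to time at the 2.5 GHz bit-serial clock quoted earlier in the talk, taking one bit-serial step per clock. The function names are illustrative.

```python
def cycles(op, n):
    """Cycle count for an n-bit operation, using the formulas from the table."""
    table = {
        "ADD": n + 1,
        "SUB": 2 * n + 1,
        "MUL": n * n + 5 * n - 2,
        "DIV": 1.5 * n * n + 5.5 * n,
        "CMP": 2 * n + 1,
    }
    return table[op]


def latency_ns(op, n, clock_ghz=2.5):
    """Time for one operation, assuming one bit-serial step per clock."""
    return cycles(op, n) / clock_ghz


for op in ("ADD", "SUB", "MUL", "DIV", "CMP"):
    print(op, cycles(op, 8), round(latency_ns(op, 8), 1), "ns")
```

For 8-bit operands this gives 9, 17, 102, 140 and 17 cycles. Each figure is the latency of one operation on one bitline; throughput comes from running the same sequence on all bitlines at once.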


Transpose

[Figure: the Transpose Memory Unit (TMU) sits next to the CBOX in an LLC slice, alongside Way 1 to Way 20. It is an array of 8-T transpose bit-cells with a row decoder, a column decoder, control logic, and sense amplifiers (SA) and drivers (DR) along two edges. Words A0, A1, A2, ... and B0, B1, B2, ... are stored from MSB to LSB. The array can be accessed in one direction with a regular read/write and in the orthogonal direction with a transpose read/write.]

[Figure: data A2 A1 A0, B2 B1 B0, C2 C1 C0 passes through the TMU and comes out transposed, with A0, A1, A2 stacked in one column, B0, B1, B2 in the next and C0, C1, C2 in the third.]
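Functionally, the TMU turns a word-per-row layout into a bit-per-row layout. The sketch below shows only that data transformation, not the 8-T cell or its access ports. It assumes words are non-negative integers that fit in `n_bits`, and the function name is hypothetical.

```python
def transpose_bits(words, n_bits):
    """words[i] -> rows[k], where bit i of rows[k] is bit k of words[i]."""
    return [sum(((w >> k) & 1) << i for i, w in enumerate(words))
            for k in range(n_bits)]


words = [0b101, 0b011, 0b110]
rows = transpose_bits(words, n_bits=3)
print([format(r, "03b") for r in rows])

# Transposing again, with the two dimensions swapped, restores the words.
print(transpose_bits(rows, n_bits=len(words)) == words)
```

After the transpose, row k contains bit k of every word, which is the layout the bit-serial ALUs expect: one activated wordline delivers the same bit position of all operands.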


A Convolutional Layer

[Figure: Input Activations (C channels, each H x W) are convolved with M 3D filters. Each filter has C channels, and each channel has R x S weights. The result is the Output Activations (M channels, each E x F).]
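For reference, the layer in the figure can be written as a direct loop nest using the same symbols (C, H, W, M, R, S, E, F). This is a plain software definition, assuming stride 1 and no padding, so E = H - R + 1 and F = W - S + 1. The slides do not state the stride or padding.

```python
def conv_layer(inputs, filters):
    """inputs[c][h][w], filters[m][c][r][s] -> outputs[m][e][f]."""
    C, H, W = len(inputs), len(inputs[0]), len(inputs[0][0])
    M, R, S = len(filters), len(filters[0][0]), len(filters[0][0][0])
    E, F = H - R + 1, W - S + 1            # assumption: stride 1, no padding
    out = [[[0] * F for _ in range(E)] for _ in range(M)]
    for m in range(M):                      # one output channel per filter
        for e in range(E):
            for f in range(F):
                acc = 0
                for c in range(C):          # reduction over input channels
                    for r in range(R):
                        for s in range(S):  # R x S multiply-accumulates
                            acc += inputs[c][e + r][f + s] * filters[m][c][r][s]
                out[m][e][f] = acc
    return out


inputs = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],      # channel 0
          [[1, 1, 1], [1, 1, 1], [1, 1, 1]]]      # channel 1
filters = [[[[1, 1], [1, 1]], [[1, 1], [1, 1]]]]  # M = 1, C = 2, R = S = 2
print(conv_layer(inputs, filters))
```

The three innermost loops are the part that gets split up in the mapping that follows: the R x S products for one channel form a partial sum, and the loop over c is the reduction.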

Mapping CNN to Neural Cache

[Figure: the filter weights (M filters, C channels, R x S each) and the input activations (C channels, H x W) are unrolled. For each of the C channels, an R x S input window is combined with the matching R x S weights by MAC into a partial sum. A reduction (∑) over the C partial sums produces 1 output activation of the M x E x F output.]

[Figure: layout of an 8 kB SRAM array with 256 wordlines and 256 bitlines. Along each bitline the rows are divided into Input Activation (R x S x 8), Weights (R x S x 8), Partial Sum (4 x 8) and Output (4 x 8). For Filter 1 (C = 256), each bitline holds one channel: channel 1, channel 2, channel 3, channel 4, ..., channel 256.]

[Figure: a 2.5 MB LLC slice with Way 1, Way 2, Way 3, ..., Way 20, each way divided into Quad 1 to Quad 4. Labels: M = 32, Output Position 1, Output Position 2, ...]
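The per-bitline layout suggests a two-step computation for each output activation: independent multiply-accumulates on every bitline, then a reduction across bitlines. The sketch below is only a functional illustration. It assumes one channel per bitline and uses a simple halving tree for the reduction, which is one plausible scheme rather than the one confirmed by the slides. The function name is hypothetical.

```python
def output_activation(windows, weights):
    """windows[c] and weights[c] are flat R*S lists for channel c (one per bitline)."""
    # Step 1: every bitline performs its own R*S multiply-accumulates.
    partial = [sum(x * w for x, w in zip(win, wt))
               for win, wt in zip(windows, weights)]
    # Step 2: reduce across bitlines, halving the number of live lanes each round.
    rounds = 0
    while len(partial) > 1:
        if len(partial) % 2:
            partial.append(0)                 # pad odd lane counts with zero
        half = len(partial) // 2
        partial = [partial[i] + partial[i + half] for i in range(half)]
        rounds += 1
    return partial[0], rounds


windows = [[1, 2], [3, 4], [5, 6], [7, 8]]    # C = 4 channels, R*S = 2
weights = [[1, 1]] * 4
print(output_activation(windows, weights))
```

With a halving tree, C = 256 channels would need 8 reduction rounds (log2 of 256). The MAC step costs the same number of bit-serial cycles however many bitlines take part.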

Mapping of Convolution to Array

[Figure: the output activations (M channels, each E x F) are spread over Slice 1 to Slice 14. Inside each slice the ways are shown in two groups, Way 1 - 18 and Way 19 - 20.]

Put it together

[Figure: Core 1 to Core 14 and LLC Slice 1 to LLC Slice 14 sit on a ring interconnect connected to DRAM. A 2.5 MB LLC slice contains Way 1, Way 2, Way 3, ..., each divided into Quad 1 to Quad 4, plus Way 19 (Reserved). Filter weights, input activations and output activations are the data that move through the system.]

1. Filter Loading
2. Input Loading
3. MAC + Reduction
4. Output Transfer


Evaluation Methodology

Processor
    CPU (2 sockets): Intel Xeon E5-2597 v3, 2.6 GHz, 28 cores, 56 threads
    GPU (1 card):    Nvidia Titan Xp, 1.6 GHz, 3840 CUDA cores
    Neural Cache:    2.5 GHz ComputeSRAM, 1,032,192 bit-serial ALUs

On-chip memory
    CPU (2 sockets): 78.96 MB
    GPU (1 card):    9.14 MB
    Neural Cache:    70 MB (dual socket)

Off-chip memory
    CPU (2 sockets): 64 GB DRAM
    GPU (1 card):    12 GB DRAM
    Neural Cache:    64 GB DRAM

Profiler / Simulator (Performance)
    CPU (2 sockets): TensorFlow tfprof
    GPU (1 card):    TensorFlow tfprof
    Neural Cache:    Cycle-accurate simulator + C microbenchmark

Profiler / Simulator (Energy)
    CPU (2 sockets): Intel RAPL Interface
    GPU (1 card):    NVIDIA System Management Interface
    Neural Cache:    SPICE simulation + Intel RAPL Interface

DNN Models

- Inception V3

- 8-bit weights and inputs
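The 1,032,192 bit-serial ALUs in the table differ from the 1,474,560 quoted for the 18-slice part. One reading that reproduces the number is sketched below: 14 slices with 18 compute ways each, as drawn on the mapping slides, 16 arrays per way (5760 arrays / 360 ways) and 256 bitlines per array. This breakdown is an inference from the figures and is not stated in the source.

```python
def compute_alus(slices, compute_ways, arrays_per_way=16, bitlines_per_array=256):
    """Bit-serial ALUs available when only `compute_ways` ways per slice compute."""
    return slices * compute_ways * arrays_per_way * bitlines_per_array


print(compute_alus(slices=14, compute_ways=18))   # evaluated configuration (assumed)
print(compute_alus(slices=18, compute_ways=20))   # every way of the 18-slice LLC
```

If this reading is right, the two ways left out in each slice correspond to the Way 19 - 20 group that the mapping figure draws separately.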


Throughput

[Figure: throughput (inferences / sec, axis 0 to 700) against batch size (1, 4, 16, 64, 256) for CPU - Xeon E5, GPU - Titan Xp and Neural Cache.]

2.2x improved throughput over GPU

Latency

[Figure: latency (ms, axis 0 to 100) for CPU, GPU and Neural Cache.]

7.7x latency improvement over GPU

Power/Energy Comparison

[Figure: total energy (joules) and average power (watts) for CPU, GPU and Neural Cache.]

Neural Cache Summary

Repurpose Cache to Data Parallel DNN Accelerator

• Massively Parallel Bit-Serial In-SRAM Arithmetic

• Data Layout for CNNs

12x, 20x .. over server class CPU at 2% area overhead

2x, 16x .. over server class GPU
