Post on 09-Oct-2020


Page 1: Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks

Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, Reetuparna Das

M-Bits Research Group

Pages 2-3: Can we transform CPU into a neural accelerator?

[Figure: a CPU with its cache ($) beside a GPU; on the second slide the CPU is paired with Neural Cache.]

Neural Cache: more parallelism (++), less data movement (--).

Pages 4-9: Transforming caches into massively parallel vector ALUs

[Figure, built up over six slides: an 18-core Xeon processor with a 45 MB LLC, zooming from LLC slice to way to data bank to SRAM array to bitline ALU.]

Cache hierarchy shown:
- 18-core Xeon processor, 45 MB LLC, 18 LLC slices
- 2.5 MB LLC slice: Way 1 to Way 20, plus CBOX and TMU
- 32 kB data bank, made of 8 kB arrays
- 8 kB SRAM array: row decoders drive wordlines (WL) 0 to 255; bitline pairs (BL/BLB) 0 to 255 feed the logic below the array

Running totals: 18 LLC slices, 360 ways, 5760 arrays, 1,474,560 ALUs

In-array addition: Array A and Array B are stored as Bit-Slice 0 to Bit-Slice 3, and the bitline logic produces A + B.

Bitline ALU (labels in the figure): sense amplifiers (SA) on BL and BLB with reference Vref; A&B, ~A & ~B, A^B; carry latch (D, EN, Q, C) with C_EN, Cin, Cout; DR; sum S = A^B^C.

Summary (page 9):
- Passive Last Level Cache transformed into ~1 million bit-serial active ALUs
- Supported operations: Multiply, Divide, Add
- Bit-serial operation at 2.5 GHz
- Configurable precision
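The running totals on these slides can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check, not part of the deck: it assumes binary units (1 MB = 1024 kB), 256 bitlines per 8 kB array, and one bit-serial ALU per bitline, and the constant names are made up for illustration.

```python
# Cache geometry as stated on the slides.
LLC_SLICES = 18
WAYS_PER_SLICE = 20
SLICE_KB = 2.5 * 1024          # 2.5 MB LLC slice, assuming 1 MB = 1024 kB
ARRAY_KB = 8                   # 8 kB SRAM array
BITLINES_PER_ARRAY = 256       # assumption: one bit-serial ALU per bitline

ways = LLC_SLICES * WAYS_PER_SLICE
arrays_per_slice = int(SLICE_KB // ARRAY_KB)
arrays = LLC_SLICES * arrays_per_slice
alus = arrays * BITLINES_PER_ARRAY

print(ways, arrays, alus)  # 360 5760 1474560
```

Under these assumptions the numbers match the slide totals (360 ways, 5760 arrays, 1,474,560 ALUs).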

Pages 10-15: Why bit-serial? Bit-parallel arithmetic

[Figure, built up over six slides: an SRAM array with row decoders (wordlines 0 to 255), bitlines BL/BLB 0 to 255, and logic below the bitlines. Operands from Array A and Array B and the result A + B are stored with Word 3 to Word 0 side by side along a wordline. Wordlines WL1 and WL2 are activated; each bitline produces a sum bit (S), and carries (C) pass from one bitline to the next.]

Problems with bit-parallel arithmetic:
- Carry propagation across bitlines
- High complexity
- Loss of throughput and efficiency

Pages 16-21: Why bit-serial? Bit-serial arithmetic

[Figure, built up over six slides: the same SRAM array with transposed data. Word 3 to Word 0 each run down their own bitline, so each wordline holds one bit-slice (Bit-Slice 0 to Bit-Slice 3) of every word. Each bitline has Sum and Carry logic. WL1 and WL2 select a bit-slice of Array A and of Array B; the carries start at 0 0 0 0 and are then carried (C) on their own bitline from Cycle 1 through Cycle 4, with a sum bit (S) produced per bitline.]

Benefits of bit-serial arithmetic:
- Low area complexity
- High throughput
- Configurable and high precision
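The bit-serial scheme on these slides can be modelled in software to see why no carry ever crosses a bitline. The following is a minimal functional sketch, not the hardware design: each Python list index stands in for one bitline, each row for one wordline (bit-slice), and the helper names are invented for illustration.

```python
def to_bit_slices(words, n_bits):
    """Transposed layout: row i holds bit i (LSB first) of every word."""
    return [[(w >> i) & 1 for w in words] for i in range(n_bits)]


def from_bit_slices(slices):
    """Reassemble one integer per bitline from LSB-first bit rows."""
    lanes = len(slices[0])
    return [sum(row[lane] << i for i, row in enumerate(slices))
            for lane in range(lanes)]


def bit_serial_add(a_slices, b_slices):
    """Add every lane in parallel, consuming one bit-slice per cycle."""
    lanes = len(a_slices[0])
    carry = [0] * lanes                      # per-bitline carry latch starts at 0
    result = []
    for a_row, b_row in zip(a_slices, b_slices):        # one cycle per bit
        result.append([a ^ b ^ c for a, b, c in zip(a_row, b_row, carry)])
        carry = [(a & b) | ((a ^ b) & c)
                 for a, b, c in zip(a_row, b_row, carry)]
    result.append(carry)                     # one more cycle stores the final carry
    return result


a = [3, 5, 10, 15]
b = [1, 7, 6, 15]
sums = from_bit_slices(bit_serial_add(to_bit_slices(a, 4), to_bit_slices(b, 4)))
print(sums)  # [4, 12, 16, 30]
```

Four lanes are used here to mirror Word 0 to Word 3 in the figure; the same loop would serve any number of lanes, since each lane only ever touches its own carry.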

Page 22: Outline

• Motivation
• Bit-Serial Arithmetic
• Transpose
• Mapping of Convolution to Array
• Methodology
• Results

Page 23: In-SRAM Arithmetic

[Figure: the cache hierarchy from the earlier slides (18-core Xeon processor, 45 MB LLC; 2.5 MB LLC slice with Way 1 to Way 20, CBOX, and TMU; 32 kB data bank; 8 kB SRAM array; bitline ALU with S = A^B^C), with Array A, Array B, and A + B stored as Bit-Slice 0 to Bit-Slice 3.]

18 LLC slices, 360 ways, 5760 arrays, 1,474,560 ALUs

Pages 24-26: Logical Operations In-SRAM

[Figure: a baseline SRAM array with a row decoder, wordlines, bitline pairs BLB0/BL0 to BLBn/BLn, and differential sense amplifiers (SA).]

Changes:
- Additional row decoder
- Reconfigurable sense amplifiers: single-ended sense amplifiers, each comparing one bitline against Vref

[Figure: rows A and B are selected through the two row decoders. With the single-ended sense amplifiers, one bitline of each pair reads out A AND B and the other reads out A NOR B.]
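A purely functional model of this slide can make the logic concrete. The sketch below assumes that BL yields A AND B and BLB yields A NOR B (my reading of the figure labels), and derives XOR from those two outputs as the Bitline ALU labels (A&B, ~A & ~B, A^B) suggest. It models no electrical behaviour, and `sense_two_rows` is a hypothetical name.

```python
def sense_two_rows(row_a, row_b):
    """Model reading two simultaneously selected rows, one result per bitline."""
    # Assumed mapping: BL senses A AND B.
    bl = [a & b for a, b in zip(row_a, row_b)]
    # Assumed mapping: BLB senses ~A & ~B, i.e. A NOR B.
    blb = [(1 - a) & (1 - b) for a, b in zip(row_a, row_b)]
    # XOR is true exactly when neither AND nor NOR is true.
    xor = [1 - (x | y) for x, y in zip(bl, blb)]
    return bl, blb, xor


# All four input combinations, one per bitline.
and_bits, nor_bits, xor_bits = sense_two_rows([0, 0, 1, 1], [0, 1, 0, 1])
print(and_bits)  # [0, 0, 0, 1]
print(nor_bits)  # [1, 0, 0, 0]
print(xor_bits)  # [0, 1, 1, 0]
```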

Pages 27-30: Addition In-SRAM

[Figure: operand rows A0, A1 and B0, B1 and result rows P0, P1, P2 spread across 256 bitlines, selected through two row decoders. Each bitline pair BLB/BL feeds single-ended sense amplifiers (SA, Vref) and a bitline ALU with Sum and Carry: A&B, ~A & ~B, A^B, carry latch (D, EN, Q, C) with C_EN, Cin, Cout, DR, and S = A^B^C.]

Addition [Cycle 1] to Addition [Cycle 3]: the slides step through a two-bit example on two bitlines, one cycle at a time, updating the Sum and Carry values and filling in the result rows P0, P1, and P2.

Pages 31-38: Multiplication In-SRAM

[Figure: the same array as for addition, with operand rows A0, A1 and B0, B1, product rows P0 to P3, and a Tag latch beside Sum and Carry on each bitline.]

Two-bit example:

          A1    A0
    x     B1    B0
    --------------
        A1B0  A0B0
  A1B1  A0B1
  ----------------
    P2    P1    P0

Assignments as the slides add them:
- Multiplication [Cycle 1]: figure only (Tag latch shown)
- Multiplication [Cycle 2]: P0 <- A0B0
- Multiplication [Cycle 3]: P1 <- A1B0
- Multiplication [Cycle 4]: no new assignment
- Multiplication [Cycle 5]: P1 <- A1B0 + A0B1, computed as P1 <- P1 + A0B1: If(B1), P1 <- P1 + A0; Else, P1 <- P1
- Multiplication [Cycle 6]: P2 <- A1B1
- Multiplication [Cycle 7]: P3 <- Cin
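The predicated update on these slides ("If(B1), P1 <- P1 + A0; Else, P1 <- P1") is the core of a shift-and-add multiplier. The sketch below is a functional model under stated assumptions, not the deck's exact cycle schedule: the tag latch is assumed to hold the current multiplier bit, predication is modelled by masking the multiplicand bit with the tag (functionally equivalent to skipping the write), and all function names are hypothetical.

```python
def to_bit_slices(words, n_bits):
    """Row i holds bit i (LSB first) of every word; one word per bitline."""
    return [[(w >> i) & 1 for w in words] for i in range(n_bits)]


def from_bit_slices(slices):
    """Reassemble one integer per bitline from LSB-first bit rows."""
    lanes = len(slices[0])
    return [sum(row[lane] << i for i, row in enumerate(slices))
            for lane in range(lanes)]


def bit_serial_multiply(a_slices, b_slices):
    """Shift-and-add over bit rows; every lane works independently."""
    n = len(a_slices)
    lanes = len(a_slices[0])
    product = [[0] * lanes for _ in range(2 * n)]   # rows P0 .. P(2N-1)
    for j in range(n):
        tag = list(b_slices[j])          # tag latch <- bit j of the multiplier
        carry = [0] * lanes              # carry latch cleared for this pass
        for i in range(n):
            for lane in range(lanes):
                a = a_slices[i][lane] & tag[lane]   # add A only where tag is set
                old = product[i + j][lane]
                c = carry[lane]
                product[i + j][lane] = a ^ old ^ c
                carry[lane] = (a & old) | ((a ^ old) & c)
        product[j + n] = carry           # final carry becomes the next product bit
    return product


a = [3, 2, 1, 3]
b = [3, 1, 0, 2]
products = from_bit_slices(
    bit_serial_multiply(to_bit_slices(a, 2), to_bit_slices(b, 2)))
print(products)  # [9, 2, 0, 6]
```

With two-bit operands the model fills four product rows, matching P0 to P3 in the figure; the last row is written from the carry, as in "P3 <- Cin".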

Page 39: Supported Arithmetic

Operation     Cycles
ADD           N+1
SUB           2N+1
MUL           N^2 + 5N - 2
DIV           1.5N^2 + 5.5N
Comparison    2N+1

Synthesized array: 7.5% area overhead
Processor chip: 2% area overhead
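To get a feel for these formulas, the sketch below evaluates them for a given N. It assumes N is the operand width in bits, which the slide does not state explicitly; the function name `cycles` and the short key "CMP" for Comparison are illustrative choices.

```python
def cycles(op, n):
    """Cycle count for an n-bit operation, using the formulas from the table."""
    table = {
        "ADD": n + 1,
        "SUB": 2 * n + 1,
        "MUL": n * n + 5 * n - 2,
        # 1.5N^2 + 5.5N, written as (3N^2 + 11N) / 2 to stay in integers.
        "DIV": (3 * n * n + 11 * n) // 2,
        "CMP": 2 * n + 1,
    }
    return table[op]


for op in ("ADD", "SUB", "MUL", "DIV", "CMP"):
    print(op, cycles(op, 8))
# ADD 9, SUB 17, MUL 102, DIV 140, CMP 17
```

For 8-bit operands, the precision used in the evaluation slides, a multiply costs 102 cycles per lane, so throughput depends on how many bitlines work in parallel rather than on single-operation latency.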

Page 40: Outline

• Motivation
• Bit-Serial Arithmetic
• Transpose
• Mapping of Convolution to Array
• Methodology
• Results

Page 41: Transpose

[Figure: an LLC slice (Way 1, Way 2, ..., Way 19, Way 20, CBOX) with the TMU expanded. The TMU is an array of 8-T transpose bit-cells with a row decoder, a column decoder, control logic, and sense amplifier/driver pairs (SA, DR) along two sides. One side serves regular read/write and the other serves transpose read/write. Element bits are labelled A0[MSB] ... A0[LSB], A1[MSB] ... A1[LSB], A2[MSB] ... A2[LSB] and B0[MSB] ... B0[LSB], B1[MSB] ... B1[LSB], B2[MSB] ... B2[LSB].]

Page 42: Transpose

[Figure: the TMU takes the rows (A2 A1 A0), (B2 B1 B0), and (C2 C1 C0) and emits them transposed, so that A0, A1, A2 form one column, B0, B1, B2 another, and C0, C1, C2 a third.]
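In software terms, the transpose shown here turns word-per-row storage into bit-slice-per-row storage. The sketch below is only an illustration of that data movement, not a model of the TMU hardware; it assumes MSB-first row order and the function names are hypothetical.

```python
def transpose_words(words, n_bits):
    """Return n_bits rows; row k holds one bit of every word, MSB row first."""
    return [[(w >> bit) & 1 for w in words]
            for bit in range(n_bits - 1, -1, -1)]


def untranspose(rows):
    """Inverse: rebuild each word by reading down its column, MSB first."""
    n_words = len(rows[0])
    words = [0] * n_words
    for row in rows:
        for col in range(n_words):
            words[col] = (words[col] << 1) | row[col]
    return words


rows = transpose_words([5, 3, 6], 3)   # 0b101, 0b011, 0b110
print(rows)               # [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
print(untranspose(rows))  # [5, 3, 6]
```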

Page 43: Outline

• Motivation
• Transpose
• Bit-Serial Arithmetic
• Mapping of Convolution to Array
• Methodology
• Results

Page 44: A Convolutional Layer

[Figure: input activations, filters, and output activations drawn as 3D volumes.]

- Input Activations (C channels), each H x W
- 3D Filters (M): each filter has C channels; each channel has RxS weights
- Output Activations (M channels), each E x F
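The dimension names on this slide map directly onto a loop nest. The sketch below is a plain reference implementation for orientation only, not the Neural Cache mapping: it assumes stride 1 and no padding, and treats H, E, R as heights and W, F, S as widths, none of which the slide states.

```python
def conv_layer(inputs, filters):
    """inputs[c][h][w], filters[m][c][r][s] -> outputs[m][e][f]."""
    C, H, W = len(inputs), len(inputs[0]), len(inputs[0][0])
    M, R, S = len(filters), len(filters[0][0]), len(filters[0][0][0])
    E, F = H - R + 1, W - S + 1          # assumption: stride 1, no padding
    outputs = [[[0] * F for _ in range(E)] for _ in range(M)]
    for m in range(M):                   # one output channel per 3D filter
        for e in range(E):
            for f in range(F):
                acc = 0
                for c in range(C):       # reduction over input channels
                    for r in range(R):
                        for s in range(S):
                            acc += inputs[c][e + r][f + s] * filters[m][c][r][s]
                outputs[m][e][f] = acc
    return outputs


# C = 2 channels of 3x3 ones, M = 1 filter with 2x2 ones per channel.
ones_in = [[[1] * 3 for _ in range(3)] for _ in range(2)]
ones_filt = [[[[1] * 2 for _ in range(2)] for _ in range(2)]]
out = conv_layer(ones_in, ones_filt)
print(out)  # [[[8, 8], [8, 8]]]
```

Each output value is a sum of C x R x S multiply-accumulates, which is the work the later slides split into per-channel MACs followed by a reduction.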

Page 45: Mapping CNN to Neural Cache

[Figure: the filter weights (M filters, C channels, RxS each) and the matching RxS windows of the input activations (C channels, H x W) are unrolled. A MAC produces a partial sum for each of the C channels, and a reduction (∑) across the C channels gives 1 output activation in the M x E x F output volume.]

Contents of one 8 kB SRAM array (256 wordlines x 256 bitlines), as labelled in the figure:
- Input Activation: RxSx8
- Weights: RxSx8
- Partial Sum: 4x8
- Output: 4x8

Page 46: Mapping CNN to Neural Cache

[Figure: a 2.5 MB LLC slice laid out as Way 1, Way 2, Way 3, ..., Way 20 by Quad 1 to Quad 4, holding M = 32 filters for Output Position 1, Output Position 2, .... Within one 8 kB SRAM array (256 wordlines x 256 bitlines), Filter 1 (C = 256) places channel 1, channel 2, channel 3, channel 4, ..., channel 256 on the bitlines, each with Input Activation (RxSx8), Weights (RxSx8), and Partial Sum (4x8). The output volume is M x E x F.]

Page 47: Mapping of Convolution to Array

[Figure: the M x E x F output volume is distributed across Slice 1 to Slice 14; within each slice, Way 1-18 and Way 19-20 are drawn as separate groups.]

Page 48: Put it together

[Figure: Core 1 to Core 14 and LLC Slice 1 to LLC Slice 14 on a ring interconnect, connected to DRAM. Each 2.5 MB LLC slice shows Way 1, Way 2, Way 3, ... by Quad 1 to Quad 4, with Way 19 marked Reserved. Filter weights, input activations, and output activations are the data shown moving through the system.]

Steps:
1. Filter Loading
2. Input Loading
3. MAC + Reduction
4. Output Transfer

Page 49: Outline

• Motivation
• Transpose
• Bit-Serial Arithmetic
• Mapping of Convolution to Array
• Methodology
• Results

Page 50: Evaluation Methodology

Processor
- CPU (2 sockets): Intel Xeon E5-2597 v3, 2.6 GHz, 28 cores, 56 threads
- GPU (1 card): Nvidia Titan Xp, 1.6 GHz, 3840 CUDA cores
- Neural Cache: 2.5 GHz ComputeSRAM, 1,032,192 bit-serial ALUs

On-chip memory
- CPU (2 sockets): 78.96 MB
- GPU (1 card): 9.14 MB
- Neural Cache: 70 MB (Dual Socket)

Off-chip memory
- CPU (2 sockets): 64 GB DRAM
- GPU (1 card): 12 GB DRAM
- Neural Cache: 64 GB DRAM

Profiler / Simulator (Performance)
- CPU (2 sockets): TensorFlow tfprof
- GPU (1 card): TensorFlow tfprof
- Neural Cache: Cycle accurate simulator + C Microbench

Profiler / Simulator (Energy)
- CPU (2 sockets): Intel RAPL Interface
- GPU (1 card): NVIDIA System Management Interface
- Neural Cache: SPICE simulation + Intel RAPL Interface

DNN Models
- Inception V3
- 8-bit weights and inputs

Page 51: Outline

• Motivation
• Transpose
• Bit-Serial Arithmetic
• Mapping of Convolution to Array
• Methodology
• Results

Page 52: Throughput and Latency

Throughput
[Figure: throughput (inferences / sec, axis 0 to 700) against batch size (1, 4, 16, 64, 256) for CPU - Xeon E5, GPU - Titan Xp, and Neural Cache.]
2.2x improved throughput over GPU

Latency
[Figure: latency (ms, axis 0 to 100) for CPU, GPU, and Neural Cache.]
7.7x latency improvement over GPU

Page 53: Power/Energy Comparison

[Figure: Total Energy (Joules) and Avg Power (Watts) for CPU, GPU, and Neural Cache, plotted against two vertical axes.]

Page 54: Neural Cache Summary

Repurpose Cache to Data Parallel DNN Accelerator

- Massively Parallel Bit-Serial In-SRAM Arithmetic
- Data Layout for CNNs

- 12x, 20x over server class CPU at 2% area overhead
- 2x, 16x over server class GPU
