The Future of Computing: Domain-Specific Accelerators Micro October 15, 2019 Bill Dally Chief Scientist and SVP of Research, NVIDIA Corporation Professor (Research), Stanford University


TRANSCRIPT

Page 1

The Future of Computing: Domain-Specific Accelerators

Micro, October 15, 2019

Bill Dally, Chief Scientist and SVP of Research, NVIDIA Corporation

Professor (Research), Stanford University

Page 2

Faster Computing

Better Algorithms

More Data

Value

Page 3

3

We need to continue delivering improved performance and perf/W

Page 4

4

But Process Technology isn’t Helping us Anymore

Moore’s Law is Dead

Page 5

John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e, 2018

Page 6

6

Accelerators can continue scaling perf and perf/W

Page 7

Fast Accelerators since 1985
• Mossim Simulation Engine: Dally, W.J. and Bryant, R.E., 1985. A hardware architecture for switch-level simulation. IEEE Trans. CAD, 4(3), pp. 239-250.
• MARS Accelerator: Agrawal, P. and Dally, W.J., 1990. A hardware logic simulation system. IEEE Trans. CAD, 9(1), pp. 19-29.
• Reconfigurable Arithmetic Processor: Fiske, S. and Dally, W.J., 1988. The reconfigurable arithmetic processor. ISCA 1988.
• Imagine: Kapasi, U.J., Rixner, S., Dally, W.J., Khailany, B., Ahn, J.H., Mattson, P. and Owens, J.D., 2003. Programmable stream processors. Computer, 36(8), pp. 54-62.
• ELM: Dally, W.J., Balfour, J., Black-Schaffer, D., Chen, J., Harting, R.C., Parikh, V., Park, J. and Sheffield, D., 2008. Efficient embedded computing. Computer, 41(7).
• EIE: Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A. and Dally, W.J., 2016. EIE: efficient inference engine on compressed deep neural network. ISCA 2016.
• SCNN: Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S.W. and Dally, W.J., 2017. SCNN: an accelerator for compressed-sparse convolutional neural networks. ISCA 2017.
• Darwin: Turakhia, Y., Bejerano, G. and Dally, W.J., 2018. Darwin: a genomics co-processor provides up to 15,000x acceleration on long read assembly. ASPLOS 2018.
• SATiN: Zhuo, Rucker, Wang and Dally. Hardware for Boolean satisfiability inference. Under review.

Page 8

Accelerators Employ:
• Massive parallelism (>1,000x, not 16x) with locality
• Special data types and operations: do in 1 cycle what normally takes 10s or 100s of cycles
• Optimized memory: high bandwidth (and low energy) for specific data structures and operations
• Reduced or amortized overhead
• Algorithm-architecture co-design

Page 9

Specialized Hardware is Everywhere

• Cell phones
– Software on ARM cores: high-complexity, low-compute work
– Accelerators do the heavy lifting: MODEMs, CODECs, camera image processing, DNNs, graphics
• GPUs
– Rasterizer, texture filter, compositing, compression/decompression, tensor computations, BVH traversal, ray-triangle intersection
• The specialized hardware does most of the work, but is mostly invisible

Page 10

10

Specialized Operations

Orders of Magnitude Efficiency

Moderate Speedup

Page 11

Specialized Operations

Dynamic programming for gene sequence alignment (Smith-Waterman)

On 14nm CPU                     On 40nm Special Unit
35 ALU ops, 15 load/store       1 cycle (37x speedup)
37 cycles                       3.1 pJ (26,000x efficiency)
81 nJ                           300 fJ for logic (remainder is memory)
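As a concrete software reference point, the dynamic-programming recurrence that the specialized unit evaluates in one cycle looks like this (the scoring constants are illustrative assumptions; the deck does not give them):

```python
# Reference implementation of the Smith-Waterman recurrence. The
# specialized 40nm unit evaluates one sw_cell() per cycle, where a CPU
# spends ~35 ALU ops and 15 loads/stores. Scoring constants below are
# illustrative assumptions.
MATCH, MISMATCH, GAP = 2, -1, -1

def sw_cell(h_diag, h_up, h_left, q_base, r_base):
    """One DP cell: H[i][j] from its three neighbors (linear gap)."""
    sub = MATCH if q_base == r_base else MISMATCH
    return max(0, h_diag + sub, h_up + GAP, h_left + GAP)

def smith_waterman(query, ref):
    """Fill the full DP matrix; return the best local-alignment score."""
    H = [[0] * (len(ref) + 1) for _ in range(len(query) + 1)]
    best = 0
    for i in range(1, len(query) + 1):
        for j in range(1, len(ref) + 1):
            H[i][j] = sw_cell(H[i - 1][j - 1], H[i - 1][j], H[i][j - 1],
                              query[i - 1], ref[j - 1])
            best = max(best, H[i][j])
    return best
```

An array of such cells, each updated every cycle, is what turns the per-cell specialization win into the much larger total speedup quoted later in the deck.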

Page 12

Why is a Specialized PE 26,000x More Efficient?

OOO CPU Instruction – 250pJ (99.99% overhead, ARM A-15)

Area is proportional to energy – all 28nm

16b Int Add, 32fJ

Evangelos Vasilakis. 2015. An Instruction Level Energy Characterization of Arm Processors. Foundation of Research and Technology Hellas, Inst. of Computer Science, Tech. Rep. FORTH-ICS/TR-450 (2015)

Page 13

13

Specialization -> Efficiency

Efficiency -> Parallelization

Parallelization -> Speedup

Page 14

Specialized Operations

Dynamic programming for gene sequence alignment (Smith-Waterman)

Specialization -> 37x speedup, 26,000x efficiency, 270,000x for logic

Efficiency -> Parallelism 64 PE arrays x 64 PEs per array, 4,096x total

Speedup = 37 (Specialization) x 4,034 (Parallelism) = 150,000x total

Page 15

Page 16

16

Accelerator Design is Guided by Cost

Arithmetic is Free(particularly low-precision)

Memory is expensive

Communication is prohibitively expensive

Page 17

Need to Understand Cost of Operations and Communication

Operation             Energy (pJ)   Area (µm²)
8b Add                0.03          36
16b Add               0.05          67
32b Add               0.1           137
16b FP Add            0.4           1360
32b FP Add            0.9           4184
8b Mult               0.2           282
32b Mult              3.1           3495
16b FP Mult           1.1           1640
32b FP Mult           3.7           7700
32b SRAM Read (8KB)   5             N/A
32b DRAM Read         640           N/A

Energy numbers are from Mark Horowitz, "Computing's Energy Problem (and what we can do about it)", ISSCC 2014. Area numbers are from synthesized results using Design Compiler under the TSMC 45nm tech node; FP units use the DesignWare library.
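A back-of-envelope sketch using the table's numbers (the dictionary keys are our shorthand) makes the gap concrete: one 32b DRAM read costs as much as 12,800 16b integer adds.

```python
# Back-of-envelope energy accounting with the per-op costs above
# (45nm-era numbers; key names are our shorthand, not from the deck).
ENERGY_PJ = {
    "add8": 0.03, "add16": 0.05, "add32": 0.1,
    "fadd16": 0.4, "fadd32": 0.9,
    "mul8": 0.2, "mul32": 3.1, "fmul16": 1.1, "fmul32": 3.7,
    "sram32_8kB": 5.0, "dram32": 640.0,
}

def total_energy_pj(op_counts):
    """Total energy in pJ for a mix of operations."""
    return sum(ENERGY_PJ[op] * n for op, n in op_counts.items())

# One 32b DRAM read buys 12,800 16b integer adds:
dram_vs_add16 = ENERGY_PJ["dram32"] / ENERGY_PJ["add16"]
```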

Page 18

Communication is Expensive, Be Small, Be Local

LPDDR DRAM (GB):    640 pJ/word
On-chip SRAM (MB):   50 pJ/word
Local SRAM (KB):      5 pJ/word

Page 19

Scaling of Communication

[Chart: energy per operation (pJ, 0-350 scale) for DFMA 40nm, DFMA 10nm, Wire 40nm, Wire 10nm]

Keckler et al. Micro 2011.

Page 20

20

The Algorithm Often Has to Change

To Avoid Being Global Memory Limited

Page 21

Algorithm-Architecture Co-Design for Darwin
Start with Graphmap

[Chart: time/read (ms, log scale 0.1-100,000), split into filtration and alignment time]
1. Graphmap (software): ~10K seeds, ~440M hits (filtration); ~3 hits, then ~1 hit (alignment)

Page 22

Algorithm-Architecture Co-Design for Darwin
Replace Graphmap with Hardware-Friendly Algorithms: Speed up Filtering by 100x, but 2.1x Slowdown Overall

[Chart: time/read (ms, log scale 0.1-100,000), split into filtration and alignment time]
1. Graphmap (software): ~10K seeds, ~440M hits (filtration); ~3 hits, then ~1 hit (alignment)
2. Replace by D-SOFT and GACT (software): ~2K seeds, ~1M hits; D-SOFT filtration leaves ~1680 hits, GACT alignment ~1 hit; 2.1x slowdown

Page 23

Algorithm-Hardware Co-Design for Darwin
Accelerate Alignment: 380x Speedup

[Chart: time/read (ms, log scale 0.1-100,000), filtration and alignment]
1. Graphmap (software)
2. Replace by D-SOFT and GACT (software): 2.1x slowdown
3. GACT hardware acceleration: 380x speedup

Page 24

Algorithm-Hardware Co-Design for Darwin
4x Memory Parallelism: 3.9x Speedup

[Diagram: four DRAM channels, each feeding a seed-position lookup (SPL) unit]
1. Graphmap (software)
2. Replace by D-SOFT and GACT (software): 2.1x slowdown
3. GACT hardware acceleration: 380x speedup
4. Four DRAM channels for D-SOFT: 3.9x speedup

Page 25

Algorithm-Hardware Co-Design for Darwin
Specialized Memory for D-SOFT Bin Updates: 15.6x Speedup

[Diagram: four DRAM channels with SPL units, plus update-bin logic (UBL) blocks backed by bin-count SRAMs]
1. Graphmap (software)
2. Replace by D-SOFT and GACT (software): 2.1x slowdown
3. GACT hardware acceleration: 380x speedup
4. Four DRAM channels for D-SOFT: 3.9x speedup
5. Move bin updates in D-SOFT to SRAM (ASIC): 15.6x speedup

Page 26

Algorithm-Hardware Co-Design for Darwin
Pipeline D-SOFT and GACT (now completely D-SOFT limited): 1.4x
Overall: 15,000x

[Diagram: D-SOFT (software) feeding 64 GACT units (GACT0-GACT63) in a pipeline]
1. Graphmap (software)
2. Replace by D-SOFT and GACT (software): 2.1x slowdown
3. GACT hardware acceleration: 380x speedup
4. Four DRAM channels for D-SOFT: 3.9x speedup
5. Move bin updates in D-SOFT to SRAM (ASIC): 15.6x speedup
6. Pipeline D-SOFT and GACT: 1.4x speedup

Page 27

27

Memory Dominates

Page 28

28

Memory dominates power and area

Page 29

Memory Dominates

Page 30

31

Algorithms must be memory optimized

Minimize global memory accesses

Keep local memory footprint small

Page 31

GACT Alignment
• 15M reads, 10k bases each, ~2k hits each
– ~300T alignments to be done
– Additional parallelism within each alignment
• But long reads have a large (10M) memory footprint
• Solution: GACT (tiling)

Page 32


Darwin GACT hardware:
• 4k PEs (64 PEs per array x 64 arrays)
• ~50 operations per cycle per PE, 200k operations per cycle total
• Specialized memory
• 150,000x speedup vs. CPU
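A toy sketch of the tiling idea: walk the alignment one tile at a time so the DP working set stays bounded no matter how long the read is. The tile size, scoring constants, and the greedy tile walk below are illustrative; the real GACT overlaps tiles and performs traceback within each tile.

```python
# Toy sketch of tiled alignment: the DP working set is O(tile^2)
# regardless of read length. All constants are illustrative.
def sw_tile(q, r, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman on one tile; return best score and its cell."""
    H = [[0] * (len(r) + 1) for _ in range(len(q) + 1)]
    best, cell = 0, (0, 0)
    for i in range(1, len(q) + 1):
        for j in range(1, len(r) + 1):
            s = match if q[i - 1] == r[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, cell = H[i][j], (i, j)
    return best, cell

def tiled_align(query, ref, tile=32):
    """Greedily extend an alignment from (0, 0) tile by tile."""
    qi = rj = score = 0
    while qi < len(query) and rj < len(ref):
        s, (di, dj) = sw_tile(query[qi:qi + tile], ref[rj:rj + tile])
        if (di, dj) == (0, 0):      # no positive-scoring extension
            break
        score += s
        qi, rj = qi + di, rj + dj
    return score
```

The point of the structure is the memory bound: each iteration touches only one tile-sized matrix, which is what lets thousands of PEs run alignments in parallel out of small local SRAMs.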

Page 33

34

On-Chip Memory Cost per Bit is 10-100x Commodity DRAM

And It’s Often Less Expensive

Page 34

D-SOFT: Algorithm Overview

AGCTTTACCTACGTAGCTGCATCTATTTCTCGTATTTAGC

GTGCTTGGATATA

[Figure: seed hits plotted against the reference fall on diagonals of slope 1; five diagonal bins (Bin 1-Bin 5), each tracking a bin count (bases) and a last-hit offset, all initialized to 0 / -inf]

Page 35

D-SOFT: Algorithm Overview

AGCTTTACCTACGTAGCTGCATCTATTTCTCGTATTTAGC

GTGCTTGGATATA

[Figure: pointer-table and position-table lookups for seeds GG, GT, TA; bin counts now (0, 0, 2, 2, 0) with last-hit offsets (-inf, -inf, 0, 0, -inf)]

Page 36

D-SOFT: Algorithm Overview

AGCTTTACCTACGTAGCTGCATCTATTTCTCGTATTTAGC

GTGCTTGGATATA

[Figure: lookups for seeds GA, GC, GG; bin counts (2, 0, 4, 2, 2), last-hit offsets (2, -inf, 2, 0, 2)]

Page 37

D-SOFT: Algorithm Overview

AGCTTTCCCTACGTAGCTGCATCTATTTCTCGTATTTAGC

GTGCTTGGATATA

[Figure: lookups for seeds CG, CT, GA; bin counts (3, 2, 5, 4, 2), last-hit offsets (3, 3, 3, 3, 2)]

Page 38

D-SOFT: Algorithm Overview

AGCTTTCCCTACGTAGCTGCATCTATTTCTCGTATTTAGC

GTGCTTGGATATA

[Figure: lookups for seeds TC, TG, TT; bin counts (4, 2, 5, 5, 2), last-hit offsets (4, 3, 3, 4, 2)]

Page 39

D-SOFT: Algorithm Overview

AGCTTTCCCTACGTAGCTGCATCTATTTCTCGTATTTAGC

GTGCTTGGATATA

[Figure: a further lookup pushes the third bin to count 7 (last-hit offset 5); bin counts (4, 2, 7, 5, 2)]

Page 40

D-SOFT: Algorithm Overview

AGCTTTCCCTACGTAGCTGCATCTATTTCTCGTATTTAGC

GTGCTTGGATATA

Parameters:
• k: seed size
• N: number of seeds
• h: threshold on non-overlapping bases
• B: bin size (number of bases, fixed to 128)

(k = 2, N = 6, h = 6)

[Figure: final bin counts (4, 2, 7, 5, 2) with last-hit offsets (4, 3, 5, 4, 2)]
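A minimal software sketch of the binning scheme, using the parameters above. The hardware's pointer and position tables are modeled as a Python dict, and the overlap bookkeeping is simplified; this is an illustration of the idea, not the production algorithm.

```python
from collections import defaultdict

def dsoft(query, ref, k=2, N=6, h=6, B=128):
    """D-SOFT sketch: count non-overlapping matched bases per diagonal
    bin; bins that reach h bases become candidate alignment positions."""
    # Seed-position index over the reference (the hardware keeps this
    # as pointer and position tables in DRAM).
    positions = defaultdict(list)
    for p in range(len(ref) - k + 1):
        positions[ref[p:p + k]].append(p)

    bin_count = defaultdict(int)                    # bases counted per bin
    last_hit = defaultdict(lambda: float("-inf"))   # last query offset seen
    candidates = []
    for j in range(min(N, len(query) - k + 1)):     # N seeds from the query
        for p in positions[query[j:j + k]]:
            b = (p - j) // B                        # diagonal band (bin)
            overlap = max(0, last_hit[b] + k - j)   # bases already counted
            bin_count[b] += k - overlap
            last_hit[b] = j
            if bin_count[b] >= h:
                candidates.append((b, p))
    return bin_count, candidates
```

For example, with k = 2, N = 5, h = 6 on two identical 6-base strings, the slope-1 bin accumulates 6 non-overlapping bases and fires one candidate.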

Page 41

D-SOFT: Hardware-acceleration

[Diagram: reads R0-R31 enter an arbiter; four seed-position lookup (SPL) units, each with its own DRAM channel, map (seed, j) to candidate_pos; a 16-endpoint butterfly network-on-chip routes (bin, j) updates to 16 update-bin logic (UBL) blocks, each with a bin-count SRAM and a non-zero-bins SRAM]

Page 42

Cost has a Time Component

C = T(B1N1 + B2N2 + … + P)

                T      B1    N1     B2   N2     C
Darwin Filter   1      100   64M    1    128G   134G
All DRAM        15.6   1     128G   -    -      1,997G
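Plugging the table's rows into the cost model reproduces its totals (a sketch; the trailing P term from the formula is taken as zero here):

```python
# Evaluate the cost model C = T * (B1*N1 + B2*N2 + ... + P), where T is
# a time factor, the Ni are access counts, the Bi per-access cost
# weights, and P any remaining constant term (zero in this sketch).
def cost(T, terms, P=0.0):
    return T * (sum(B * N for B, N in terms) + P)

G, M = 1e9, 1e6
darwin_filter = cost(1.0, [(100, 64 * M), (1, 128 * G)])   # ~134G
all_dram = cost(15.6, [(1, 128 * G)])                      # ~1,997G
```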

Page 43

44

Hardware Enables Irregular, Compressed Data Structures

(reduces memory footprint)

Page 44

Dynamic Sparse Activations, Static Sparse Weights

$$
\begin{pmatrix}
w_{0,0} & w_{0,1} & 0 & w_{0,3}\\
0 & 0 & w_{1,2} & 0\\
0 & w_{2,1} & 0 & w_{2,3}\\
0 & 0 & 0 & 0\\
0 & 0 & w_{4,2} & w_{4,3}\\
w_{5,0} & 0 & 0 & 0\\
0 & 0 & 0 & w_{6,3}\\
0 & w_{7,1} & 0 & 0
\end{pmatrix}
\begin{pmatrix} 0\\ a_1\\ 0\\ a_3 \end{pmatrix}
=
\begin{pmatrix} b_0\\ b_1\\ -b_2\\ b_3\\ -b_4\\ b_5\\ b_6\\ -b_7 \end{pmatrix}
\xrightarrow{\mathrm{ReLU}}
\begin{pmatrix} b_0\\ b_1\\ 0\\ b_3\\ 0\\ b_5\\ b_6\\ 0 \end{pmatrix}
$$

Rows are interleaved across PE0-PE3; PE0's rows (0 and 4) are stored compressed:

Virtual weight:   W0,0  W0,1  W4,2  W0,3  W4,3
Relative index:   0     1     2     0     0
Column pointer:   0     1     2     3

Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A. and Dally, W.J., 2016, June. EIE: efficient inference engine on compressed deep neural network, ISCA 2016
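A minimal sketch of a matrix-vector multiply over such a compressed layout, with ReLU. For clarity this uses plain CSC with absolute row indices and a single PE; EIE itself interleaves rows across PEs and stores 4-bit relative indices. The example weight values are illustrative.

```python
# Minimal CSC-style sparse matvec with ReLU, in the spirit of EIE's
# compressed format (absolute row indices used here for clarity).
def sparse_matvec_relu(values, row_idx, col_ptr, a, n_rows):
    b = [0.0] * n_rows
    for j, aj in enumerate(a):
        if aj == 0:                 # dynamic sparsity: skip zero activations
            continue
        for k in range(col_ptr[j], col_ptr[j + 1]):
            b[row_idx[k]] += values[k] * aj
    return [max(x, 0.0) for x in b]

# PE0's slice of the example matrix (rows 0 and 4 as local rows 0, 1);
# weight values are made up for the illustration.
values = [1.0, 2.0, 3.0, 4.0, -5.0]    # W0,0 W0,1 W4,2 W0,3 W4,3
row_idx = [0, 0, 1, 0, 1]
col_ptr = [0, 1, 2, 3, 5]
out = sparse_matvec_relu(values, row_idx, col_ptr, [0.0, 2.0, 0.0, 1.0], 2)
```

Skipping zero activations at the column level is the "dynamic sparse activations" half of the slide's title; the compressed weight layout is the static half.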

Page 45

Trained Quantization

Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, arXiv 2015

Page 46

EIE Hardware
[Diagram: 4x4 grid of PEs with central control]

Page 47

48

Simple Parallelism Often Beats a “Better” Algorithm

Page 48

Broadcast Literals to Associative Memory of Clauses
More Comparisons but Lower Latency

Page 49

50

Need to Accelerate the Whole Problem

Page 50

Implication: 300-500x Faster than CPU

Number of Cycles per Boolean Constraint Propagation (log scale in the original chart):

Benchmark                  CPU     Accelerator (three components = total)
bench_7666.smt2.cnf        2105    2.75 + 0.58 + 0.40 = 3.73
bench_7222.smt2.cnf        1180    3.47 + 0.60 + 0.50 = 4.57
bench_8061.smt2.cnf        1357    3.63 + 0.66 + 0.54 = 4.83
bench_13535.smt2.cnf       923     2.14 + 0.35 + 0.61 = 3.10
qg01-08.cnf                1993    2.07 + 0.26 + 0.44 = 2.77
SAT_instance_N=33.cnf      775     1.02 + 0.18 + 0.12 = 1.32
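The per-benchmark cycle-count ratios implied by the table can be computed directly (wall-clock speedup additionally depends on the relative clock rates of the CPU and the accelerator):

```python
# Cycles per Boolean constraint propagation (BCP) from the table;
# accelerator cycles are the sum of the three listed components.
CYCLES = {
    "bench_7666.smt2.cnf":   (2105, 3.73),
    "bench_7222.smt2.cnf":   (1180, 4.57),
    "bench_8061.smt2.cnf":   (1357, 4.83),
    "bench_13535.smt2.cnf":  (923,  3.10),
    "qg01-08.cnf":           (1993, 2.77),
    "SAT_instance_N=33.cnf": (775,  1.32),
}
speedups = {name: cpu / acc for name, (cpu, acc) in CYCLES.items()}
```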

Page 51

Need to Accelerate the Whole Problem

Page 52

57

Platforms for Acceleration

Page 53

58

Implementation Alternatives

[Figure: implementation alternatives, with GDDR6 and LPDDR4 memory parts pictured]

Page 54

GPUs Provide:
• High-bandwidth, hierarchical memory system
– Can be configured to match the application
• Programmable control and operand delivery
• Simple places to bolt on domain-specific hardware
– As instructions or memory clients

60

Page 55

Volta V100
21B transistors | TSMC 12nm FFN | 815 mm²
5,120 CUDA cores
7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS
125 Tensor TFLOPS
20 MB SM RF | 16 MB cache
32 GB HBM2 @ 900 GB/s
300 GB/s NVLink

Page 56

62

Tensor Core

D = AB + C on 4x4 tiles: A and B are FP16; C and D are FP16 or FP32.

$$
D =
\begin{pmatrix}
A_{0,0} & A_{0,1} & A_{0,2} & A_{0,3}\\
A_{1,0} & A_{1,1} & A_{1,2} & A_{1,3}\\
A_{2,0} & A_{2,1} & A_{2,2} & A_{2,3}\\
A_{3,0} & A_{3,1} & A_{3,2} & A_{3,3}
\end{pmatrix}
\begin{pmatrix}
B_{0,0} & B_{0,1} & B_{0,2} & B_{0,3}\\
B_{1,0} & B_{1,1} & B_{1,2} & B_{1,3}\\
B_{2,0} & B_{2,1} & B_{2,2} & B_{2,3}\\
B_{3,0} & B_{3,1} & B_{3,2} & B_{3,3}
\end{pmatrix}
+
\begin{pmatrix}
C_{0,0} & C_{0,1} & C_{0,2} & C_{0,3}\\
C_{1,0} & C_{1,1} & C_{1,2} & C_{1,3}\\
C_{2,0} & C_{2,1} & C_{2,2} & C_{2,3}\\
C_{3,0} & C_{3,1} & C_{3,2} & C_{3,3}
\end{pmatrix}
$$
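A numerical sketch of the primitive, assuming mixed precision as described (FP16 multiplicands, FP32 accumulation; the function name is ours, not an NVIDIA API):

```python
import numpy as np

# Sketch of the tensor-core primitive D = A @ B + C on 4x4 tiles:
# FP16 inputs, accumulation in FP32.
def mma_4x4(A, B, C):
    A16 = np.asarray(A, dtype=np.float16)
    B16 = np.asarray(B, dtype=np.float16)
    # Products of FP16 operands are summed in FP32.
    return (A16.astype(np.float32) @ B16.astype(np.float32)
            + np.asarray(C, dtype=np.float32))
```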

Page 57

Specialized Instructions Amortize Overhead

Operation   Ops    Energy**   Overhead* vs. op   Overhead* % of total
HFMA        2      1.5 pJ     20x                95%
HDP4A       8      6.0 pJ     5x                 83%
HMMA        128    130 pJ     0.23x              19%
IMMA        1024   230 pJ     0.13x              12%

*Overhead is instruction fetch, decode, and operand fetch: ~30 pJ
**Energy numbers from a 45nm process
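The table's overhead columns follow directly from the ~30 pJ per-instruction overhead, amortized over more math per instruction:

```python
# Reproduce the overhead columns: ~30pJ of fetch/decode/operand-fetch
# overhead per instruction, amortized over the math each one performs.
OVERHEAD_PJ = 30.0
MATH = {"HFMA": (2, 1.5), "HDP4A": (8, 6.0),      # (ops, math energy pJ)
        "HMMA": (128, 130.0), "IMMA": (1024, 230.0)}

vs_op = {op: OVERHEAD_PJ / e for op, (n, e) in MATH.items()}
pct_total = {op: OVERHEAD_PJ / (OVERHEAD_PJ + e) for op, (n, e) in MATH.items()}
```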

Page 58

(map force (pairs particles))

[Diagram: a program plus mapping directives feed a mapper and runtime, which handle GPU data and task placement; synthesis produces custom compute blocks (as instructions or memory clients) alongside SMs, configurable memory, and an efficient NoC]

Page 59

DSA Design is Programming, With a Hardware Cost Model

Algorithm Mapping

Page 60

66

Implementation Alternatives

[Figure: implementation alternatives, with GDDR6 and LPDDR4 memory parts pictured among others]

Page 61

Multi-Chip Modules (MCMs)

Page 62

Total Cost of Ownership

de-novo assembly of noisy long reads, 50x coverage

Page 63

76

Conclusion

Page 64

Summary
• Moore's Law is over, but we must continue scaling perf/W
• Accelerators are the future
– Specialization -> Efficiency
– Parallelism -> Speedup
– Co-design: the algorithm has to change
• Memory dominates:
– Minimize global memory access
– Minimize memory footprint: new algorithms, sparsity, compression
– Lots of small, fast on-chip memories
• GPUs as accelerator platforms
– GPUs: efficient memory, communication, and control
– Custom blocks: instructions or clients
• DSA design is programming, with a hardware cost model

Page 65