TRANSCRIPT
The Future of Computing: Domain-Specific Accelerators
MICRO, October 15, 2019
Bill Dally, Chief Scientist and SVP of Research, NVIDIA Corporation
Professor (Research), Stanford University
Faster Computing + Better Algorithms + More Data → Value
We need to continue delivering improved performance and perf/W
But Process Technology isn’t Helping us Anymore
Moore’s Law is Dead
John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e, 2018.
Accelerators can continue scaling perf and perf/W
Fast Accelerators since 1985
• Mossim Simulation Engine: Dally, W.J. and Bryant, R.E., 1985. A hardware architecture for switch-level simulation. IEEE Trans. CAD, 4(3), pp. 239-250.
• MARS Accelerator: Agrawal, P. and Dally, W.J., 1990. A hardware logic simulation system. IEEE Trans. CAD, 9(1), pp. 19-29.
• Reconfigurable Arithmetic Processor: Fiske, S. and Dally, W.J., 1988. The reconfigurable arithmetic processor. ISCA 1988.
• Imagine: Kapasi, U.J., Rixner, S., Dally, W.J., Khailany, B., Ahn, J.H., Mattson, P. and Owens, J.D., 2003. Programmable stream processors. Computer, 36(8), pp. 54-62.
• ELM: Dally, W.J., Balfour, J., Black-Schaffer, D., Chen, J., Harting, R.C., Parikh, V., Park, J. and Sheffield, D., 2008. Efficient embedded computing. Computer, 41(7).
• EIE: Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A. and Dally, W.J., 2016. EIE: Efficient inference engine on compressed deep neural network. ISCA 2016.
• SCNN: Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S.W. and Dally, W.J., 2017. SCNN: An accelerator for compressed-sparse convolutional neural networks. ISCA 2017.
• Darwin: Turakhia, Y., Bejerano, G. and Dally, W.J., 2018. Darwin: A genomics co-processor provides up to 15,000× acceleration on long read assembly. ASPLOS 2018.
• SATiN: Zhuo, Rucker, Wang, and Dally. Hardware for Boolean satisfiability inference. Under review.
Accelerators Employ:
• Massive parallelism (>1,000×, not 16×) with locality
• Special data types and operations: do in 1 cycle what normally takes 10s or 100s
• Optimized memory: high bandwidth (and low energy) for specific data structures and operations
• Reduced or amortized overhead
• Algorithm-architecture co-design
Specialized Hardware is Everywhere
• Cell phones
  – Software on ARM cores handles high-complexity, low-compute work
  – Accelerators do the heavy lifting: MODEMs, CODECs, camera image processing, DNNs, graphics
• GPUs
  – Rasterizer, texture filter, compositing, compression/decompression, tensor computations, BVH traversal, ray-triangle intersection
• Specialized hardware does most of the work, but is mostly invisible
Specialized Operations: Orders-of-Magnitude Efficiency, Moderate Speedup
Specialized Operations
Dynamic programming for gene sequence alignment (Smith-Waterman)
On 14nm CPU: 35 ALU ops, 15 load/store; 37 cycles; 81nJ
On 40nm special unit: 1 cycle (37x speedup); 3.1pJ (26,000x efficiency); 300fJ for logic (the remainder is memory)
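To see where the CPU's 35 ALU ops per cell go, here is a minimal C++ sketch of one Smith-Waterman cell update with affine gaps (the Gotoh recurrence); the scoring constants and struct layout are illustrative, not Darwin's exact datapath. A CPU spends a few dozen instructions plus loads and stores per cell; the specialized PE performs the whole update in one cycle.

```cpp
#include <algorithm>
#include <cstdint>

// Scores for one Smith-Waterman cell with affine gaps (Gotoh recurrence).
// H = best local-alignment score ending at (i,j); E/F = ongoing gap scores.
struct Cell { int32_t H, E, F; };

// Illustrative scoring constants (not Darwin's actual parameters).
constexpr int32_t MATCH = 2, MISMATCH = -1, GAP_OPEN = -3, GAP_EXTEND = -1;

// One cell update: a few dozen ALU ops plus the neighbor loads/stores on a
// CPU; a single cycle in the specialized PE.
inline Cell sw_cell(const Cell& diag, const Cell& up, const Cell& left,
                    char q, char r) {
    Cell c;
    c.E = std::max(left.E + GAP_EXTEND, left.H + GAP_OPEN);   // gap in reference
    c.F = std::max(up.F + GAP_EXTEND, up.H + GAP_OPEN);       // gap in query
    int32_t sub = diag.H + (q == r ? MATCH : MISMATCH);       // match/mismatch
    c.H = std::max({int32_t{0}, sub, c.E, c.F});              // local alignment
    return c;
}
```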
Why is a Specialized PE 26,000x More Efficient?
• OOO CPU instruction: 250pJ (99.99% overhead, ARM A-15)
• 16b integer add: 32fJ
[Figure: boxes drawn with area proportional to energy; all 28nm.]
Evangelos Vasilakis. 2015. An Instruction Level Energy Characterization of Arm Processors. Foundation of Research and Technology Hellas, Inst. of Computer Science, Tech. Rep. FORTH-ICS/TR-450 (2015)
Specialization -> Efficiency
Efficiency -> Parallelization
Parallelization -> Speedup
Specialized Operations
Dynamic programming for gene sequence alignment (Smith-Waterman)
Specialization -> 37x speedup, 26,000x efficiency (270,000x for logic)
Efficiency -> parallelism: 64 PE arrays x 64 PEs per array = 4,096x
Speedup = 37 (specialization) x 4,034 (parallelism) ≈ 150,000x total
Accelerator Design is Guided by Cost
Arithmetic is free (particularly low-precision)
Memory is expensive
Communication is prohibitively expensive
Need to Understand the Cost of Operations and Communication
Operation              Energy (pJ)   Area (µm²)
8b Add                 0.03          36
16b Add                0.05          67
32b Add                0.1           137
16b FP Add             0.4           1360
32b FP Add             0.9           4184
8b Mult                0.2           282
32b Mult               3.1           3495
16b FP Mult            1.1           1640
32b FP Mult            3.7           7700
32b SRAM Read (8KB)    5             N/A
32b DRAM Read          640           N/A

Energy numbers are from Mark Horowitz, "Computing's Energy Problem (and what we can do about it)", ISSCC 2014. Area numbers are synthesized results using Design Compiler under the TSMC 45nm tech node; FP units used the DesignWare library.
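As a hedged illustration of using this table as a cost model, the sketch below compares the energy of a hypothetical 1,024-element 32-bit dot product with operands streamed from DRAM versus an 8KB local SRAM; the per-operation energies are the 45nm figures above.

```cpp
#include <cstdio>

// Per-operation energy in pJ, from the Horowitz ISSCC 2014 table above (45nm).
constexpr double E_MULT32 = 3.1, E_ADD32 = 0.1;
constexpr double E_SRAM_RD32 = 5.0, E_DRAM_RD32 = 640.0;

int main() {
    const int n = 1024;                        // hypothetical dot-product length
    double e_math = n * (E_MULT32 + E_ADD32);  // one multiply-add per element
    double e_sram = e_math + 2.0 * n * E_SRAM_RD32;  // two operand reads/element
    double e_dram = e_math + 2.0 * n * E_DRAM_RD32;
    std::printf("arithmetic only:  %9.0f pJ\n", e_math);
    std::printf("operands in SRAM: %9.0f pJ (memory is %.0f%% of total)\n",
                e_sram, 100.0 * (e_sram - e_math) / e_sram);
    std::printf("operands in DRAM: %9.0f pJ (%.0fx the SRAM version)\n",
                e_dram, e_dram / e_sram);
    return 0;
}
```

Even with SRAM operands, memory is roughly three quarters of the energy; with DRAM operands the total grows by about 100x. That is the quantitative sense in which arithmetic is free and communication is prohibitively expensive.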
Communication is Expensive: Be Small, Be Local
• LPDDR DRAM (GBs): 640pJ/word
• On-chip SRAM (MBs): 50pJ/word
• Local SRAM (KBs): 5pJ/word
Scaling of Communication
[Chart: energy per operation (pJ) for a double-precision FMA vs. a chip-crossing wire, at 40nm and 10nm; arithmetic energy drops with process scaling while wire energy barely does. Keckler et al., IEEE Micro 2011.]
The Algorithm Often Has to Change to Avoid Being Global-Memory Limited
Algorithm-Architecture Co-Design for Darwin: Start with Graphmap
[Chart: time/read (ms, log scale) for the filtration and alignment stages. Step 1, Graphmap (software): ~10K seeds, ~440M hits in filtration; ~3 hits and ~1 hit annotate the alignment stage.]
Algorithm-Architecture Co-Design for Darwin: Replace Graphmap with Hardware-Friendly Algorithms
Speeds up filtering by 100x, but a 2.1x slowdown overall.
[Chart: step 1, Graphmap (software): ~10K seeds, ~440M hits; step 2, replace with D-SOFT (filtration) and GACT (alignment), in software: ~2K seeds, ~1M hits, ~1680 hits into alignment, ~1 hit out. Net: 2.1x slowdown.]
Algorithm-Hardware Co-Design for Darwin: Accelerate Alignment – 380x Speedup
[Chart: steps 1-2 as above; step 3, GACT hardware acceleration. Net: 380x speedup.]
Algorithm-Hardware Co-Design for Darwin: 4x Memory Parallelism – 3.9x Speedup
[Diagram: four DRAM channels, each with its own seed-position lookup (SPL) unit. Steps 1-3 as above; step 4, four DRAM channels for D-SOFT. Net: 3.9x speedup.]
Algorithm-Hardware Co-Design for Darwin: Specialized Memory for D-SOFT Bin Updates – 15.6x Speedup
[Diagram: the DRAM/SPL channels now feed update-bin logic (UBL) backed by on-chip bin-count SRAMs. Steps 1-4 as above; step 5, move bin updates in D-SOFT to SRAM (ASIC). Net: 15.6x speedup.]
Algorithm-Hardware Co-Design for Darwin: Pipeline D-SOFT and GACT – Now Completely D-SOFT Limited – 1.4x
Overall: 15,000x
[Pipeline diagram: the D-SOFT stage overlapped with 64 GACT arrays (GACT0-GACT63). Steps 1-5 as above; step 6, pipeline D-SOFT and GACT. Net: 1.4x speedup; 15,000x overall.]
Memory Dominates
Memory dominates power and area.
Algorithms must be memory-optimized:
• Minimize global memory accesses
• Keep the local memory footprint small
GACT Alignment
• 15M reads, 10K bases each, ~2K hits each
  – ~300T alignments to be done
  – Additional parallelism within each alignment
• But long reads have a large (10M) memory footprint
• Solution: GACT (tiling) – see the sketch below
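A rough C++ sketch of the tiling idea follows. The tile size and overlap are invented parameters and the per-tile aligner is stubbed out; the point is that processing the alignment in overlapping T×T tiles bounds the DP working set to O(T²) regardless of read length.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical tile parameters; the real GACT values differ.
constexpr std::size_t T = 512;       // tile edge: the DP matrix is T x T
constexpr std::size_t OVERLAP = 64;  // tiles overlap so traceback can be stitched

// Stub for the per-tile aligner: sweep a T x T Smith-Waterman matrix held
// entirely in local memory and report where the best path exits the tile.
struct TileResult { std::size_t q_end, r_end; };
static TileResult align_tile(const char*, const char*, std::size_t len) {
    return {len, len};  // placeholder: pretend the path reaches the corner
}

// Align a long read tile by tile. The DP footprint stays O(T^2) no matter
// how long the read is, which lets it live in small on-chip SRAM.
void gact_like_align(const std::string& query, const std::string& ref) {
    std::size_t qi = 0, ri = 0;
    while (qi < query.size() && ri < ref.size()) {
        std::size_t len = std::min({T, query.size() - qi, ref.size() - ri});
        TileResult t = align_tile(query.data() + qi, ref.data() + ri, len);
        if (t.q_end <= OVERLAP || t.r_end <= OVERLAP) break;  // no progress
        qi += t.q_end - OVERLAP;  // step forward, re-entering via the overlap
        ri += t.r_end - OVERLAP;
    }
}
```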
Darwin GACT Hardware
• 4K PEs: 64 PEs per array x 64 arrays
• ~50 operations per cycle per PE → 200K operations per cycle
• Specialized memory
• 150,000x speedup vs. CPU
On-Chip Memory Cost per Bit is 10-100x Commodity DRAM
And It’s Often Less Expensive
D-SOFT: Algorithm Overview
[Series of diagrams stepping through an example: query GTGCTTGGATATA against reference AGCTTTACCTACGTAGCTGCATCTATTTCTCGTATTTAGC. The reference is divided into bins (Bin 1-Bin 5), each tracking a bin count (bases) and a last hit offset, initialized to 0 / -inf. Each k-mer seed of the query is looked up in a pointer table and position table; every hit falls along a slope-1 diagonal into one bin, adding its non-overlapping bases to that bin's count and updating the last hit offset.]
D-SOFT Parameters:
• k: seed size
• N: number of seeds
• h: threshold on non-overlapping bases
• B: bin size (number of bases, fixed to 128)
(Example above: k=2, N=6, h=6)
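A hedged C++ sketch of the counting loop these slides imply; the index layout, stride handling, and constants are assumptions, not the paper's exact code. Every seed hit falls into a diagonal bin, which accumulates only non-overlapping bases until it crosses the threshold h.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Slide parameters: k = seed size, h = threshold on non-overlapping bases,
// B = bin size (128 bases). k and h values here are made up.
constexpr int64_t k = 12, B = 128, h = 25;

struct Bin { int64_t count = 0; int64_t last_hit = -1; };  // last hit offset

// 'SeedIndex' stands in for the pointer table + position table pair:
// seed value -> positions of that k-mer in the reference.
using SeedIndex = std::unordered_map<uint64_t, std::vector<int64_t>>;

// Return bins whose count of non-overlapping matched bases reaches h.
std::vector<int64_t> dsoft_filter(const SeedIndex& seed_index,
                                  const std::vector<uint64_t>& seeds,
                                  int64_t stride) {
    std::unordered_map<int64_t, Bin> bins;  // bin id -> (count, last hit)
    std::vector<int64_t> candidates;
    for (std::size_t s = 0; s < seeds.size(); ++s) {
        int64_t j = int64_t(s) * stride;    // query offset of this seed
        auto it = seed_index.find(seeds[s]);
        if (it == seed_index.end()) continue;
        for (int64_t pos : it->second) {
            int64_t id = (pos - j) / B;     // slope-1 diagonal band, B bases wide
            Bin& b = bins[id];
            // Add only the bases not covered by this bin's previous hit.
            int64_t overlap = b.last_hit < 0
                            ? 0 : std::max<int64_t>(0, b.last_hit + k - j);
            int64_t before = b.count;
            b.count += std::max<int64_t>(0, k - overlap);
            b.last_hit = j;
            if (before < h && b.count >= h)  // crossed the threshold
                candidates.push_back(id);
        }
    }
    return candidates;
}
```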
D-SOFT: Hardware Acceleration
[Block diagram: four DRAM channels feed seed-position lookup (SPL) units, which turn (seed, j) into candidate positions; a 16-endpoint butterfly network-on-chip routes (bin, j) updates through an arbiter (requesters R0-R31) to 16 bin-count SRAMs, each with update-bin logic (UBL) and a non-zero-bins SRAM.]
Cost has a Time Component

C = T(B1·N1 + B2·N2 + … + P)

Here T is runtime, Bi is the cost per bit of memory level i, Ni is its capacity in bits, and P is the remaining fixed cost.

                T      B1    N1    B2   N2     C
Darwin Filter   1      100   64M   1    128G   134G
All DRAM        15.6   –     –     1    128G   1,997G

On-chip SRAM costs ~100x more per bit, but the 15.6x shorter runtime makes the filter design ~15x cheaper overall.
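A quick check of the table against the formula (units are the slide's relative per-bit costs; P is taken as zero):

```cpp
#include <cstdio>

// C = T * (B1*N1 + B2*N2 + ... + P): runtime T times the summed cost of each
// memory (cost-per-bit Bi times capacity Ni), plus other fixed cost P (0 here).
int main() {
    const double G = 1e9, M = 1e6;
    // Darwin filter: 64M bits of SRAM at ~100x DRAM's per-bit cost + 128G DRAM.
    double c_darwin = 1.0 * (100 * 64 * M + 1 * 128 * G);  // = 134.4G
    // All-DRAM design: the same 128G bits of DRAM, but 15.6x the runtime.
    double c_dram = 15.6 * (1 * 128 * G);                  // = 1,996.8G
    std::printf("Darwin filter: %.1fG  all-DRAM: %.1fG  ratio: %.1fx\n",
                c_darwin / G, c_dram / G, c_dram / c_darwin);
    return 0;
}
```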
Hardware Enables Irregular, Compressed Data Structures
(reduces memory footprint)
Dynamic Sparse Activations, Static Sparse Weights

$$
\vec{a}^{\,T} = \begin{pmatrix} 0 & a_1 & 0 & a_3 \end{pmatrix}, \qquad
W = \begin{pmatrix}
w_{0,0} & w_{0,1} & 0 & w_{0,3} \\
0 & 0 & w_{1,2} & 0 \\
0 & w_{2,1} & 0 & w_{2,3} \\
0 & 0 & 0 & 0 \\
0 & 0 & w_{4,2} & w_{4,3} \\
w_{5,0} & 0 & 0 & 0 \\
0 & 0 & 0 & w_{6,3} \\
0 & w_{7,1} & 0 & 0
\end{pmatrix}
$$

$$
W\vec{a} = \begin{pmatrix} b_0 & b_1 & -b_2 & b_3 & -b_4 & b_5 & b_6 & -b_7 \end{pmatrix}^T
\xrightarrow{\;\mathrm{ReLU}\;}
\begin{pmatrix} b_0 & b_1 & 0 & b_3 & 0 & b_5 & b_6 & 0 \end{pmatrix}^T
$$

(Rows of W are interleaved across PE0-PE3: row i lives on PE i mod 4.)
Compressed layout (PE0's rows 0 and 4, stored column-major):
Virtual Weight:   w0,0  w0,1  w4,2  w0,3  w4,3
Relative Index:   0     1     2     0     0
Column Pointer:   0     1     2     3
Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A. and Dally, W.J., 2016, June. EIE: efficient inference engine on compressed deep neural network, ISCA 2016
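A hedged C++ sketch of this storage scheme (names and widths are illustrative; see the EIE paper for the exact encoding): weights sit in a CSC-like format with relative row indices, and the multiply skips zero activations entirely, exploiting both the static and the dynamic sparsity.

```cpp
#include <cstdint>
#include <vector>

// One PE's slice of the weight matrix in the compressed format above:
// virtual weights, relative row indices (gaps between stored entries in a
// column), and column pointers into the two arrays.
struct CompressedWeights {
    std::vector<float>   weight;     // "Virtual Weight" row
    std::vector<uint8_t> rel_index;  // "Relative Index" row
    std::vector<int>     col_ptr;    // "Column Pointer" row, size = cols + 1
};

// b += W * a, visiting only nonzero activations (dynamic sparsity) and only
// the stored weights (static sparsity) -- the source of EIE's efficiency.
void spmv(const CompressedWeights& W, const std::vector<float>& a,
          std::vector<float>& b) {
    for (int col = 0; col + 1 < int(W.col_ptr.size()); ++col) {
        if (a[col] == 0.0f) continue;         // skip zero activations
        int row = 0;
        for (int i = W.col_ptr[col]; i < W.col_ptr[col + 1]; ++i) {
            row += W.rel_index[i];            // decode relative row index
            b[row] += W.weight[i] * a[col];
            ++row;                            // advance past this entry
        }
    }
}
```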
Trained Quantization
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, arXiv 2015
EIE Hardware
[Diagram: a 4x4 array of PEs with central control.]
Simple Parallelism Often Beats a “Better” Algorithm
Broadcast literals to an associative memory of clauses: more comparisons, but lower latency.
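A software analogue of the idea, with a hypothetical data layout rather than SATiN's actual design: broadcasting a newly falsified literal to every clause does far more comparisons than watched-literal BCP, but each clause's check is independent, so hardware can do them all in one fixed-latency step.

```cpp
#include <cstddef>
#include <vector>

// Each clause tracks how many of its literals are not yet falsified; when the
// count hits one, the clause is unit and forces an implication (BCP).
struct Clause { std::vector<int> lits; int unfalsified; };

// Broadcast "this literal just became false" to every clause. In software this
// is a loop with many wasted comparisons; in the accelerator's associative
// memory every clause performs its comparison in the same cycle.
std::vector<std::size_t> broadcast_falsified(std::vector<Clause>& clauses,
                                             int falsified_lit) {
    std::vector<std::size_t> unit;  // clauses that now force an implication
    for (std::size_t i = 0; i < clauses.size(); ++i)   // hardware: all at once
        for (int l : clauses[i].lits)
            if (l == falsified_lit && --clauses[i].unfalsified == 1)
                unit.push_back(i);
    return unit;
}
```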
Need to Accelerate the Whole Problem
Implication (Boolean constraint propagation): 300-500x faster than the CPU.

Number of cycles per Boolean constraint propagation:

Benchmark                CPU     Accelerator
bench_7666.smt2.cnf      2105    3.73  (2.75 + 0.58 + 0.40)
bench_7222.smt2.cnf      1180    4.57  (3.47 + 0.60 + 0.50)
bench_8061.smt2.cnf      1357    4.83  (3.63 + 0.66 + 0.54)
bench_13535.smt2.cnf     923     3.10  (2.14 + 0.35 + 0.61)
qg01-08.cnf              1993    2.77  (2.07 + 0.26 + 0.44)
SAT_instance_N=33.cnf    775     1.32  (1.02 + 0.18 + 0.12)

(Accelerator cycles are shown as the sum of three per-stage components.)
Platforms for Acceleration
Implementation Alternatives
[Diagram: implementation alternatives with GDDR6 and LPDDR4 memory systems.]
GPUs Provide:
• A high-bandwidth, hierarchical memory system that can be configured to match the application
• Programmable control and operand delivery
• Simple places to bolt on domain-specific hardware, as instructions or as memory clients
Volta V100
21B transistors | TSMC 12nm FFN | 815mm²
5,120 CUDA cores | 7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS
20MB SM RF | 16MB cache | 32GB HBM2 @ 900 GB/s | 300 GB/s NVLink
Tensor Core
D = AB + C on 4×4 matrix tiles: A and B are FP16; the accumulator C and result D are FP16 or FP32.
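For reference, a scalar C++ statement of what one tensor-core matrix-multiply-accumulate computes (semantics only; the hardware executes the 64 fused multiply-adds in parallel, and plain float stands in for FP16/FP32 here):

```cpp
// D = A*B + C on 4x4 tiles: A, B are FP16 on the real hardware; the
// accumulator C/D is FP16 or FP32. Plain float stands in for both.
void mma_4x4(const float A[4][4], const float B[4][4],
             const float C[4][4], float D[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];             // start from the addend
            for (int k = 0; k < 4; ++k)
                acc += A[i][k] * B[k][j];    // 64 multiply-adds per 4x4 tile
            D[i][j] = acc;
        }
}
```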
Specialized Instructions Amortize Overhead

Operation   Ops    Energy**   Overhead* vs. op   % of total
HFMA        2      1.5pJ      20x                95%
HDP4A       8      6.0pJ      5x                 83%
HMMA        128    130pJ      0.23x              19%
IMMA        1024   230pJ      0.13x              12%

*Overhead is instruction fetch, decode, and operand fetch: 30pJ.
**Energy numbers are from a 45nm process.
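The percentages follow directly from the fixed ~30pJ per-instruction overhead; a small C++ check (energies copied from the table above):

```cpp
#include <cstdio>

int main() {
    const char* name[] = {"HFMA", "HDP4A", "HMMA", "IMMA"};
    const double ops[] = {2, 8, 128, 1024};          // math ops per instruction
    const double e_op[] = {1.5, 6.0, 130.0, 230.0};  // math energy, pJ
    const double OVERHEAD = 30.0;  // fetch, decode, operand fetch, pJ
    for (int i = 0; i < 4; ++i)
        std::printf("%-5s %5.0f ops  overhead %5.2fx of math, %4.1f%% of total\n",
                    name[i], ops[i], OVERHEAD / e_op[i],
                    100.0 * OVERHEAD / (OVERHEAD + e_op[i]));
    return 0;
}
```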
[Diagram: a program (fragment shown: “(map force (pairs particles)”) plus mapping directives feed a mapper & runtime, which handles GPU data and task placement and synthesis.]
Custom Compute Blocks (Instructions or Clients)
[Diagram: SMs alongside custom compute blocks, configurable memory, and an efficient NoC.]
DSA Design is Programming with a Hardware Cost Model: Algorithm Mapping
Implementation Alternatives
[Diagram: implementation alternatives, including GDDR6 and LPDDR4 memory systems and multi-chip modules (MCMs).]
Total Cost of Ownership
[Chart: de novo assembly of noisy long reads at 50x coverage.]
Conclusion
Summary
• Moore's Law is over, but we must continue scaling perf/W
• Accelerators are the future
  – Specialization -> efficiency
  – Parallelism -> speedup
  – Co-design: the algorithm has to change
• Memory dominates:
  – Minimize global memory access
  – Minimize memory footprint: new algorithms, sparsity, compression
  – Lots of small, fast on-chip memories
• GPUs as accelerator platforms
  – Efficient memory, communication, and control
  – Custom blocks as instructions or clients
• DSA design is programming, with a hardware cost model