TRANSCRIPT
The Future of Computing: Domain-Specific Accelerators
MICRO, October 15, 2019
Bill Dally, Chief Scientist and SVP of Research, NVIDIA Corporation
Professor (Research), Stanford University
Faster Computing + Better Algorithms + More Data → Value
We need to continue delivering improved performance and perf/W
But Process Technology isn’t Helping us Anymore
Moore’s Law is Dead
John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e, 2018.
Accelerators can continue scaling perf and perf/W
Fast Accelerators since 1985
• Mossim Simulation Engine: Dally, W.J. and Bryant, R.E., 1985. A hardware architecture for switch-level simulation. IEEE Trans. CAD, 4(3), pp. 239-250.
• MARS Accelerator: Agrawal, P. and Dally, W.J., 1990. A hardware logic simulation system. IEEE Trans. CAD, 9(1), pp. 19-29.
• Reconfigurable Arithmetic Processor: Fiske, S. and Dally, W.J., 1988. The reconfigurable arithmetic processor. ISCA 1988.
• Imagine: Kapasi, U.J., Rixner, S., Dally, W.J., Khailany, B., Ahn, J.H., Mattson, P. and Owens, J.D., 2003. Programmable stream processors. Computer, 36(8), pp. 54-62.
• ELM: Dally, W.J., Balfour, J., Black-Schaffer, D., Chen, J., Harting, R.C., Parikh, V., Park, J. and Sheffield, D., 2008. Efficient embedded computing. Computer, 41(7).
• EIE: Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A. and Dally, W.J., 2016. EIE: Efficient inference engine on compressed deep neural network. ISCA 2016.
• SCNN: Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S.W. and Dally, W.J., 2017. SCNN: An accelerator for compressed-sparse convolutional neural networks. ISCA 2017.
• Darwin: Turakhia, Y., Bejerano, G. and Dally, W.J., 2018. Darwin: A genomics co-processor provides up to 15,000× acceleration on long read assembly. ASPLOS 2018.
• SATiN: Zhuo, Rucker, Wang, and Dally. Hardware for Boolean satisfiability inference. Under review.
Accelerators Employ:
• Massive parallelism (>1,000×, not 16×) with locality
• Special data types and operations: do in 1 cycle what normally takes 10s or 100s
• Optimized memory: high bandwidth (and low energy) for specific data structures and operations
• Reduced or amortized overhead
• Algorithm-architecture co-design
Specialized Hardware is Everywhere
• Cell phones
  – Software on ARM cores handles high-complexity, low-compute work
  – Accelerators do the heavy lifting: MODEMs, CODECs, camera image processing, DNNs, graphics
• GPUs
  – Rasterizer, texture filter, compositing, compression/decompression, tensor computations, BVH traversal, ray-triangle intersection
• Specialized hardware does most of the work, but is mostly invisible
Specialized Operations: Orders-of-Magnitude Efficiency, Moderate Speedup
Specialized Operations
Dynamic programming for gene sequence alignment (Smith-Waterman)
On 14nm CPU: 35 ALU ops, 15 load/store; 37 cycles; 81nJ
On 40nm special unit: 1 cycle (37x speedup); 3.1pJ (26,000x efficiency); 300fJ for logic (the remainder is memory)
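To see where the CPU's 35 ALU ops per cell go, here is a minimal C++ sketch of one Smith-Waterman cell update with affine gaps (the Gotoh recurrence); the scoring constants and struct layout are illustrative, not Darwin's exact datapath. A CPU spends a few dozen instructions plus loads and stores per cell; the specialized PE performs the whole update in one cycle.

```cpp
#include <algorithm>
#include <cstdint>

// Scores for one Smith-Waterman cell with affine gaps (Gotoh recurrence).
// H = best local-alignment score ending at (i,j); E/F = ongoing gap scores.
struct Cell { int32_t H, E, F; };

// Illustrative scoring constants (not Darwin's actual parameters).
constexpr int32_t MATCH = 2, MISMATCH = -1, GAP_OPEN = -3, GAP_EXTEND = -1;

// One cell update: a few dozen ALU ops plus the neighbor loads/stores on a
// CPU; a single cycle in the specialized PE.
inline Cell sw_cell(const Cell& diag, const Cell& up, const Cell& left,
                    char q, char r) {
    Cell c;
    c.E = std::max(left.E + GAP_EXTEND, left.H + GAP_OPEN);   // gap in reference
    c.F = std::max(up.F + GAP_EXTEND, up.H + GAP_OPEN);       // gap in query
    int32_t sub = diag.H + (q == r ? MATCH : MISMATCH);       // match/mismatch
    c.H = std::max({int32_t{0}, sub, c.E, c.F});              // local alignment
    return c;
}
```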
Why is a Specialized PE 26,000x More Efficient?
• OOO CPU instruction: 250pJ (99.99% overhead, ARM A-15)
• 16b integer add: 32fJ
[Figure: boxes drawn with area proportional to energy; all 28nm.]
Evangelos Vasilakis. 2015. An Instruction Level Energy Characterization of Arm Processors. Foundation of Research and Technology Hellas, Inst. of Computer Science, Tech. Rep. FORTH-ICS/TR-450 (2015)
Specialization -> Efficiency
Efficiency -> Parallelization
Parallelization -> Speedup
Specialized Operations
Dynamic programming for gene sequence alignment (Smith-Waterman)
Specialization -> 37x speedup, 26,000x efficiency (270,000x for logic)
Efficiency -> parallelism: 64 PE arrays x 64 PEs per array = 4,096x
Speedup = 37 (specialization) x 4,034 (parallelism) ≈ 150,000x total
Accelerator Design is Guided by Cost
Arithmetic is free (particularly low-precision)
Memory is expensive
Communication is prohibitively expensive
Need to Understand the Cost of Operations and Communication
Operation              Energy (pJ)   Area (µm²)
8b Add                 0.03          36
16b Add                0.05          67
32b Add                0.1           137
16b FP Add             0.4           1360
32b FP Add             0.9           4184
8b Mult                0.2           282
32b Mult               3.1           3495
16b FP Mult            1.1           1640
32b FP Mult            3.7           7700
32b SRAM Read (8KB)    5             N/A
32b DRAM Read          640           N/A

Energy numbers are from Mark Horowitz, "Computing's Energy Problem (and what we can do about it)", ISSCC 2014. Area numbers are synthesized results using Design Compiler under the TSMC 45nm tech node; FP units used the DesignWare library.
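As a hedged illustration of using this table as a cost model, the sketch below compares the energy of a hypothetical 1,024-element 32-bit dot product with operands streamed from DRAM versus an 8KB local SRAM; the per-operation energies are the 45nm figures above.

```cpp
#include <cstdio>

// Per-operation energy in pJ, from the Horowitz ISSCC 2014 table above (45nm).
constexpr double E_MULT32 = 3.1, E_ADD32 = 0.1;
constexpr double E_SRAM_RD32 = 5.0, E_DRAM_RD32 = 640.0;

int main() {
    const int n = 1024;                        // hypothetical dot-product length
    double e_math = n * (E_MULT32 + E_ADD32);  // one multiply-add per element
    double e_sram = e_math + 2.0 * n * E_SRAM_RD32;  // two operand reads/element
    double e_dram = e_math + 2.0 * n * E_DRAM_RD32;
    std::printf("arithmetic only:  %9.0f pJ\n", e_math);
    std::printf("operands in SRAM: %9.0f pJ (memory is %.0f%% of total)\n",
                e_sram, 100.0 * (e_sram - e_math) / e_sram);
    std::printf("operands in DRAM: %9.0f pJ (%.0fx the SRAM version)\n",
                e_dram, e_dram / e_sram);
    return 0;
}
```

Even with SRAM operands, memory is roughly three quarters of the energy; with DRAM operands the total grows by about 100x. That is the quantitative sense in which arithmetic is free and communication is prohibitively expensive.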
Communication is Expensive: Be Small, Be Local
• LPDDR DRAM (GBs): 640pJ/word
• On-chip SRAM (MBs): 50pJ/word
• Local SRAM (KBs): 5pJ/word
Scaling of Communication
[Chart: energy per operation (pJ) for a double-precision FMA vs. a chip-crossing wire, at 40nm and 10nm; arithmetic energy drops with process scaling while wire energy barely does. Keckler et al., IEEE Micro 2011.]
The Algorithm Often Has to Change to Avoid Being Global-Memory Limited
Algorithm-Architecture Co-Design for Darwin: Start with Graphmap
[Chart: time/read (ms, log scale) for the filtration and alignment stages. Step 1, Graphmap (software): ~10K seeds, ~440M hits in filtration; ~3 hits and ~1 hit annotate the alignment stage.]
Algorithm-Architecture Co-Design for Darwin: Replace Graphmap with Hardware-Friendly Algorithms
Speeds up filtering by 100x, but a 2.1x slowdown overall.
[Chart: step 1, Graphmap (software): ~10K seeds, ~440M hits; step 2, replace with D-SOFT (filtration) and GACT (alignment), in software: ~2K seeds, ~1M hits, ~1680 hits into alignment, ~1 hit out. Net: 2.1x slowdown.]
Algorithm-Hardware Co-Design for Darwin: Accelerate Alignment – 380x Speedup
[Chart: steps 1-2 as above; step 3, GACT hardware acceleration. Net: 380x speedup.]
Algorithm-Hardware Co-Design for Darwin: 4x Memory Parallelism – 3.9x Speedup
[Diagram: four DRAM channels, each with its own seed-position lookup (SPL) unit. Steps 1-3 as above; step 4, four DRAM channels for D-SOFT. Net: 3.9x speedup.]
Algorithm-Hardware Co-Design for Darwin: Specialized Memory for D-SOFT Bin Updates – 15.6x Speedup
[Diagram: the DRAM/SPL channels now feed update-bin logic (UBL) backed by on-chip bin-count SRAMs. Steps 1-4 as above; step 5, move bin updates in D-SOFT to SRAM (ASIC). Net: 15.6x speedup.]
Algorithm-Hardware Co-Design for Darwin: Pipeline D-SOFT and GACT – Now Completely D-SOFT Limited – 1.4x
Overall: 15,000x
[Pipeline diagram: the D-SOFT stage overlapped with 64 GACT arrays (GACT0-GACT63). Steps 1-5 as above; step 6, pipeline D-SOFT and GACT. Net: 1.4x speedup; 15,000x overall.]
Memory Dominates
Memory dominates power and area.
Algorithms must be memory-optimized:
• Minimize global memory accesses
• Keep the local memory footprint small
GACT Alignment
• 15M reads, 10K bases each, ~2K hits each
  – ~300T alignments to be done
  – Additional parallelism within each alignment
• But long reads have a large (10M) memory footprint
• Solution: GACT (tiling) – see the sketch below
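A rough C++ sketch of the tiling idea follows. The tile size and overlap are invented parameters and the per-tile aligner is stubbed out; the point is that processing the alignment in overlapping T×T tiles bounds the DP working set to O(T²) regardless of read length.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical tile parameters; the real GACT values differ.
constexpr std::size_t T = 512;       // tile edge: the DP matrix is T x T
constexpr std::size_t OVERLAP = 64;  // tiles overlap so traceback can be stitched

// Stub for the per-tile aligner: sweep a T x T Smith-Waterman matrix held
// entirely in local memory and report where the best path exits the tile.
struct TileResult { std::size_t q_end, r_end; };
static TileResult align_tile(const char*, const char*, std::size_t len) {
    return {len, len};  // placeholder: pretend the path reaches the corner
}

// Align a long read tile by tile. The DP footprint stays O(T^2) no matter
// how long the read is, which lets it live in small on-chip SRAM.
void gact_like_align(const std::string& query, const std::string& ref) {
    std::size_t qi = 0, ri = 0;
    while (qi < query.size() && ri < ref.size()) {
        std::size_t len = std::min({T, query.size() - qi, ref.size() - ri});
        TileResult t = align_tile(query.data() + qi, ref.data() + ri, len);
        if (t.q_end <= OVERLAP || t.r_end <= OVERLAP) break;  // no progress
        qi += t.q_end - OVERLAP;  // step forward, re-entering via the overlap
        ri += t.r_end - OVERLAP;
    }
}
```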
Darwin GACT Hardware
• 4K PEs: 64 PEs per array x 64 arrays
• ~50 operations per cycle per PE → 200K operations per cycle
• Specialized memory
• 150,000x speedup vs. CPU
On-Chip Memory Cost per Bit is 10-100x Commodity DRAM
And It’s Often Less Expensive
D-SOFT: Algorithm Overview
[Series of diagrams stepping through an example: query GTGCTTGGATATA against reference AGCTTTACCTACGTAGCTGCATCTATTTCTCGTATTTAGC. The reference is divided into bins (Bin 1-Bin 5), each tracking a bin count (bases) and a last hit offset, initialized to 0 / -inf. Each k-mer seed of the query is looked up in a pointer table and position table; every hit falls along a slope-1 diagonal into one bin, adding its non-overlapping bases to that bin's count and updating the last hit offset.]
D-SOFT Parameters:
• k: seed size
• N: number of seeds
• h: threshold on non-overlapping bases
• B: bin size (number of bases, fixed to 128)
(Example above: k=2, N=6, h=6)
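A hedged C++ sketch of the counting loop these slides imply; the index layout, stride handling, and constants are assumptions, not the paper's exact code. Every seed hit falls into a diagonal bin, which accumulates only non-overlapping bases until it crosses the threshold h.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Slide parameters: k = seed size, h = threshold on non-overlapping bases,
// B = bin size (128 bases). k and h values here are made up.
constexpr int64_t k = 12, B = 128, h = 25;

struct Bin { int64_t count = 0; int64_t last_hit = -1; };  // last hit offset

// 'SeedIndex' stands in for the pointer table + position table pair:
// seed value -> positions of that k-mer in the reference.
using SeedIndex = std::unordered_map<uint64_t, std::vector<int64_t>>;

// Return bins whose count of non-overlapping matched bases reaches h.
std::vector<int64_t> dsoft_filter(const SeedIndex& seed_index,
                                  const std::vector<uint64_t>& seeds,
                                  int64_t stride) {
    std::unordered_map<int64_t, Bin> bins;  // bin id -> (count, last hit)
    std::vector<int64_t> candidates;
    for (std::size_t s = 0; s < seeds.size(); ++s) {
        int64_t j = int64_t(s) * stride;    // query offset of this seed
        auto it = seed_index.find(seeds[s]);
        if (it == seed_index.end()) continue;
        for (int64_t pos : it->second) {
            int64_t id = (pos - j) / B;     // slope-1 diagonal band, B bases wide
            Bin& b = bins[id];
            // Add only the bases not covered by this bin's previous hit.
            int64_t overlap = b.last_hit < 0
                            ? 0 : std::max<int64_t>(0, b.last_hit + k - j);
            int64_t before = b.count;
            b.count += std::max<int64_t>(0, k - overlap);
            b.last_hit = j;
            if (before < h && b.count >= h)  // crossed the threshold
                candidates.push_back(id);
        }
    }
    return candidates;
}
```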
D-SOFT: Hardware Acceleration
[Block diagram: four DRAM channels feed seed-position lookup (SPL) units, which turn (seed, j) into candidate positions; a 16-endpoint butterfly network-on-chip routes (bin, j) updates through an arbiter (requesters R0-R31) to 16 bin-count SRAMs, each with update-bin logic (UBL) and a non-zero-bins SRAM.]
Cost has a Time Component

C = T(B1·N1 + B2·N2 + … + P)

Here T is runtime, Bi is the cost per bit of memory level i, Ni is its capacity in bits, and P is the remaining fixed cost.

                T      B1    N1    B2   N2     C
Darwin Filter   1      100   64M   1    128G   134G
All DRAM        15.6   –     –     1    128G   1,997G

On-chip SRAM costs ~100x more per bit, but the 15.6x shorter runtime makes the filter design ~15x cheaper overall.
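A quick check of the table against the formula (units are the slide's relative per-bit costs; P is taken as zero):

```cpp
#include <cstdio>

// C = T * (B1*N1 + B2*N2 + ... + P): runtime T times the summed cost of each
// memory (cost-per-bit Bi times capacity Ni), plus other fixed cost P (0 here).
int main() {
    const double G = 1e9, M = 1e6;
    // Darwin filter: 64M bits of SRAM at ~100x DRAM's per-bit cost + 128G DRAM.
    double c_darwin = 1.0 * (100 * 64 * M + 1 * 128 * G);  // = 134.4G
    // All-DRAM design: the same 128G bits of DRAM, but 15.6x the runtime.
    double c_dram = 15.6 * (1 * 128 * G);                  // = 1,996.8G
    std::printf("Darwin filter: %.1fG  all-DRAM: %.1fG  ratio: %.1fx\n",
                c_darwin / G, c_dram / G, c_dram / c_darwin);
    return 0;
}
```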
Hardware Enables Irregular, Compressed Data Structures
(reduces memory footprint)
Dynamic Sparse Activations, Static Sparse Weights

$$
\vec{a}^{\,T} = \begin{pmatrix} 0 & a_1 & 0 & a_3 \end{pmatrix}, \qquad
W = \begin{pmatrix}
w_{0,0} & w_{0,1} & 0 & w_{0,3} \\
0 & 0 & w_{1,2} & 0 \\
0 & w_{2,1} & 0 & w_{2,3} \\
0 & 0 & 0 & 0 \\
0 & 0 & w_{4,2} & w_{4,3} \\
w_{5,0} & 0 & 0 & 0 \\
0 & 0 & 0 & w_{6,3} \\
0 & w_{7,1} & 0 & 0
\end{pmatrix}
$$

$$
W\vec{a} = \begin{pmatrix} b_0 & b_1 & -b_2 & b_3 & -b_4 & b_5 & b_6 & -b_7 \end{pmatrix}^T
\xrightarrow{\;\mathrm{ReLU}\;}
\begin{pmatrix} b_0 & b_1 & 0 & b_3 & 0 & b_5 & b_6 & 0 \end{pmatrix}^T
$$

(Rows of W are interleaved across PE0-PE3: row i lives on PE i mod 4.)
Compressed layout (PE0's rows 0 and 4, stored column-major):
Virtual Weight:   w0,0  w0,1  w4,2  w0,3  w4,3
Relative Index:   0     1     2     0     0
Column Pointer:   0     1     2     3
Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A. and Dally, W.J., 2016, June. EIE: efficient inference engine on compressed deep neural network, ISCA 2016
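A hedged C++ sketch of this storage scheme (names and widths are illustrative; see the EIE paper for the exact encoding): weights sit in a CSC-like format with relative row indices, and the multiply skips zero activations entirely, exploiting both the static and the dynamic sparsity.

```cpp
#include <cstdint>
#include <vector>

// One PE's slice of the weight matrix in the compressed format above:
// virtual weights, relative row indices (gaps between stored entries in a
// column), and column pointers into the two arrays.
struct CompressedWeights {
    std::vector<float>   weight;     // "Virtual Weight" row
    std::vector<uint8_t> rel_index;  // "Relative Index" row
    std::vector<int>     col_ptr;    // "Column Pointer" row, size = cols + 1
};

// b += W * a, visiting only nonzero activations (dynamic sparsity) and only
// the stored weights (static sparsity) -- the source of EIE's efficiency.
void spmv(const CompressedWeights& W, const std::vector<float>& a,
          std::vector<float>& b) {
    for (int col = 0; col + 1 < int(W.col_ptr.size()); ++col) {
        if (a[col] == 0.0f) continue;         // skip zero activations
        int row = 0;
        for (int i = W.col_ptr[col]; i < W.col_ptr[col + 1]; ++i) {
            row += W.rel_index[i];            // decode relative row index
            b[row] += W.weight[i] * a[col];
            ++row;                            // advance past this entry
        }
    }
}
```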
Trained Quantization
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, arXiv 2015
EIE Hardware
[Diagram: a 4x4 array of PEs with central control.]
Simple Parallelism Often Beats a “Better” Algorithm
Broadcast literals to an associative memory of clauses: more comparisons, but lower latency.
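A software analogue of the idea, with a hypothetical data layout rather than SATiN's actual design: broadcasting a newly falsified literal to every clause does far more comparisons than watched-literal BCP, but each clause's check is independent, so hardware can do them all in one fixed-latency step.

```cpp
#include <cstddef>
#include <vector>

// Each clause tracks how many of its literals are not yet falsified; when the
// count hits one, the clause is unit and forces an implication (BCP).
struct Clause { std::vector<int> lits; int unfalsified; };

// Broadcast "this literal just became false" to every clause. In software this
// is a loop with many wasted comparisons; in the accelerator's associative
// memory every clause performs its comparison in the same cycle.
std::vector<std::size_t> broadcast_falsified(std::vector<Clause>& clauses,
                                             int falsified_lit) {
    std::vector<std::size_t> unit;  // clauses that now force an implication
    for (std::size_t i = 0; i < clauses.size(); ++i)   // hardware: all at once
        for (int l : clauses[i].lits)
            if (l == falsified_lit && --clauses[i].unfalsified == 1)
                unit.push_back(i);
    return unit;
}
```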
Need to Accelerate the Whole Problem
Implication (Boolean constraint propagation): 300-500x faster than the CPU.

Number of cycles per Boolean constraint propagation:

Benchmark                CPU     Accelerator
bench_7666.smt2.cnf      2105    3.73  (2.75 + 0.58 + 0.40)
bench_7222.smt2.cnf      1180    4.57  (3.47 + 0.60 + 0.50)
bench_8061.smt2.cnf      1357    4.83  (3.63 + 0.66 + 0.54)
bench_13535.smt2.cnf     923     3.10  (2.14 + 0.35 + 0.61)
qg01-08.cnf              1993    2.77  (2.07 + 0.26 + 0.44)
SAT_instance_N=33.cnf    775     1.32  (1.02 + 0.18 + 0.12)

(Accelerator cycles are shown as the sum of three per-stage components.)
Platforms for Acceleration
Implementation Alternatives
[Diagram: implementation alternatives with GDDR6 and LPDDR4 memory systems.]
GPUs Provide:
• A high-bandwidth, hierarchical memory system that can be configured to match the application
• Programmable control and operand delivery
• Simple places to bolt on domain-specific hardware, as instructions or as memory clients
Volta V100
21B transistors | TSMC 12nm FFN | 815mm²
5,120 CUDA cores | 7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS
20MB SM RF | 16MB cache | 32GB HBM2 @ 900 GB/s | 300 GB/s NVLink
Tensor Core
D = AB + C on 4×4 matrix tiles: A and B are FP16; the accumulator C and result D are FP16 or FP32.
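For reference, a scalar C++ statement of what one tensor-core matrix-multiply-accumulate computes (semantics only; the hardware executes the 64 fused multiply-adds in parallel, and plain float stands in for FP16/FP32 here):

```cpp
// D = A*B + C on 4x4 tiles: A, B are FP16 on the real hardware; the
// accumulator C/D is FP16 or FP32. Plain float stands in for both.
void mma_4x4(const float A[4][4], const float B[4][4],
             const float C[4][4], float D[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];             // start from the addend
            for (int k = 0; k < 4; ++k)
                acc += A[i][k] * B[k][j];    // 64 multiply-adds per 4x4 tile
            D[i][j] = acc;
        }
}
```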
Specialized Instructions Amortize Overhead

Operation   Ops    Energy**   Overhead* vs. op   % of total
HFMA        2      1.5pJ      20x                95%
HDP4A       8      6.0pJ      5x                 83%
HMMA        128    130pJ      0.23x              19%
IMMA        1024   230pJ      0.13x              12%

*Overhead is instruction fetch, decode, and operand fetch: 30pJ.
**Energy numbers are from a 45nm process.
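The percentages follow directly from the fixed ~30pJ per-instruction overhead; a small C++ check (energies copied from the table above):

```cpp
#include <cstdio>

int main() {
    const char* name[] = {"HFMA", "HDP4A", "HMMA", "IMMA"};
    const double ops[] = {2, 8, 128, 1024};          // math ops per instruction
    const double e_op[] = {1.5, 6.0, 130.0, 230.0};  // math energy, pJ
    const double OVERHEAD = 30.0;  // fetch, decode, operand fetch, pJ
    for (int i = 0; i < 4; ++i)
        std::printf("%-5s %5.0f ops  overhead %5.2fx of math, %4.1f%% of total\n",
                    name[i], ops[i], OVERHEAD / e_op[i],
                    100.0 * OVERHEAD / (OVERHEAD + e_op[i]));
    return 0;
}
```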
[Diagram: a program (fragment shown: “(map force (pairs particles)”) plus mapping directives feed a mapper & runtime, which handles GPU data and task placement and synthesis.]
Custom Compute Blocks (Instructions or Clients)
[Diagram: SMs alongside custom compute blocks, configurable memory, and an efficient NoC.]
DSA Design is Programming with a Hardware Cost Model: Algorithm Mapping
Implementation Alternatives
[Diagram: implementation alternatives, including GDDR6 and LPDDR4 memory systems and multi-chip modules (MCMs).]
Total Cost of Ownership
[Chart: de novo assembly of noisy long reads at 50x coverage.]
Conclusion
Summary
• Moore's Law is over, but we must continue scaling perf/W
• Accelerators are the future
  – Specialization -> efficiency
  – Parallelism -> speedup
  – Co-design: the algorithm has to change
• Memory dominates:
  – Minimize global memory access
  – Minimize memory footprint: new algorithms, sparsity, compression
  – Lots of small, fast on-chip memories
• GPUs as accelerator platforms
  – Efficient memory, communication, and control
  – Custom blocks as instructions or clients
• DSA design is programming, with a hardware cost model