energy-efficient architectures for exascale...

27
Dr. Stephen W. Keckler Senior Director of Architecture Research, NVIDIA ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMS

Upload: others

Post on 02-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

Dr. Stephen W. Keckler Senior Director of Architecture Research, NVIDIA

ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMS

Page 2: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

2

The Goal:

Sustained ExaFLOPS on Problems of Interest

...

at reasonable cost

Page 3: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

3 Source: C Moore, Data Processing in ExaScale-ClassComputer Systems, Salishan, April 2011

The End of Historic

Scaling

Page 4: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

4

2017

CORAL 150-300PF (5-10x)

11MW (1.1x) 14-27 GFLOPs/W (7-14x)

Lots of Threads

20PF 18,000 GPUs

10MW 2 GFLOPs/W ~107 Threads

You Are Here

1,000PF (50x) 72,000HCNs (4x)

20MW (2x) 50 GFLOPs/W (25x)

~1010 Threads (1000x) 2013

2023

Page 5: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

5

HETEROGENEOUS NODE System Interconnect

NoC

MC NIC

LOC

0

LOC

7

DRAM Stacks

TOC

1

TOC

2

TOC

3

TOC0

Lane

15

Lane

0

TPC0

L20 1MB

L21 1MB

L22 1MB

L23 1MB

TPC

127

NVL

ink

NVLink NoC

MC DRAM DIMMs LLC

NV RAM

Page 6: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

6

How do we get to 50GFlops/Watt?

Page 7: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

7

Start with an energy-efficient architecture

Page 8: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

8

CPU 130 pJ/flop (Vector SP)

Optimized for Latency Deep Cache Hierarchy

Haswell 22 nm

GPU 30 pJ/flop (SP) Optimized for Throughput

Explicit Management of On-chip Memory

Maxwell 28 nm

Page 9: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

9

CPU 2 nJ/flop (Scalar SP)

Optimized for Latency Deep Cache Hierarchy

Haswell 22 nm

GPU 30 pJ/flop (SP) Optimized for Throughput

Explicit Management of On-chip Memory

Maxwell 28 nm

Page 10: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

10

HOW IS POWER SPENT IN A CPU?

In-order Embedded OOO Hi-perf

Clock + Control Logic 24%

Data Supply 17%

Instruction Supply 42%

Register File 11%

ALU 6% Clock + Pins

45%

ALU 4%

Fetch 11%

Rename 10%

Issue 11%

RF 14%

Data Supply 5%

Dally [2008] (Embedded in-order CPU) Natarajan [2003] (Alpha 21264)

20pJ FP op è1nJ instruction

Page 11: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

11

Latency-Optimized Core (LOC)

PC

PC

Branch Predict

I$

Register Rename

ALU 1 ALU 2 ALU 4 ALU 3

Reorder Buffer

Instruction Window

Register File

ALU 1 ALU 2 ALU 4 ALU 3

I$

Select

Register File

PCs

Throughput-Optimized Core (TOC)

Page 12: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

12

How do we continue to scale energy efficiency

...in a world where technology scaling is diminished?

Page 13: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

13

Do Less Work

Eliminate waste and redundancy

Move fewer bits

Move data more efficiently

Page 14: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

14

DO LESS WORK Mixed Precision Arithmetic

1 11 52 double-precision

1 8 23 single-precision

1 5 10 half-precision 4x throughput

4x bandwidth 4x capacity

< ¼ energy/op

5x precision bits 60x range

Only use as much precision as you need Exploit mix of representations Scaled arithmetic

Page 15: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

15

ELIMINATE WASTE Temporal SIMT

32-wide datapath

time

1 cy

c

1 warp instruction = 32 threads

thread 0

thread 31

ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ml ml ml ml ad ad ad ad st st st st

ml ml ml ml ad ad ad ad st st st st

ml ml ml ml ad ad ad ad st st st st

ml ml ml ml ad ad ad ad st st st st

ml ml ml ml ad ad ad ad st st st st

ml ml ml ml ad ad ad ad st st st st

ml ml ml ml ad ad ad ad st st st st

ml ml ml ml ad ad ad ad st st st st

Spatial SIMT (current GPUs) 1-wide

time

1 cy

c

ld ld ld ld ld ld ld

0 (threads)

1 2 3 4 5 6 7 ld

ld ld

8 9

ld 10

Pure Temporal SIMT

Page 16: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

16

ELIMINATE WASTE Temporal SIMT

32-wide (41%)

4-wide (65%)

1-wide (100%)

Increase efficiency on divergent code

Page 17: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

17

ELIMINATE WASTE Variable Warp Sizing

Small warps

+ Improved perf for divergent code

+ Better SIMD utilization

Emulate wide warp HW

+ Wider converged execution

+ Memory locality/convergence

+ Reduced power (frontend)

scheduler

Time Gang schedule warps with same PC

Schedule several warps, different PCs scheduler

Rogers [ISCA 2015]

Page 18: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

18

ELIMINATE REDUNDANCY Scalarization

Lee [CGO 2013]

LD R2ç<A> LD R3çR2, 1 ADD R4çR3, 2

LD R2ç<A> LD R3çR2, 2 ADD R4çR3, 2

LD R2ç<A> LD R3çR2,3 ADD R4çR3, 2

LD R2ç<A> LD R3çR2, 4 ADD R4çR3, 2

scalar op vector load vector op

SIMT Execution

ADD R4çSR3, 2 ADD R4çSR3, 2 ADD R4çSR3, 2 ADD R4çSR3, 2

LD SR2ç<A> VLD SR3çSR2, 1

Scalarized SIMT Execution

Page 19: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

19

MOVE FEWER BITS

Small multi-ported register file

Capture locality of commonly used operands

Can reduce RF energy by 50%

Register File Cache (RFC)

S F U

M E M

T E X

Operand Routing

Operand Buffering

Main Register File 4x128-bit Banks (1R1W)

RFC 4x32-bit (3R1W) Banks

ALU

Gebhart [ISCA 2011]

Page 20: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

20

MOVE DATA MORE EFFICIENTLY Toggle-aware Compression

0x00003A00 0x8001D000 0x00003A01 0x8001D008 ...

4�bytes128Ͳbyte�Uncompressed�Cache�Line

4�bytes

8Ͳbyte�flit

0x00003A00 0x8001D000

0x00003A01 0x8001D008

XOR

Flit�0

Flit�1

=0000...00100...00100... #�Toggles�=�2

0x5�0x3A00�0x7�0x8001D000

128Ͳbyte�FPCͲcompressed�Cache�Line

8Ͳbyte�flit

5�3A00�7�80001D000�5�1D

XOR

Flit�0

Flit�1

=

001001111...110100011000 #�Toggles�=�31

0x5�0x3A01�0x7�0x8001D008 0x5�...

1�01�7�80001D008�5�3A02�1

Metadata

Compression can increase power consumption

Goal: reduce bus toggling

Pekhimenko [HPCA 2016]

Page 21: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

21

MINIMIZE DATA MOVEMENT

Reduces distance

Increases bandwidth

Offers opportunity to optimize signaling circuits

Packaging

High-bandwidth on-package memory

Page 22: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

22

MINIMIZE DATA MOVEMENT Heterogeneous DRAM Architectures

~200 GB/s

High BW Memory TOCs LOCs

~2TB/s ~200 GB/s

High Capacity Memory

~200 GB/s

High BW Memory TOCs

~2TB/s

High Capacity Memory

Challenges Exploiting all available bandwidth Maximizing locality for frequently accessed data

Page 23: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

23

MINIMIZE DATA MOVEMENT Software-managed Caching with On-Package Memory

0"

1"

2"

3"

4"

5"

6"

backp

rop(

bfs(

cns(

comd(

kmeans(

minife(

mumm

er(

needle(

pathfi

nder(

srad_v1(

xsbench(

geo:m

ean(

Throughp

ut(Rela=

ve(

to(No(Migra=o

n(

Legacy"CUDA" First8Touch"+"Range"Exp"+"BW"Balancing" ORACLE"

Strategies Aggressively migrate pages upon First-Touch to GDDR memory Pre-fetch neighbors of touched pages to reduce TLB shootdowns Throttle page migrations when nearing peak BW

Competitive with manual memory copy Close to “perfect” prefetch

Agarwal [HPCA 2015]

Page 24: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

24

MINIMIZE DATA MOVEMENT

Tag overhead: hundreds of MB

Alloy tag and data in same DRAM row (Micro12)

Cache organization: optimize for bandwidth

Direct mapped, consecutive sets in same row

Results

Fine-grained transfers good for lower locality apps

Can eliminate some page migration overheads

Hardware Managed DRAM Cache RO

W BU

FFER

Tag Set Index Offset

16-byte DRAM bus

29-bits 5-bits RO

W BU

FFER

1GB D

RAM

(512M lines)

tag*

(2-byte) data

(64-byte)

Only medium of accessing DRAM array

Page 25: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

25

LOOMING MEMORY POWER CRISIS

Column Power

120W

160W

Page 26: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop

26

SUMMARY

Do Less Work

Eliminate waste and redundancy

Move fewer bits

Move data more efficiently

Page 27: ENERGY-EFFICIENT ARCHITECTURES FOR EXASCALE SYSTEMSimages.nvidia.com/events/sc15/pdfs/SC_15_Keckler_distribute.pdf · Start with an energy-efficient architecture . 8 CPU 130 pJ/flop