specialized architectures: enabling performance scaling in

43
Specialized Architectures: Enabling Performance Scaling in a Post Moore Era David Donofrio Berkeley National Labs March 17, 2017 Supercomputing Frontiers

Upload: others

Post on 03-Jun-2022

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Specialized Architectures: Enabling Performance Scaling in

Specialized Architectures: Enabling

Performance Scaling in a Post Moore Era

David Donofrio

Berkeley National Labs

March 17, 2017

Supercomputing Frontiers

Page 2: Specialized Architectures: Enabling Performance Scaling in

Post Moore Technology Curve

3

Great opportunities exist for innovation through the end of Moore’s Law

2016 2016-2025 2025 Sp

ec

iali

za

tio

n

Now – 2025 Moore’s Law continues through

~5nm. Beyond which diminishing

returns are expected. Dark Silicon

will begin to encourage

specialization

End of Moore’s Law End of Moore’s Law requires a

different set of optimizations to

continue performance scaling.

Opportunities for additional

specialization, reconfigurable

computing, hardware / software co-

design, etc.

Post Moore Scaling New materials possibly introduced to

allow continued process and

performance scaling.

Throughout… Continued increases in parallelism and

heterogeneity will require advanced

runtimes, programming environments and

compiler optimizations in order to take full

advantage of these new architectures

Page 3: Specialized Architectures: Enabling Performance Scaling in

Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton

Smith

2020 2025 2030Year

Technology Scaling Trends P

erf

orm

an

ce

Transistors

Thread

Performanc

e

Clock Frequency

Power (watts)

# Cores Exascale

Happens in

2021-2023

And Then

What?

Page 4: Specialized Architectures: Enabling Performance Scaling in

Superconducting

Spintronics

Photonic

ICs

CMOS

Modeling /

Simulation

Approximate

Computing

Async

Reconfigurable

Computing

Paths Forward in Post-Moore

5

Specialized

Hardware

New

Devices

New Models of

Computation

y

x

z

3D Stacking

Adv. Packages

TFETs

MRAM

Specialized

Architectures

Neuromorphic

Big

Data

Dataflow

Exascale

General

Purpose

Carbon

Nanotubes

SoC

Dark

Silicon

10 Years

+10 Year Lead

Time

Revolutionary

Heterogeneous

HPC Architectures

First Line of Defense:

More Efficient

Architectures

Page 5: Specialized Architectures: Enabling Performance Scaling in

7

How do we best encourage

architecture diversity? Open ecosystems allow optimization at every

level

Page 6: Specialized Architectures: Enabling Performance Scaling in

Open Source Hardware

8

Driving the next wave of innovation

10

In all likelihood, Weddington concedes, the

resulting technology “will never be as good as what

is commercially available.” But perhaps it could

be made good enough “to bring the power and

ability to design your own IC, or microprocessor,

to smaller and smaller groups of people and drive

down the enormous capital requirements of an

entrenched, dinosaur industry.”

Similarly, Michael Cooney of Network World15

describes the state of open-source hardware today

as roughly where open-source software was during

the mid-1990s – waiting for commercial suppliers

to provide higher levels of support. “What made

open-source software acceptable for many

businesses was the arrival of support for it, such as

Red Hat,” he says, adding, “Something similar may

take place with the hardware.”

• Rapid growth in the adoption and number of open source software projects

• More than 95% of web servers run Linux variants, approximately 85%

of smartphones run Android variants

• Will open source hardware ignite the semiconductor industry?

Is RISC-V the hardware industry’s Linux?

The Rise of Open Source Software: Will Hardware Follow Suit?

The Economics of Open-Source Innovation

GSA 2016

Page 7: Specialized Architectures: Enabling Performance Scaling in

Why Open Source Hardware?

‣ Reducing the cost of development

• By creating and sharing open hardware (RISCV, OpenSoC)

‣ Closed source IP is a major drag on innovation in all technology spaces

• Don’t need to be a big company to play – lower barrier of entry

• Open nature enables customization at all levels of the design – not just at the periphery

• Final product can still be closed

‣ Lower Cost / Complexity for Development

• Share hardware AND software stack

- Compilers, debug, Linux ports

• Focus NRE and license on new/innovative IP blocks

• Stop squeezing license costs out of items that students can implement in a summer (license *hard* stuff instead)

9

Page 8: Specialized Architectures: Enabling Performance Scaling in

Chisel: A DSL for Hardware Design

‣ Increase the productivity of hardware designers

• Chisel raises the abstraction level of hardware design

• Introduces OOP techniques to hardware development

• Encourages code reuse

‣ Hardware Generators are a more efficient technique for generating designs

• Reduce waste in design

• Reduce design time

• Reduce risk

• Reduce cost 10

A productive, flexible language for hardware design and simulation

Page 9: Specialized Architectures: Enabling Performance Scaling in

Chisel Overview

11

‣ Not “Scala to Gates”

‣ Describe hardware

functionality

‣ All the abstractions

available of a

modern language

• Now applied to

hardware design

How does Chisel work?

Algebraic Graph Construction 16

Mux( x > y, x, y) > Mux

x

y

Algebraic Graph Construction 16

Mux( x > y, x, y) > Mux

x

y

Page 10: Specialized Architectures: Enabling Performance Scaling in

Chisel: A DSL for Hardware Design

12

Hardware Generation Infrastructure with Integrated Simulation

Chisel Design

Description

SW

Model Verilog

C++

Simulator

C++ Compiler

Chisel Compiler

FPGA

Emulation

FPGA Tools

ASIC Tools

Page 11: Specialized Architectures: Enabling Performance Scaling in

13

Chisel Design

Description

SW

Model Verilog

C++

Simulator

C++ Compiler

FPGA

Emulation

FPGA Tools

ASIC Tools

SystemC

Modules

Chisel Compiler New Intermediate

Representation

Full System

Simulators

SST /

GEM5

Chisel: A DSL for Hardware Design Hardware Generation Infrastructure with Integrated Simulation

Page 12: Specialized Architectures: Enabling Performance Scaling in

RISC V Based Processors

14

‣ Multiple flavors

• Out-of-order (BOOM)

• In-Order (Rocket)

• IoT (Z-Scale)

‣ Powerful features

• 32, 64, 128-bit addressing

• Double precision floating point

• Vector accelerators

‣ Complete SW stack available

‣ Highly configurable

• Only add the features you want!

Open source, chisel based processors based on a new ISA

Page 13: Specialized Architectures: Enabling Performance Scaling in

“Open Source” doesn’t mean

“Low Performance”

Category ARM Cortex A5 RISC-V Rocket

ISA 32-bit ARM v7 64-bit RISC-V v2

Architecture Single-Issue In-Order 8-stage

Single-Issue In-Order 5-stage

Performance 1.57 DMIPS/MHz 1.72 DMIPS/MHz

Process TSMC 40GPLUS TSMC 40GPLUS

Area w/o Caches 0.27 mm2 0.14 mm2

Area with 16K Caches

0.53 mm2 0.39 mm2

Area Efficiency 2.96 DMIPS/MHz/mm2 4.41 DMIPS/MHz/mm2

Frequency >1GHz >1GHz

Dynamic Power <0.080 mW/MHz 0.034 mW/MHz

Rocket Area Numbers Assuming 85% Utilization,

the same number ARM used to report area.

Plots are not to scale. 26

15

Sometimes you get more than paid for…

Y. Lee UCB @ Hotchips 2016

Page 14: Specialized Architectures: Enabling Performance Scaling in

16

“Open Source” doesn’t mean

“Low Performance” Sometimes you get more than paid for…

Y. Lee UCB @ Hotchips 2016

Page 15: Specialized Architectures: Enabling Performance Scaling in

Building a HPC System out of RISC-V?

17

‣ RISC-V Gaining significant momentum

• Large community of SW and HW developers

• Official ISA of India

‣ Many RISC-V based implementations in the wild

• Most not customer facing… yet

‣ Long term investment

Is it crazier than using ARM?

Page 16: Specialized Architectures: Enabling Performance Scaling in

OpenSoC Fabric

18

‣ Greater number of cores per chip driving the evolution of sophisticated on-chip networks

• Needed new tools and techniques to evaluate tradeoffs

‣ Chisel-based

• Allows high level of parameterization

- Dimensions, topology, VCs, etc. all configurable

• Fast, functional SW model for integration with other simulators

• Verilog model for FPGA and ASIC flows

‣ Multiple Network Interfaces

• Integrate with wide variety of existing processors

- Tensillica, RISC-V, ARM, etc.

AXIOpenSoC

FabricCPU(s)

HMC

AXI

AXI

CPU(s)

AXI CPU(s)

AX

I

CPU(s)

AXI

CPU(s)A

XI

AXI

10G

bE

PC

Ie

An Open-Source, Flexible, Parameterized, NoC

Generator

memct

l

memct

l

Memory DRAM

Memory DRAM

PC

Ie

FL

AS

H

ctl

IB o

r

Gig

E

IB o

r

Gig

E

Page 17: Specialized Architectures: Enabling Performance Scaling in

SC15 Demo: 96 Core SoC Design for HPC

19

‣ Z-Scale processors

connected in a

Concentrated Mesh

‣ 4 Z-scale processors

‣ 2x2 Concentrated mesh

with 2 virtual channels

‣ Micron HMC Memory

http://www.codexhpc.org/?p=367

FPGA 0 FPGA 1 FPGA 2

FPGA 5FPGA 4FPGA 3

Core Core

Core Core

Core DDR

Core Core

Core Core

DDR Core

Core Core

CoreOff-Chip

Core Core

Core Core

Core DDR

Core Core

Core Core

DDR Core

Core Core

CoreOff-Chip

Core Core

Core Core

Core DDR

Core Core

Core Core

DDR Core

Core Core

CoreOff-Chip

Core Core

Core Core

Core DDR

Core Core

Core Core

DDR Core

Core Core

CoreOff-Chip

Core Core

Core Core

Core DDR

Core Core

Core Core

DDR Core

Core Core

CoreOff-Chip

Core Core

Core Core

Core DDR

Core Core

Core Core

DDR Core

Core Core

CoreOff-Chip

2 people spent 2 months to

create

Page 18: Specialized Architectures: Enabling Performance Scaling in

96 Core System: Results

20

Comparing conventional cache coherence protocol To direct hardware support for global address space for halo exch.

0 100 200 300 400 500 600

Local Exchange

Remote Exchane

Cycles

Inter-Thread Latency

RISCV-SoC

x86

Page 19: Specialized Architectures: Enabling Performance Scaling in

If you thought 96 cores was cool…

21

‣ 1680 32-bit RISC-V

Cores

• 30 rows x 7 Col

‣ 26MB SRAM

‣ 300bit NoC

GRVI Phalanx Accelerator (Jan Gray)

Page 20: Specialized Architectures: Enabling Performance Scaling in

24

Page 21: Specialized Architectures: Enabling Performance Scaling in

Accelerating the Design Process

25

Creating a complete suite of tools for rapid processor and compiler generation

OpenSoC OpenSoC Fabric

AXIOpenSoC

FabricCPU(s)

HMC

AXI

AXI

CPU(s)

AXI CPU(s)

AX

I

CPU(s)

AXI

CPU(s)

AX

I

AXI

10G

bE

PC

Ie

Page 22: Specialized Architectures: Enabling Performance Scaling in

Accelerating the Design Process

26

Creating a complete suite of tools for rapid processor and compiler generation

OpenSoC Fabric

OpenSoC OpenSoC

AXIOpenSoC

FabricCPU(s)

HMC

AXI

AXI

CPU(s)

AXI CPU(s)

AX

I

CPU(s)

AXI

CPU(s)

AX

I

AXI

10G

bE

PC

Ie

Page 23: Specialized Architectures: Enabling Performance Scaling in

Accelerating the Design Process

27

Creating a complete suite of tools for rapid processor and compiler generation

OpenSoC Fabric

OpenSoC Compiler

OpenSoC OpenSoC

AXIOpenSoC

FabricCPU(s)

HMC

AXI

AXI

CPU(s)

AXI CPU(s)

AX

I

CPU(s)

AXI

CPU(s)

AX

I

AXI

10G

bE

PC

Ie

Page 24: Specialized Architectures: Enabling Performance Scaling in

Accelerating the Design Process

28

Creating a complete suite of tools for rapid processor and compiler generation

OpenSoC Fabric

OpenSoC Compiler

OpenSoC Cores

OpenSoC OpenSoC

AXIOpenSoC

FabricCPU(s)

HMC

AXI

AXI

CPU(s)

AXI CPU(s)

AX

I

CPU(s)

AXI

CPU(s)

AX

I

AXI

10G

bE

PC

Ie

Page 25: Specialized Architectures: Enabling Performance Scaling in

Accelerating the Design Process

29

Creating a complete suite of tools for rapid processor and compiler generation

OpenSoC Fabric

OpenSoC Compiler

OpenSoC Cores

OpenSoC System Architect

OpenSoC OpenSoC

AXIOpenSoC

FabricCPU(s)

HMC

AXI

AXI

CPU(s)

AXI CPU(s)

AX

I

CPU(s)

AXI

CPU(s)

AX

I

AXI

10G

bE

PC

Ie

Page 26: Specialized Architectures: Enabling Performance Scaling in

30

Instruction Set

Extensions Black Box

Modules

RISC-V ISA & Extensions

Chisel Modules

LLVM Compiler Backend

Full-Chip

Verilog

Chisel SoC

C++ Cycle-

Based

Simulator

Module

TestBenches

VLSI Tools

FPGA Tools

OpenSoC

System Architect A complete hardware and

software development toolkit

Page 27: Specialized Architectures: Enabling Performance Scaling in

31

Specialization and FPGAs Look

Compelling… What do we need to make these concepts

mainstream?

Page 28: Specialized Architectures: Enabling Performance Scaling in

Programmability

32

Remember 2004?

Buck, Owens, Riffel, Lefhon , et al.

Page 29: Specialized Architectures: Enabling Performance Scaling in

New FPGAs are very capable

33

‣ FPGA +Stacked

Memory SIP

• 1GHz Fabric

• Quad core A53

• >6 TFlops Single

Precision

• 16GB HBM on

package, 1TB/s BW

• Tb/s IO

Riding Moore’s Law to the end…

Page 30: Specialized Architectures: Enabling Performance Scaling in

Programmability

‣ Parallels with early days of

GPGPU computing

• VERY capable hardware

‣ New languages raising

abstraction levels

• OpenCL

• OpenACC

‣ But the tools are still terrible

34

FPGA ecosystem promoting development of new tools and abstractions to accelerate development

Page 31: Specialized Architectures: Enabling Performance Scaling in

Programmability

‣ Previous model:

• Translate your algorithm from Fortran/C/etc. to a

hardware description

• Highest performance gain, but most significant

development effort

‣ New model:

• Instantiate an array of specialized “soft” cores that can be

targeted much like a GPU

• Greater flexibility, simpler hardware development

• Enables acceleration of applications not typically

associated with FPGA computing

35

No need to translate your algorithm directly to hardware

Page 32: Specialized Architectures: Enabling Performance Scaling in

Creating an architecture per motif

36

Page 33: Specialized Architectures: Enabling Performance Scaling in

Catapult

‣ Microsoft Catapult (ISCA 2014)

• Bing search acceleration

• Lower Cost and Power

• Programmable

• Composed of customized soft cores

37

Real world deployment of FPGA accelerators

Page 34: Specialized Architectures: Enabling Performance Scaling in

A Potential Exascale Node

38

A notional representation using specialization

Network / Mem

DDR

CPU

Mem

Acc

1

Mem

Acc

2

Mem

Acc

N

Page 35: Specialized Architectures: Enabling Performance Scaling in

A Potential Exascale Node

39

A notional representation using specialization

Network / Mem

DDR

CPU

Mem

FPGA

1

Mem

FPGA

2

Mem

FPGA

N

Page 36: Specialized Architectures: Enabling Performance Scaling in

Advanced Manufacturing

‣ Can combine processing and memory in single

package with efficient interconnect

40

3D Stacking

Novati, EE Times

Page 37: Specialized Architectures: Enabling Performance Scaling in

41

Looking out to the edge… Applying these tools and techniques to other

scientific / computational areas

Page 38: Specialized Architectures: Enabling Performance Scaling in

On-detector processing

‣ The Problem:

• Future detectors threaten to overwhelm data transfer and computing capabilities w/ data rates exceeding 1 Tb/s

• Data processing experiment driven

‣ Proposed solution:

• Process the data before it leaves the sensor

• Application-tailored, programmable processing allows data reduction to occur on-sensor

• Programmability allows data reduction techniques to be tailored to the experiment – even after the sensor is built!

42

Putting our hardware design tool suite to work to augment existing HPC resources

0

10

20

30

40

50

60

2010 2011 2012 2013 2014 2015

Incre

ase

ov

er

20

10

Projected Rates

Sequencers

Detectors

Processors

Memory

Page 39: Specialized Architectures: Enabling Performance Scaling in

What’s Next?

‣ Need to continue to deliver increased performance scaling

‣ FPGAs emerging as the next computational element

• Moving away from monolithic implementations to specialized soft cores

‣ Can we get some better tools?

• HW design getting better, but not enough

• Tune Chisel compiler for FPGAs

‣ Already being deployed at scale

• Catapult

• AWS

- Bitfile marketplace

None of this matters without advanced software – programming models, runtimes, etc

43

Page 40: Specialized Architectures: Enabling Performance Scaling in

44

Thank you! [email protected]

Page 41: Specialized Architectures: Enabling Performance Scaling in

‣ Backup

45

Page 42: Specialized Architectures: Enabling Performance Scaling in

46

Future Elect ron Scat ter ing Detector

100,000 fps pixel detector

576 x 576 x 10 #m

Segmented silicon HAADF

Fabricate detectors Q4CY2016

Dedicated (donated) 400 Gbs link to NERSC

Link testing underway

Stream events to processors on Cori

Future goal: firmware processing (reduce data rate)

1985 1990 1995 2000 2005 2010 2015

107

108

109

1010

1011

1012

1013

1014

! "#$%# &' ()*

+# , %%-%# &' ()*

. "! # /# 012341356478&' ()*

. "! # 96:61:32;%8&' ()*

<=/96:61:32$. &' ()*

+>226?:@*/A>?(6(/BCADEF. "# /96:61:32=%%. &' ()*

96:61:32/' /# 012341356G?4:)@@):03? H6)2

9):)

D):6

IJ*:6

4'()*K

Future Electron Scattering Detector4 PB/day

Page 43: Specialized Architectures: Enabling Performance Scaling in

47

P5$5(R5$' 3(M(: +1%B ' 3;

TKKK TKKf TKJK TKJf TKTK TKTf

0' (' 1>' (/ 1[WI- \

'

z

2

JKi

JKi

JKi

JKh

Q? Y. (P' $' #$+&

JKJK JKJJ

dT OP(AQ? .

JKl

m53$I I P

JKJJ

: ' &/ m53$I I P

JKh

' >#8"4

JKJJ

P5$5(3' $3(5&' (3' #+263($+(B +2$83

P5$5(&5$' 3(5&' ;(Q+65/ ;(KNJ(Q- M3(7 TKJU;(Q- M3(7 TKTK;(JK(Q- M3

e/ - &"6! "X' 13

JKU

Y2(56B "$$' 61/ (45&+#8"51(S"' G4+"2$