
Seismic Shift: What’s next in the Post-Moore, Post-ISA era?

Prof. Margaret Martonosi
H. T. Adams ‘35 Prof. of Computer Science
Princeton University

Not News: End of decades of Moore’s Law scaling
Less talked about: Shift to heterogeneity & parallelism has broken how software scales in performance & cost.

[Figure: Microprocessor Trend Data, 1970 to 2020 (x-axis: Year). Series: Transistors (thousands), Single-Thread Performance (SpecINT x 10^3), Frequency (MHz), Typical Power (Watts), Number of Logical Cores.]

Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.

1964: While Moore’s Law was being sketched, “Computer Architecture” was coined…

= The value of HW/SW abstraction…

[Figure: first page of G. M. Amdahl, G. A. Blaauw, and F. P. Brooks, Jr., "Architecture of the IBM System/360," IBM Journal, April 1964.]

Highlighted footnote from the paper: "The term architecture is used here to describe the attributes of a system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flow and controls, the logical design, and the physical implementation."

2019: We are here

Heterogeneity “In the Large”:
• 1000’s of distinct Android devices
• IoTs even more diverse

Heterogeneity “In the Small”:
• Dozens of ISAs on chip + Accelerators
• Memory, Data Heterogeneity
• Memory Consistency Models

https://www.anandtech.com/show/9330/exynos-7420-deep-dive/2

[Figure: Exynos 7420 abstracted floor plan (AnandTech). Labeled blocks include A57 and A53 CPU cores, 2MB L2 and 256KB L2 caches, GPU shader cores, ISP and cameras, JPEG, G2D, MFC, scalers, buses, and peripherals such as SPI, I2C, UART, and PWM.]

Seismic Shift: Entering a Post-ISA World

ISAs still useful, but little/no relevance as abstraction layer

• Apple A8 and beyond: >50% of chip area is accelerators that have no ISA.

• NVIDIA PTX vs. SASS: ISA hidden under other layers.

Questions for the Post-Moore/Post-ISA Future:
• How to program these highly heterogeneous systems? How to manage the complexity of fast-changing hardware and software?
• How to verify them?
• And what technologies come next?

This Talk: A whirlwind tour of…

DECADES: A VERTICALLY-INTEGRATED APPROACH

Language and Compiler Support
• Enhance data locality
• Optimize spatial mapping of threads
• Enable in-memory computing

Very Coarse-Grained Reconfigurable Tile-Based Architecture
• Coarser than CGRA → VCGRTA
• 3 classes of reconfigurable tiles
• Reconfigurable interconnection network
• Reconfigurable in-memory computing

Multi-Tiered Demonstration Strategy
• Scalable full-system simulation
• Multi-FPGA emulation infrastructure
• 225-tile DECADES chip prototype

DECADES PLATFORM ARCHITECTURE

[Figure: DECADES platform architecture (DECADES TA1 and TA2).]

Compile time:
• Source Code -> Compile-Time Optimizations (Graphicionado, DeSC) -> Program Executable
• Data Segments Definition, Address-space mapping
• Initial Set of In-Memory Operations
• Initial Task Mapping
• DECADES Tile Configuration
• Interconnect Bridges Configuration

Run-Time Self-Tuning:
• Data Migration (across DDR nodes)
• Update In-Memory Operations
• Task Migration (across Tiles)
• DECADES Tile Reconfiguration
• Bridge Ports Re-Routing

DECADES chip: a grid of DECADES Core Tiles, DECADES Accelerator Tiles, and DECADES Intelligent Storage
• DECADES Core Tile: Configurable Core Pipeline, Data Supply / Compute Threads, L1 Cache, Per-Tile Configurable On-chip Memory, Configurable Interconnect Shims, DECADES Monitor and Run-Time Reconfiguration Shim
• DECADES Accelerator Tile: Specialized, Configurable Data Supply / Compute Accelerator, On-chip Memory, Configurable Interconnect Shims
• DECADES Intelligent Storage: Near-Memory Computation, Across-Chip Configurable On-Chip Memory, Configurable Pattern-Based Prefetcher, Configurable Interconnect Shims

FPGAs: Off-Chip Memory System with In-Memory Computation

Heterogeneity meets coarse reconfigurability

DECADES Core and Accelerator tiles
• Computations mapped onto core tiles or available accelerator tiles
• Each tile is wrapped in a monitor/reconfiguration shim
• Dynamic reconfiguration of Supply-Compute decoupling, power-performance tradeoffs, and interconnect


DECADES Storage Specialization
• Specialization #1: Map apps onto mix of compute tiles and intelligent storage (IS) tiles
• Specialization #2: Select and configure appropriate storage features within IS
• Configurable memory banks + address and prefetching features
• Simple near-SRAM ALU

[Figure: DECADES Intelligent Storage Tile. Blocks: Configurable Interconnect Shim, FIFO Head/Tail Pointer Array, Intelligent DMA Engine, Cache Pipeline, Pattern-Based Prefetcher, Configurable Bank Interconnect, Performance Monitoring Registers, Multi-grain Associative Tag Array, ALU for near-SRAM processing, Configurable Memory Banks.]

Improving Latency-Bound Applications with Decoupled Execution

• Roots in seminal 1982 Smith DAE paper: Separately execute memory accesses (supply) from instructions that compute with them (compute)
• Then: Latency tolerance “simpler” than out-of-order execution
• Now: Fits well with accelerator-oriented design
• Separate memory supply from accelerator
• Orthogonal to DOALL parallelism and bandwidth optimizations
• Automatically identify and slice at compile time
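A minimal software sketch of the supply/compute split described above, assuming Python threads stand in for the two slices and a bounded queue stands in for the hardware FIFO. The function names (`supply`, `compute`, `decoupled_sum_of_squares`) and the sum-of-squares kernel are hypothetical illustrations, not the DECADES compiler output.

```python
import queue
import threading


def supply(indices, memory, fifo):
    """Supply slice: only address generation and memory loads."""
    for i in indices:
        fifo.put(memory[i])   # indirect load, issued ahead of the compute slice
    fifo.put(None)            # sentinel: no more data (assumes no None in memory)


def compute(fifo, result):
    """Compute slice: consumes loaded values, never touches memory itself."""
    total = 0
    while True:
        value = fifo.get()
        if value is None:
            break
        total += value * value
    result.append(total)


def decoupled_sum_of_squares(indices, memory, depth=8):
    """Run the two slices concurrently, coupled only through a bounded queue."""
    fifo = queue.Queue(maxsize=depth)  # bounded, like a hardware FIFO
    result = []
    producer = threading.Thread(target=supply, args=(indices, memory, fifo))
    consumer = threading.Thread(target=compute, args=(fifo, result))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return result[0]


if __name__ == "__main__":
    print(decoupled_sum_of_squares([3, 0, 2], [1, 2, 3, 4]))
```

The supply thread can run ahead by up to `depth` loads, which is how decoupling hides memory latency from the compute slice.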

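[Figure: Fig. 1 of the 1982 Smith paper, the DAE architecture: an Access Processor and an Execute Processor, each with its own instruction stream and register file, communicating with memory and with each other through queues (labels include AQ and AEBQ).]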

DECADES Decoupled Supply-Compute Parallelism: Automatic Compilation + HW support

2 in-order CPUs outperform 1 large out-of-order CPU at 3X less power and area

Metric        Improvement
Performance   10%
Area          33%

[Figure: Load Characterization. y-axis: percentage of loads (0 to 100). Benchmarks and inputs: BFS (Kron, Uni, SW), SSSP (Kron, Uni, SW), PR (Kron, Uni, SW), EWSD (Kron, Uni, SW), GP (Pow, Exp, Uni). Categories: Normal Loads, LLAMA (Terminal Load), Remaining Terminal Loads, Supply Exclusive Loads, Compute Exclusive Loads.]

DECADES Reconfigurable Parallelism
• Coarse-grained reconfigurability
• The Sinkhorn workflow contains a sequence of two kernel calls:
• Sparse Elementwise
• Dense Matrix Multiplication
• Each kernel benefits from a different system configuration for a finite number of resources
• Sparse benefits from decoupled supply/compute
• Dense benefits from traditional parallelism
• The DECADES architecture can be reconfigured to use either some tiles as suppliers or all tiles as compute

Example: 6-tiled system configs
All Compute:     C C C C C C
Half Suppliers:  C C C S S S

Speedups
Application   All Compute   Half Suppliers
Sparse        5x            7x
Dense         6x            5x
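A small worked example of why reconfiguring between kernels helps, using the speedup table above. It assumes both speedups are measured against the same baseline, that reconfiguration cost is negligible, and that each kernel takes 210 time units on that baseline. Those baseline times and the function names are hypothetical, chosen only so the arithmetic is exact.

```python
# Speedups copied from the slide's table; the baseline itself is an assumption.
SPEEDUP = {
    "sparse": {"all_compute": 5.0, "half_suppliers": 7.0},
    "dense": {"all_compute": 6.0, "half_suppliers": 5.0},
}


def best_config(kernel):
    """Pick the tile configuration with the highest speedup for one kernel."""
    options = SPEEDUP[kernel]
    return max(options, key=options.get)


def workflow_time(kernels, baseline_time, config=None):
    """Total run time; config=None means reconfigure to the best option per kernel."""
    total = 0.0
    for kernel in kernels:
        chosen = config if config is not None else best_config(kernel)
        total += baseline_time[kernel] / SPEEDUP[kernel][chosen]
    return total


if __name__ == "__main__":
    sinkhorn = ["sparse", "dense"]
    baseline = {"sparse": 210.0, "dense": 210.0}  # hypothetical time units
    for cfg in ("all_compute", "half_suppliers", None):
        print(cfg or "reconfigure per kernel", workflow_time(sinkhorn, baseline, cfg))
```

Under these assumptions the fixed configurations take 77 and 72 time units, while switching configuration between the two kernels takes 65.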





Post-ISA Hardware Design

SoCs comprise many CPUs, GPUs, and accelerators from many vendors
-> How to verify that a given block does what it is specified to do?
-> How to formally specify what blocks should do?

-> Formal Interface Specifications are the new ISA!

[Figure: Exynos 7420 abstracted floor plan (AnandTech), https://www.anandtech.com/show/9330/exynos-7420-deep-dive/2]

The Check Suite: An Ecosystem of Tools For Verifying Memory Orderings and their Security Implications

Layers of the stack covered: High-Level Languages (HLL), Compiler, OS, Architecture (ISA), Microarchitecture, RTL (e.g. Verilog)

Tools
• PipeCheck [Micro ‘14] [IEEE MICRO Top Picks]
• CCICheck [Micro ‘15] [Nominated for Best Paper Award]
• COATCheck [ASPLOS ‘16] [IEEE MICRO Top Picks]
• TriCheck [ASPLOS ‘17] [IEEE MICRO Top Picks]
• RTLCheck [Micro ‘17] [IEEE MICRO Top Picks Honorable Mention]
• CheckMate [Micro ‘18] [IEEE Micro Top Picks]
• PipeProof [Micro ‘18] [Best Paper Nominee. IEEE Micro Top Picks Honorable Mention]

Our Approach
• Axiomatic specifications -> Happens-before graphs
• Check happens-before graphs via efficient SMT solvers
• Cyclic => A->B->C->A… Can’t happen
• Acyclic => Scenario is observable

For more info: check.cs.Princeton.edu
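The cyclic/acyclic rule above can be illustrated with a plain depth-first search. This is only a sketch of the decision rule on one explicit graph. The Check tools themselves use SMT solvers over many candidate graphs, and the function names here are hypothetical.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:        # back edge: nxt is on the current path
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


def observable(edges):
    """A scenario is observable only if its happens-before graph is acyclic."""
    return not has_cycle(edges)


if __name__ == "__main__":
    print(observable([("A", "B"), ("B", "C"), ("C", "A")]))  # cyclic graph
    print(observable([("A", "B"), ("B", "C")]))              # acyclic graph
```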

Check: Formal, Axiomatic Models and Interfaces

[Figure: two cores, each with pipeline stages Fetch, Dec., Exec., Mem., WB plus SB and L1, sharing an L2 under a Coherence Protocol (SWMR, DVI, etc.).]

Microarchitecture Specification in μSpec DSL

Axiom "PO_Fetch":
forall microops "i1",
forall microops "i2",
SameCore i1 i2 /\ ProgramOrder i1 i2 =>
AddEdge ((i1, Fetch), (i2, Fetch), "PO").

Axiom "Execute_stage_is_in_order":
forall microops "i1",
forall microops "i2",
SameCore i1 i2 /\
EdgeExists ((i1, Fetch), (i2, Fetch)) =>
AddEdge ((i1, Execute), (i2, Execute), "PPO").

Microarchitectural happens-before (µhb) graphs: exhaustive consideration of all possible executions
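To make the two axioms above concrete, here is a rough Python rendering that applies them to one fixed list of micro-ops and collects the resulting µhb edges. It assumes `SameCore` means equal core id and `ProgramOrder` means earlier index on that core. The `MicroOp` class and `apply_axioms` function are hypothetical, and the real tools evaluate μSpec with a solver rather than by direct enumeration like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    name: str
    core: int
    index: int  # position in program order on its core (assumed encoding)


def same_core(a, b):
    return a.core == b.core


def program_order(a, b):
    return a.index < b.index


def apply_axioms(microops):
    """Return the set of (source node, destination node, label) edges."""
    edges = set()
    # Axiom "PO_Fetch": same core and program order imply a Fetch -> Fetch edge.
    for i1 in microops:
        for i2 in microops:
            if same_core(i1, i2) and program_order(i1, i2):
                edges.add(((i1.name, "Fetch"), (i2.name, "Fetch"), "PO"))
    # Axiom "Execute_stage_is_in_order": an existing Fetch -> Fetch edge on the
    # same core implies an Execute -> Execute edge.
    fetch_pairs = {(src, dst) for src, dst, _ in edges}
    for i1 in microops:
        for i2 in microops:
            fetched_in_order = ((i1.name, "Fetch"), (i2.name, "Fetch")) in fetch_pairs
            if same_core(i1, i2) and fetched_in_order:
                edges.add(((i1.name, "Execute"), (i2.name, "Execute"), "PPO"))
    return edges


if __name__ == "__main__":
    ops = [MicroOp("i1", 0, 0), MicroOp("i2", 0, 1), MicroOp("i3", 1, 0)]
    for edge in sorted(apply_axioms(ops)):
        print(edge)
```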

TriCheck Framework: Verifying Memory Event Ordering from Languages to Hardware

[Figure: TriCheck flow.]
• High-level Lang Litmus tests -> HLL Mem Model Sim -> Permitted/Forbidden
• High-level Lang Litmus tests -> HLL->ISA Compiler Mappings -> ISA-level Litmus tests -> ISA Mem Model + uArch Mem Model -> Observable/Unobservable
• Compare Outcomes:

            Obs.    Not obs.
Permit      ok      Overstrict
Forbid      Bug     ok
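The Compare Outcomes table can be written as a small function. This is a sketch of the comparison step only. The litmus test names and outcome values in the usage example are hypothetical placeholders, not results from the talk.

```python
from collections import Counter


def classify(permitted_by_hll, observable_on_uarch):
    """Compare the HLL verdict with the microarchitecture verdict for one outcome."""
    if observable_on_uarch and not permitted_by_hll:
        return "bug"          # hardware shows an outcome the language forbids
    if permitted_by_hll and not observable_on_uarch:
        return "overstrict"   # legal outcome that the hardware never produces
    return "ok"


def summarize(results):
    """results maps a litmus test name to (permitted, observable)."""
    return Counter(classify(p, o) for p, o in results.values())


if __name__ == "__main__":
    # Hypothetical verdicts for illustration only.
    example = {
        "test_a": (True, True),
        "test_b": (False, True),
        "test_c": (True, False),
        "test_d": (False, False),
    }
    print(summarize(example))
```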

RISC-V Case Study
• Apply TriCheck to 7 legal RISC-V implementations:
• All abide by current RISC-V spec
• Vary in preserved program order and store atomicity

• Results:
• Impossible to compile C11 for RISC-V as specified.
• Insufficiently strong fence instructions, load-load reordering, …
• Out of 1,701 tested C11 programs:
• RISC-V-Base-compliant design allows 144 buggy outcomes
• RISC-V-Base+A-compliant design allows 221 buggy outcomes

Takeaway: Draft RISC-V spec could not serve as a legal C11 compiler target

Next Steps: RISC-V Memory Model Working Group formed to address these issues. It ratified a new and formally specified RISC-V memory model that supports C11, Linux, etc.

From Memory Consistency Models to Security: What we would like…

1. Specify system to study: Formal interface and specification of given system implementation
2. Specify attack pattern: E.g. subtle event sequences during program’s execution
3. Synthesis: Either output synthesized attacks, or determine that none are possible

The CheckMate Tool: Automated Attack Discovery & Synthesis

1. Specify system to study: Axiomatic specifications similar to Check tools
2. Specify attack pattern: Event sequences as graph snippets
3. Synthesis: Relational Model Finding (RMF) approaches

• What we did: Developed a tool to do this, based on the uHB graphs from previous sections.
• Results: Automatically synthesized Spectre and Meltdown, as well as two new distinct exploits and many variants.

[Trippel, Lustig, Martonosi. https://arxiv.org/abs/1802.03802]
[Trippel, Lustig, Martonosi. MICRO 2018. October, 2018] http://check.cs.princeton.edu/papers/ctrippel_MICRO51.pdf

Microarchitecture-Aware Security Verification

Inputs to CheckMate
• Microarchitecture: μSpec axioms such as "PO_Fetch" and "Execute_stage_is_in_order"
• μhb Pattern: load being sourced from the store buffer (nodes Execute, Store Buffer, L1 ViCL Create, with μrf edges)
• Execution Constraints: #cores = 1, #threads = 1, #instr ≤ 2

CheckMate: enumerate all possible execution graphs with the pattern

[Figure: resulting μhb graph for Core 0 running W [x]→1 followed by R [x]→r0. Rows: Fetch, Execute, Commit, Store Buffer, L1 ViCL Create, L1 ViCL Expire, Main Memory, Complete. μrf edges connect the write to the read.]

Overall Results: What exploits get synthesized? And how long does it take?

The Check Suite: Bug Finds and Other Key Takeaways

So far, tools have found bugs in:
• Widely-used research simulator
• Cache coherence paper
• In-design commercial processors
• RISC-V ISA specification
• Compiler mapping proofs
• IBM XL C++ compiler (fixed in v13.1.5)
• C++ 11 mem model

+ SpectrePrime, MeltdownPrime, and other vulnerabilities automatically synthesized

Key Takeaways
• Need formal, well-specified interfaces
• From well-specified interfaces -> Interaction models and analysis
• From well-specified interfaces -> Reliability and performance metrics





QC Algorithms to Machines Gap: Next Ten Years = NISQ Era

• Noisy Intermediate-Scale Quantum (NISQ): Preskill, Jan 2018; 10-1000 qubits
• Too small for known algorithms with exponential speedup
• Too small for ECC
• Large enough to support interesting experiments!

[Figure: #Qubits (log scale, 1 to 1,000,000) vs. Year (1995 to 2025). Algorithm requirements shown for Grover's Algorithm (Database search), Shor’s Factoring Alg. (Crypto), and Quantum Sim, Q Chem, QAOA. The NISQ region is marked, with a "Gap!" between algorithm requirements and machine sizes.]

QC Algorithms to Machines Gap: Opportunity

QC programming and design tools that shrink the gap can move the feasibility point years sooner!
• Reduce algorithm qubit requirements
• Improve effectiveness of hardware qubits

[Figure: #Qubits (log scale, 1 to 1,000,000) vs. Year (1995 to 2025), showing the gap between algorithm requirements (Grover's Algorithm, Shor’s Factoring Alg., Quantum Sim, Q Chem, QAOA) and NISQ machines, annotated "More Work Needed!"]

An Architect’s Comparative Study of NISQ Machines

Architectural choices and design questions
• Gate sets, Connectivity, Noise properties of qubits

In-depth exploration
• Real-systems runs across 7 NISQ systems (3 vendors)
• Enabled by our multi-vendor compiler toolflow: fair and accurate comparisons avoiding inefficiencies from vendor compilers

Design insights from our study
• Gate set choices
• Connectivity choices
• NISQ execution stack

[Murali et al. International Symposium on Computer Architecture. June, 2019. https://arxiv.org/abs/1905.11349]

Large Variations in Machine Characteristics

• Optimized compilation for qubit communication is not sufficient!
• Large spatial and temporal variations in gate error rates!

[Figure: Gate error rate (0.05 to 0.35) vs. Day (0 to 25) for CNOT 5,4; CNOT 7,10; CNOT 3,14.]
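One way a compiler could react to these variations is to consult the day's calibration data before placing gates. The sketch below assumes independent gate errors, so that an estimated success probability is the product of (1 - error) over the gates used. The error values are hypothetical, not read from the plot, and the function names are illustrative only.

```python
def best_cnot_pair(error_rates):
    """Return the qubit pair whose CNOT currently has the lowest error rate."""
    return min(error_rates, key=error_rates.get)


def estimated_success(gate_errors):
    """Estimate success probability, assuming independent gate errors."""
    probability = 1.0
    for error in gate_errors:
        probability *= 1.0 - error
    return probability


if __name__ == "__main__":
    # Hypothetical calibration snapshot for one day, keyed by (control, target).
    today = {(5, 4): 0.05, (7, 10): 0.30, (3, 14): 0.10}
    pair = best_cnot_pair(today)
    print(pair, estimated_success([today[pair]] * 3))
```

Because error rates change from day to day, a mapping chosen from yesterday's snapshot may no longer be the best one today.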

Large Variations in Machine Characteristics, Part 2

• Large spatial and temporal variations in Qubit Coherence Times
• How long they hold their state

[Figure: qubit coherence time (0 to 120) vs. Day (0 to 25) for qubits Q0, Q4, Q9.]

Machines
[Figure: qubit topology diagram for each machine.]
• IBM Q5 Tenerife
• IBM Q14 Melbourne
• IBM Q16 Rüschlikon
• Rigetti Agave
• Rigetti Aspen1
• Rigetti Aspen3
• UMD Trapped Ion (UMDTI)

Programs
Name      Qubits   Gates
BV4       4        12
BV6       6        12
BV8       8        18
HS2       2        16
HS4       4        28
HS6       6        42
Toffoli   3        18
Fredkin   3        19
Or        3        17
Peres     3        16
QFT       2        13
Adder     4        23

Programs and machines are connected by the TriQ Compiler.

Putting it all together…

[Figure: measured results for the 12 benchmarks on IBM Q5, IBM Q14, IBM Q16, Rigetti Agave, Rigetti Aspen1, Rigetti Aspen3, and UMDTI. Per the figure caption, the comparison is intended to illustrate the impact of architectural choices on benchmark performance and is not intended to pick a winning technology, vendor, or implementation. Individual benchmark numbers may vary over time, and the measurements present a snapshot of the systems' performance when the experiments were performed.]

Quantum Systems Today: An Analogy

~1950’s Classical Computing (top to bottom)
• Algorithms
• Assembly Language
• Vacuum Tubes, Relay Circuits

Today’s Classical Computing (top to bottom)
• Algorithms
• High-Level Languages
• Compiler
• OS
• Architecture
• Modular hardware blocks: Gates, registers
• VLSI Circuits
• Semiconductor transistors

Quantum Toolflows (top to bottom)
• Algorithms
• High-level QC Languages. Compilers. Debugging. Optimization. Error Correcting Codes. Orchestrate classical gate control. Orchestrate qubit motion and manipulation.
• Qubit implementations

Post-ISA Systems: Automated full-stack compilation via APIs and formally-spec’d interfaces

Classical Layering (top to bottom)
• Algorithms
• Compiler
• OS
• Architecture
• Modular hardware blocks: Gates, registers
• VLSI Circuits
• Semiconductor transistors

Post-ISA Toolflows (top to bottom)
• Algorithms
• App-specific Approaches
• implementations

Examples of App-specific Approaches:
• QC Toolflows
• TensorFlow and TPUs
• …

Potential for a Seismic Shift in Computer Systems Design

Thanks to: Students and Co-Authors
Funding: NSF, DARPA, SRC CFAR, SRC ADA.

For more info:
http://check.cs.Princeton.edu
http://mrmgroup.cs.Princeton.edu

Quantum Resources:
CRA CCC Workshop Report “Next Steps in QC: Computer Science’s Role” https://arxiv.org/abs/1903.10541
National Academies Report “Quantum Computing Progress and Prospects” https://www.nap.edu/catalog/25196/quantum-computing-progress-and-prospects

"Any opinions, findings, conclusions or recommendations

expressed in this material are those of the author(s) and do not

necessarily reflect the views of the Networking and Information

Technology Research and Development Program."

The Networking and Information Technology Research and Development (NITRD) Program

Mailing Address: NCO/NITRD, 2415 Eisenhower Avenue, Alexandria, VA 22314

Physical Address: 490 L'Enfant Plaza SW, Suite 8001, Washington, DC 20024, USA
Tel: 202-459-9674, Fax: 202-459-9673, Email: [email protected], Website: https://www.nitrd.gov
