Arun Rodrigues, Scott Hemmert, Dave Resnick: Sandia National Lab (ABQ); Keren Bergman: Columbia University; Bruce Jacob: U. Maryland; John Shalf, Paul Hargrove: Lawrence Berkeley National Laboratory; Gilbert Hendry: Sandia National Laboratory; Dan Quinlan, Chunhua Liao: Lawrence Livermore National Lab; Sudhakar Yalamanchili: Georgia Tech. Data Movement Dominates (DMD) and CoDEx: CoDesign for Exascale

Upload: angelina-norman

Post on 15-Jan-2016


Page 1:

Arun Rodrigues, Scott Hemmert, Dave Resnick:

Sandia National Lab (ABQ)

Keren Bergman: Columbia University

Bruce Jacob: U. Maryland

John Shalf, Paul Hargrove: Lawrence Berkeley National Laboratory

Gilbert Hendry: Sandia National Laboratory

Dan Quinlan, Chunhua Liao: Lawrence Livermore National Lab

Sudhakar Yalamanchili: Georgia Tech

Data Movement Dominates (DMD) and CoDEx: CoDesign for Exascale

Page 2:

Codesign Tools Recap

Architectural Simulation to Accelerate CoDesign

SST

• System level models

ACE

• Node level emulation

ROSE

• Application analysis

ROSE Compiler: Enables deep analysis of application requirements, semi-automatic generation of skeleton applications, and code generation for ACE and SST.

ACE Node Emulation: Rapid design synthesis and FPGA-accelerated emulation for rapid prototyping of cycle-accurate models of manycore node designs.

SST Macro System Simulation: Enables system-scale simulation through capture of application communication traces and simulation of large-scale interconnects.

SST Micro Software Simulators: Software simulation for node-level simulation

Page 3:

CoDEx: CoDesign For Exascale

ASCR-funded Simulation Infrastructure Project

SST: Structural Simulation Toolkit

NNSA-funded Simulation Tools

(ASC Program)

Page 4:

CAL: Computer Architecture Laboratory (Sandia/LBL)

Page 5:

Fidelity vs. Scope for Architectural Simulation Methods

Page 6:

ROSE Compiler

Full Program Understanding through Deep Source-Code Analysis

Page 7:

ExaSAT: Exascale Static Analysis Tool

Compiler-Automated Performance Model Extraction

• Can automatically predict performance for many input codes and software optimizations
• Predict performance under different architectural scenarios
• Much faster than hardware simulation and manual modeling

[Figure: ExaSAT workflow. Labels: Combustion Codes; Compiler Analysis; Dependency Graph Optimization; Performance Model (XML); User Parameters; Machine Parameters; Performance Prediction Spreadsheet]
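The idea of combining a per-kernel performance model with machine parameters can be illustrated with a minimal sketch. It assumes a simple roofline-style bound (run time is the larger of compute time and memory-traffic time); this is an illustrative stand-in, not ExaSAT's actual model, and every name and number below is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KernelCounts:
    """Per-cell operation counts, as a static analysis might report them (hypothetical)."""
    flops_per_cell: float
    bytes_per_cell: float


@dataclass
class Machine:
    """Machine parameters for one architectural scenario (hypothetical values)."""
    peak_flops: float      # floating-point operations per second
    mem_bandwidth: float   # bytes per second


def predict_seconds(kernel: KernelCounts, machine: Machine, n_cells: int) -> float:
    """Roofline-style estimate: the slower of compute and memory traffic bounds run time."""
    compute_time = kernel.flops_per_cell * n_cells / machine.peak_flops
    memory_time = kernel.bytes_per_cell * n_cells / machine.mem_bandwidth
    return max(compute_time, memory_time)


# One kernel evaluated under two architectural scenarios.
kernel = KernelCounts(flops_per_cell=100.0, bytes_per_cell=80.0)
baseline = Machine(peak_flops=1e12, mem_bandwidth=1e11)
more_bandwidth = Machine(peak_flops=1e12, mem_bandwidth=4e11)

n_cells = 10**6
t_base = predict_seconds(kernel, baseline, n_cells)
t_bw = predict_seconds(kernel, more_bandwidth, n_cells)
print(f"baseline: {t_base:.2e} s, 4x bandwidth: {t_bw:.2e} s")
```

Because the model is a closed-form expression, sweeping machine parameters costs almost nothing, which is the sense in which such a model can be much faster than hardware simulation.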

Page 8:

SST/macro: Coarse-Grained Simulation

An application code with minor modifications

SST/macro implementations of interfaces (e.g., MPI), which simulate execution and communication
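As a rough illustration of coarse-grained simulation, the sketch below advances a virtual clock per rank instead of executing real computation or communication. It assumes a ring exchange and a latency-plus-bandwidth message cost; the function names and parameters are hypothetical and are not part of the SST/macro API.

```python
def message_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Assumed cost model for one message: fixed latency plus serialization time."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def simulate_ring(n_ranks, n_steps, compute_s, nbytes, latency_s, bandwidth_bytes_per_s):
    """Return the simulated run time of a skeleton app doing compute + ring exchange."""
    clock = [0.0] * n_ranks  # one virtual clock per simulated rank
    transfer = message_time(nbytes, latency_s, bandwidth_bytes_per_s)
    for _ in range(n_steps):
        # "Compute" phase: no real work is done, the clock is simply advanced.
        sent_at = [t + compute_s for t in clock]
        # Exchange phase: a rank continues once its own compute is done and
        # the message from its left neighbour has arrived.
        clock = [
            max(sent_at[r], sent_at[(r - 1) % n_ranks] + transfer)
            for r in range(n_ranks)
        ]
    return max(clock)


total = simulate_ring(
    n_ranks=4,
    n_steps=10,
    compute_s=1e-3,
    nbytes=1_000_000,
    latency_s=1e-6,
    bandwidth_bytes_per_s=1e9,
)
print(f"simulated run time: {total:.6f} s")
```

Replacing the compute phase with a time estimate is what lets a skeleton application stand in for the full code at system scale.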

Page 9:

SST/micro: Cycle-Accurate Framework

• Has a general simulation framework for integrating models
• Simulation backend is parallel
• Plenty of people involved

Page 10:

Some Models Currently Integrated

Gem5 is a well-known architectural simulator with models for processors, caches, busses, and network components.

MacSim provides a model of GPU/CPU cores or heterogeneous computing nodes, which can be driven from x86 or PTX (CUDA) traces.

IRIS provides a pipelined, cycle-accurate router model capable of modeling a variety of Network-on-Chip (NoC) and inter-node interconnection architectures.

PhoenixSim models photonic networks.

Page 11:

Leveraging Embedded Design Automation For Design Space Exploration

This stuff is essential!

Page 12:

Embedded Design Automation

(Using FPGA emulation to do rapid prototyping)

RAMP FPGA-accelerated emulation of ASIC

Or “tape out” to FPGA

Page 13:

Data Movement Dominates (Sandia, Micron, Columbia, LBL)

Understand the Potential of Intelligent, Stacked DRAM Technology

• Data movement is projected to account for over 75% of the power budget for an exascale platform
• Work to reduce that via
– Optical interconnect(s)
– 3D stacking (logic + memory + optics)
– New memory protocols

Research Questions
– What is the performance potential of stacked memory (power & speed)
– How much intelligence to put into the logic layer
• Atomics, gather/scatter, checksums, full-processor-in-memory
– What is the memory consistency model for intelligent DRAM
– How to program it if we embed more intelligence into DRAM

Page 14:

The Cost of Moving Data

Page 15:

Locality Management is Key

What is the best combination of software and hardware mechanisms to maximize data movement efficiency?

Vertical Locality Management

Horizontal Locality Management

Sun Microsystems

Temporal

Topological

Page 16:

Why Study Chip Stacking (TSVs)?

Energy = (V² × C) × Overhead + Ecomm

DRAM Cells Efficient

• DRAM cells require < 1 pJ to access
• Current DRAM architectures are not power efficient
• Long distances ➔ high power
• We pay for more than we get at every level
– Cache: throw away 75-80%
– DRAM Row: Charge 1024B for each 64B access
– DIMM: Charge 8-9 chips/access
– ~800 pJ/byte total
• DRAM design driven by packaging constraints
– ~50% of DRAM chip cost is packaging, mainly in pins
– DIMMs use multiple chips with a few data pins to achieve high BW

TSVs Reduce Costs

• TSVs use orders of magnitude less energy
– 250 fJ/bit for reading DRAM
– 5 fJ/bit for TSV
– 250 fJ/bit for mem. controller
– ~0.5 pJ/bit (compared to 30 pJ for conventional DIMM)
– Don’t have to access more data than needed
• Enables....
– Lower Capacitance: Narrower
– Lower Overhead: Smarter
– In-Memory computation
• Requires
– ...changes to how we view the machine & the memory
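The per-bit figures quoted on this slide can be checked with a short calculation. The sketch below uses only the numbers stated above; the 64-byte access size comes from the DRAM row bullet, and the variable names are illustrative.

```python
FJ_PER_PJ = 1000.0

# Per-bit energies for the stacked (TSV) path, in femtojoules, from the slide.
dram_read_fj = 250.0
tsv_fj = 5.0
mem_controller_fj = 250.0

# Sum of the three components, converted to picojoules per bit.
stacked_pj_per_bit = (dram_read_fj + tsv_fj + mem_controller_fj) / FJ_PER_PJ

# Conventional DIMM figure quoted on the slide, in picojoules per bit.
dimm_pj_per_bit = 30.0
ratio = dimm_pj_per_bit / stacked_pj_per_bit

# Row overfetch: a 1024-byte row is charged for each 64-byte access.
row_overfetch = 1024 // 64

# Energy to move one 64-byte access at each per-bit cost.
line_bits = 64 * 8
stacked_line_pj = line_bits * stacked_pj_per_bit
dimm_line_pj = line_bits * dimm_pj_per_bit

print(f"stacked path: {stacked_pj_per_bit:.3f} pJ/bit")
print(f"DIMM / stacked energy ratio: {ratio:.1f}x")
print(f"row overfetch factor: {row_overfetch}x")
print(f"64B access: {stacked_line_pj:.2f} pJ stacked vs {dimm_line_pj:.0f} pJ DIMM")
```

The three components sum to 505 fJ/bit, which matches the slide's rounded ~0.5 pJ/bit and is roughly 60 times lower than the quoted 30 pJ/bit for a conventional DIMM.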

Page 17:

Why Photonics?

ELECTRONICS: Buffer, receive, and re-transmit at every router.

Space Parallelism: Each bus lane routed independently (P ∝ NLANES).

Off-chip BW requires much more power than on-chip BW.

PHOTONICS: Modulate/receive data stream once per communication event.

Wavelength Parallelism: Broadband switch routes entire multi-wavelength stream.

Off-chip BW ≈ on-chip BW for nearly same power.

Photonics changes the rules for Bandwidth-per-Watt.

[Figure: TX/RX link diagrams for the electronic and photonic cases]

Page 18:

Why Optically-Connected Memory?

Traditional Memory

• Large pin-out
• Complex wiring
• Low bandwidth density
• Distance constrained by electrical limitations
• High power dissipation

Will not scale to meet power and bandwidth requirements of future high-performance computing systems

Optically-Connected Memory

• All-optical link, no electronic bus to drive
• Bit-rate transparent link
• High bandwidth density, fewer pins
• Distance immunity at computer scale
• Low power dissipation

Enables scaling of high-performance computing through increased memory capacity and bandwidth

[Figure: CPU connected to HBDRAM modules. Labels: Electronic Bus; Optical Link]

Page 19:

Page 20:

Mixed Model Simulation

Cycle-accurate and energy-accurate models

[Figure: mixed-model simulation block diagram. Labels: SST/macro; skeleton app (C, C++, Fortran); MPI Traces (DUMPI); Workload Translation; kernels; Processor Model (SST/micro & Tensilica); Address Translation; Memory Model (DRAMSim2, FLASHsim, NVRAM); NoC Model (PhoenixSim); SystemC; (C++); Fault Injection; Checkpoint/restart]

Page 21:

Simulator Infrastructure: Interconnects

Cycle-accurate and energy-accurate models

Developed by Sandia Collaborators

CoDEx project

Page 22:

Simulator Infrastructure: Memory

Cycle-accurate and energy-accurate models

Validated against Micron DRAM

HMC model coming this summer

Page 23:

Simulator Infrastructure

Cycle-accurate and energy-accurate models

Rewrote Columbia PhoenixSim (summer 2011)

Orion-2 energy model

Validated against Cornell test parts

Page 24:

Simulator Infrastructure

Cycle-accurate and energy-accurate models

Full gate-level RTL model of processor

Well-characterized energy model