
Climate Machine Update

David Donofrio

RAMP Retreat

8/20/2008


Agenda

• Project Overview

• Tensilica Architecture and Design Flow

• Tensilica Tools Demo

• Why we need RAMP

• Current Progress

• Next Steps


A New Approach to HPC

• Current HPC design approach:
  – Leverage commodity processors from Intel, AMD, etc.
  – Once the machine is built, optimize problems to run on it
  – Power wall prevents scaling to exaflop performance
  – Power is the new design point

Olukotun and Sutter

Moore’s Law is still in effect, but the number of processors doubles every 18 months rather than the clock rate


A New Approach to HPC

• Our approach:
  – Identify the application, then tailor the machine using semi-custom design
  – Optimize the CPU architecture and further extend it with a semi-custom ISA
  – Leverage auto-tuning to access architecture-specific optimizations
  – Even if each simple core is 1/4 as computationally efficient as a complex core, you can fit hundreds on a single die and be 100x more power efficient
• Learn from the embedded market, where Flops / Watt and rapid design cycles are crucial
  – Start with building blocks from embedded designs rather than a full-custom ASIC
  – Preserve the ability to run general-purpose C code
• Application target: 1 km scale climate model

Tailor machine architecture to application to reduce waste
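The 1/4 and 100x figures on this slide imply a per-core power budget. The following is a minimal sketch, assuming "computationally efficient" means per-core throughput relative to the complex core and "power efficient" means throughput per watt. Both readings are assumptions, and the function names are hypothetical.

```python
import math


def required_power_fraction(perf_fraction: float, efficiency_gain: float) -> float:
    """Simple-core power as a fraction of complex-core power.

    Assumes efficiency = throughput / power, which gives
    efficiency_gain = perf_fraction / power_fraction.
    """
    if perf_fraction <= 0 or efficiency_gain <= 0:
        raise ValueError("both arguments must be positive")
    return perf_fraction / efficiency_gain


def cores_to_match(perf_fraction: float) -> int:
    """Simple cores needed to match one complex core's throughput,
    assuming perfect parallel scaling (an idealization)."""
    if perf_fraction <= 0:
        raise ValueError("perf_fraction must be positive")
    return math.ceil(1.0 / perf_fraction)


if __name__ == "__main__":
    # Slide figures: 1/4 the throughput per core, 100x the power efficiency.
    frac = required_power_fraction(perf_fraction=0.25, efficiency_gain=100.0)
    print(f"simple-core power budget: {frac:.4f} of a complex core")
    print(f"simple cores per complex core of throughput: {cores_to_match(0.25)}")
```

Under these assumptions, each simple core would have to draw about 1/400 of the complex core's power for the 100x claim to hold.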


Climate Model Resource Requirements

• DOE has identified high-resolution climate modeling as a leading justification for exascale computing
• Must express 20M-way parallelism
• Requires performance of 200 Pflops peak
• Simulation must run 1000x faster than real time
• Amenable to massively concurrent architectures composed of power-efficient embedded cores
• Actively working with the climate science community to enable a new icosahedral model

Images: Randall / CSU; NASA
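The requirements above can be turned into two quick back-of-the-envelope numbers: the peak rate each parallel thread must sustain, and the wall-clock time for a long simulation. This is a rough sketch; the constant and function names are hypothetical, and the 365.25-day year is an assumption.

```python
PEAK_FLOPS = 200e15        # 200 Pflops peak (from the slide)
PARALLELISM = 20e6         # 20M-way parallelism (from the slide)
REALTIME_SPEEDUP = 1000.0  # 1000x faster than real time (from the slide)
DAYS_PER_YEAR = 365.25     # assumption: Julian year


def flops_per_thread(peak_flops: float = PEAK_FLOPS,
                     ways: float = PARALLELISM) -> float:
    """Peak floating-point rate available to each parallel thread."""
    return peak_flops / ways


def wallclock_days(simulated_years: float,
                   speedup: float = REALTIME_SPEEDUP,
                   days_per_year: float = DAYS_PER_YEAR) -> float:
    """Wall-clock days needed to simulate the given number of model years."""
    return simulated_years * days_per_year / speedup


if __name__ == "__main__":
    print(f"per-thread peak: {flops_per_thread() / 1e9:.1f} Gflops")
    print(f"one simulated century: {wallclock_days(100.0):.1f} wall-clock days")
```

With the slide's figures this works out to roughly 10 Gflops of peak per thread, and a simulated century takes a little over 36 days of wall-clock time.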


Tensilica Processor Design Flow

• Complete solution: hardware, software and verification
• Fully customizable
  – Required base ISA ensures general-purpose applications can still run
• Processor configuration is submitted to Tensilica’s servers, where synthesis is performed
  – Returned design can be spun for ASIC or FPGA
  – Bit file available for Avnet boards
• Building-block approach drastically reduces design cycle time compared to full-custom design

Tensilica Inc.


Tensilica Architecture Features

• Verilog-like TIE language allows for custom ISA extensions
  – Functional and performance verification built in
  – Auto-generated compiler intrinsics
  – 64-bit IEEE double-precision (DP) floating point coded up in TIE and available
• Custom VLIW support
• Inter-processor communication easily enabled through:
  – TIE Ports
  – TIE Queues
    • Access to direct HW support for inter-processor communication
  – TIE Lookups
    • Allows interface to external ROMs or other RTL blocks


Tensilica Architecture Overview

[Figure not recoverable from transcript]

Tensilica Inc.


Tensilica Performance Debug

• Processor viewed as a black box
• State can be compressed (via HW) and pushed out the JTAG port
  – Intended for program replay
• Xtensa trace port gives real-time visibility into internal pipeline state with unprecedented detail
  – Cache hit / miss with virtual address
  – Branch taken / not taken
  – Call / return
  – Resource dependency
  – Etc.
• Opportunity for hundreds of performance counters to be made available

Tensilica Inc.


Tensilica Tools Demo

[Demo image not recoverable from transcript]


Why we need RAMP

• Fast, accurate emulation enables:
  – Dual nested loop of HW / SW co-design
    • Preliminary work using Stanford SM sim shows significant improvement in power efficiency using automated HW/SW co-tuning
• RAMP is critical to accelerate:
  – Rapid prototyping and analysis of Tensilica architectural options
  – Inter-processor communication architecture exploration
  – Running FULL climate code, providing a more complete performance picture
• Cycle-accurate simulator currently runs at ~100 kHz vs. 50 MHz on a Virtex 5 (V5)
  – Extensive HW performance counter data enables an emulation environment with similar resolution but much greater speed

Tensilica-provided emulation environment kick-starts this effort
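To make the simulator-versus-FPGA gap concrete, the sketch below compares wall-clock time for a fixed number of target cycles at the two rates quoted on this slide. The names and the 1-billion-cycle workload are illustrative assumptions, not measurements from the project.

```python
SIMULATOR_HZ = 100e3  # ~100 kHz cycle-accurate software simulator (from the slide)
FPGA_HZ = 50e6        # 50 MHz emulation on the V5 FPGA (from the slide)


def emulation_speedup(fpga_hz: float = FPGA_HZ,
                      simulator_hz: float = SIMULATOR_HZ) -> float:
    """Ratio of FPGA emulation rate to software simulation rate."""
    return fpga_hz / simulator_hz


def wallclock_seconds(target_cycles: float, rate_hz: float) -> float:
    """Seconds needed to execute target_cycles at rate_hz target cycles per second."""
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    return target_cycles / rate_hz


if __name__ == "__main__":
    cycles = 1e9  # hypothetical workload size, chosen only for illustration
    print(f"speedup: {emulation_speedup():.0f}x")
    print(f"simulator: {wallclock_seconds(cycles, SIMULATOR_HZ) / 3600:.2f} hours")
    print(f"FPGA:      {wallclock_seconds(cycles, FPGA_HZ):.0f} seconds")
```

At these rates the FPGA is 500x faster: a billion target cycles takes under three hours in the simulator and about 20 seconds in emulation.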


Current Status

• ML505 used for initial design exploration
  – Basic Xtensa processor + JTAG and memory controller is ~50% of a Virtex 5 50T
  – Runs at 50 MHz
    • ASIC in 65G process runs at 650 MHz
• On-chip debug working
• Can load / run programs using main memory synthesized from BRAM
• DRAM interface coded, currently being debugged
• RTL license recently obtained; full simulation environment (in ModelSim) being brought up


Next Steps…

• Transition to BEE3 from ML505
• Bring up XTOS environment on a single Xtensa processor on BEE3
• Run a single column of climate code on a single processor
  – Demo at SC’08 in November
  – Continue HW / SW co-tuning optimization
• Begin multi-processor emulation
  – Emulation of a single socket, 32 cores, using networked BEE3s
  – Running the full 2-million-line climate model


Backup


The Need for Exascale Computing

• DOE has identified high-resolution climate modeling as a leading justification for exascale computing
  – 1 km resolution targeted for an accurate cloud-resolving model
• Difficult to scale existing systems
  – HPC design using commodity processors estimated to draw 179 MW
  – BlueGene design estimated to draw 20 MW
  – By leveraging embedded cores and a more application-specific design, a power envelope of 3-5 MW is projected

Icosahedral

LBNL will seek an external vendor to build the machine if our approach is proven valid - LBNL is not entering the commercial HPC market.

Randall / CSU
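The power estimates on this slide can be compared directly. The sketch below computes how many times less power the projected 3-5 MW envelope would draw than each baseline; the constant and function names are hypothetical, and the inputs are only the estimates quoted above.

```python
COMMODITY_MW = 179.0             # commodity-processor design estimate (from the slide)
BLUEGENE_MW = 20.0               # BlueGene design estimate (from the slide)
EMBEDDED_MW_RANGE = (3.0, 5.0)   # projected envelope for the embedded-core design


def reduction_range(baseline_mw: float,
                    target_range_mw: tuple[float, float] = EMBEDDED_MW_RANGE
                    ) -> tuple[float, float]:
    """Return (min, max) power reduction factor of the target versus the baseline."""
    low_mw, high_mw = target_range_mw
    if low_mw <= 0 or high_mw < low_mw:
        raise ValueError("target range must be positive and ordered")
    # The smallest reduction occurs at the top of the target power range.
    return baseline_mw / high_mw, baseline_mw / low_mw


if __name__ == "__main__":
    for name, mw in (("commodity", COMMODITY_MW), ("BlueGene", BLUEGENE_MW)):
        lo, hi = reduction_range(mw)
        print(f"{name}: {lo:.1f}x to {hi:.1f}x less power")
```

On these estimates the projected design would draw roughly 36x to 60x less power than the commodity approach and 4x to 7x less than the BlueGene approach.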