
Page 1

Exascale Computing: Embedded Style
HPEC, 9/22/09

Peter M. Kogge
McCourtney Chair in Computer Science & Engineering, Univ. of Notre Dame
IBM Fellow (retired)

May 5, 2009

TeraFlop: Embedded
PetaFlop: Departmental
ExaFlop: Data Center

Page 2

Thesis

• Last 30 years:
  – "Gigascale" computing first in a single vector processor
  – "Terascale" computing first via several thousand microprocessors
  – "Petascale" computing first via several hundred thousand cores

• Commercial technology, to date:
  – Always shrunk prior "XXX" scale to smaller form factor
  – Shrink, with speedup, enabled next "XXX" scale

• Space/Embedded computing has lagged far behind
  – Environment forced implementation constraints
  – Power budget limited both clock rate & parallelism

• "Exascale" now on horizon
  – But beginning to suffer similar constraints as space
  – And technologies to tackle exa challenges very relevant

Especially Energy/Power

Page 3

Topics

• The DARPA Exascale Technology Study
• The 3 Strawmen Designs
• A Deep Dive into Operand Access

Page 4

Disclaimers

This presentation reflects my interpretation of the final report of the Exascale working group only, and not necessarily that of the universities, corporations, or other institutions with which the members are affiliated.

Furthermore, the material in this document does not reflect the official views, ideas, opinions, and/or findings of DARPA, the Department of Defense, or the United States government.

Report: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf

Note: Separate Exa Studies on Resiliency & Software

Page 5

The Exascale Study Group
11 Academic, 6 Non-Academic, 5 "Government", plus Special Domain Experts

AFFILIATION     NAME               AFFILIATION     NAME
UT-Austin       Steve Keckler      Columbia        Keren Bergman
Micron          Dean Klein         Intel           Shekhar Borkar
Notre Dame      Peter Kogge        GTRI            Dan Campbell
UC-Berkeley     Kathy Yelick       STA             Sherman Karp
HP              Stan Williams      STA             Jon Hiller
LSU             Thomas Sterling    AFRL            Kerry Hill
SDSC            Allan Snavely      DARPA           Bill Harrod
Cray            Steve Scott        NCSU            Paul Franzon
AFRL            Al Scarpeli        IBM             Monty Denneau
Georgia Tech    Mark Richards      Stanford        Bill Dally
USC/ISI         Bob Lucas          IDA             Bill Carlson

10+ Study Meetings over 2nd half 2007

Page 6

The DARPA Exascale Technology Study

• Exascale = 1,000X the capability of Petascale
• Exascale != Exaflops, but:
  – Exascale at the data center size => Exaflops
  – Exascale at the "rack" size => Petaflops for departmental systems
  – Exascale embedded => Teraflops in a cube

• Teraflops to Petaflops took 14+ years
  – 1st Petaflops workshop: 1994
  – Thru NSF studies, HTMT, HPCS ...
  – To get us to Peta now

• Study questions:
  – Can we ride silicon to Exa by 2015?
  – What will such systems look like?
  – Can we get 1 EF in 20 MW & 500 racks?
  – Where are the Challenges?
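As a sanity check on the 20 MW question (plain arithmetic on the stated target, not a figure from the study):

$$
\frac{20\ \text{MW}}{1\ \text{EF/s}} \;=\; \frac{2\times 10^{7}\ \text{J/s}}{10^{18}\ \text{flop/s}} \;=\; 20\ \text{pJ per flop}
\quad\Longleftrightarrow\quad 50\ \text{GFlops/s per watt},
$$

and that ~20 pJ has to cover not only the FPU itself but every operand access and transport as well.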

Page 7

The Study’s Approach

• Baseline today’s:Commodity TechnologyArchitecturesPerformance (Linpack)

• Articulate scaling of potential application classes• Extrapolate roadmaps for

“Mainstream” technologiesPossible offshoots of mainstream technologiesAlternative and emerging technologies

• Use technology roadmaps to extrapolate use in “strawman” designs• Analyze results & Id “Challenges”

Page 8

Context: Focus on Energy, Not Power

Page 9

CMOS Energy 101

[Circuit sketch: a driver with resistance R charging a load capacitance C from input Vin]

Charging the capacitance dissipates CV²/2 in the resistance and stores CV²/2 on the capacitor; discharging then dissipates the stored CV²/2 from the capacitance.

One full clock cycle (charge + discharge) therefore dissipates C·V².
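A minimal numeric sketch of that rule (the component values are made up for illustration):

```python
# Dynamic switching energy per clock cycle for a node capacitance C driven
# rail-to-rail at supply voltage V: charging dissipates C*V^2/2 in the driver
# and stores C*V^2/2 on the capacitor; discharging dissipates the stored half,
# so a full cycle costs C*V^2.
def switching_energy_joules(c_farads: float, vdd_volts: float) -> float:
    return c_farads * vdd_volts ** 2

# Hypothetical 1 fF node capacitance at two supply voltages.
print(switching_energy_joules(1e-15, 1.0))  # 1.0e-15 J  (1 fJ per switched node per cycle)
print(switching_energy_joules(1e-15, 0.7))  # ~4.9e-16 J (roughly 2x less, from Vdd alone)
```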

Page 10

ITRS Projections

• Assume the capacitance of a circuit scales as feature size

[Chart: ITRS projections of energy per operation vs. technology node; annotations: 330X, 15X]

• 90nm picked as the breakpoint because that's when Vdd, and thus clocks, flattened
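The consequence of that assumption can be sketched numerically (a rough illustration with placeholder Vdd values, not the chart's ITRS data):

```python
# Sketch of the scaling argument: energy per switch = C * Vdd^2, with C assumed
# proportional to feature size. The Vdd values below are rough placeholders
# meant only to show the effect of the post-90nm Vdd flattening.
nodes_nm = [250, 180, 130, 90, 65, 45, 32]
vdd_volts = {250: 2.5, 180: 1.8, 130: 1.2, 90: 1.0, 65: 0.9, 45: 0.85, 32: 0.8}

def relative_energy(node_nm: int) -> float:
    # Normalize to the 250 nm node: E ~ (feature size) * Vdd^2.
    return (node_nm / 250) * (vdd_volts[node_nm] / vdd_volts[250]) ** 2

for n in nodes_nm:
    print(f"{n:>4} nm: {relative_energy(n):.4f}x of 250 nm energy")
# Down to 90 nm both terms shrink; afterwards Vdd is nearly flat, so energy per
# operation falls only about linearly with feature size.
```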

Page 11

Energy Efficiency

Page 12

The 3 Study Strawmen

Page 13

Architectures Considered

• Evolutionary Strawmen
  – "Heavyweight" strawman based on commodity-derived microprocessors
  – "Lightweight" strawman based on custom microprocessors

• Aggressive Strawman
  – "Clean sheet of paper" CMOS silicon

Page 14

A Modern HPC System

Approximate distributions (from the slide's pie charts):

Category       Power    Silicon Area    Board Area
Processors     56%      3%              24%
Memory         9%       86%             10%
Routers        33%      3%              8%
Random         2%       8%              8%
White space    n/a      n/a             50%

Page 15

A “Light Weight” Node Alternative

2 nodes per "Compute Card." Each node:
• A low-power compute chip
• Some memory chips
• "Nothing else"

System architecture:
• Multiple identical boards per rack
• Each board holds multiple Compute Cards
• "Nothing else"

Compute chip:
• 2 simple dual-issue cores
• Each with dual FPUs
• Memory controller
• Large eDRAM L3
• 3D message interface
• Collective interface
• All at sub-GHz clock

"Packaging the Blue Gene/L supercomputer," IBM J. R&D, March/May 2005
"Blue Gene/L compute chip: Synthesis, timing, and physical design," IBM J. R&D, March/May 2005

Page 16

Possible System Power Models: Interconnect-Driven

• Simplistic: a highly optimistic model
  – Max power per die grows as per ITRS
  – Power for memory grows only linearly with # of chips (power per memory chip remains constant)
  – Power for routers and common logic remains constant, regardless of the obvious need to increase bandwidth
  – True if the energy per bit moved/accessed decreases as fast as "flops per second" increase

• Fully Scaled: a pessimistic model
  – Same as Simplistic, except memory & router power grow with peak flops per chip
  – True if the energy per bit moved/accessed remains constant

• Real world: somewhere in between
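A toy rendering of the two bounding models (all chip counts and per-chip watt figures below are hypothetical placeholders, not the study's projections):

```python
def system_power_megawatts(n_sockets: int, n_mem_chips: int, n_routers: int,
                           w_socket: float, w_mem_chip: float, w_router: float,
                           flops_growth: float, fully_scaled: bool) -> float:
    # Simplistic: memory and router power per chip stay constant.
    # Fully Scaled: memory and router power grow with peak flops per chip.
    scale = flops_growth if fully_scaled else 1.0
    watts = (n_sockets * w_socket
             + n_mem_chips * w_mem_chip * scale
             + n_routers * w_router * scale)
    return watts / 1e6

cfg = dict(n_sockets=200_000, n_mem_chips=3_200_000, n_routers=50_000,
           w_socket=150.0, w_mem_chip=2.0, w_router=30.0, flops_growth=10.0)
print(system_power_megawatts(**cfg, fully_scaled=False))  # ~37.9 MW (optimistic bound)
print(system_power_megawatts(**cfg, fully_scaled=True))   # ~109.0 MW (pessimistic bound)
```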

Page 17

1 EFlop/s “Clean Sheet of Paper” Strawman

• 4 FPUs + RegFiles per core (= 6 GF/s @ 1.5 GHz)
• 1 chip = 742 cores (= 4.5 TF/s)
• 213 MB of L1 I&D; 93 MB of L2
• 1 node = 1 processor chip + 16 DRAMs (16 GB)
• 1 group = 12 nodes + 12 routers (= 54 TF/s)
• 1 rack = 32 groups (= 1.7 PF/s)
• 384 nodes / rack
• 3.6 EB of disk storage included
• 1 system = 583 racks (= 1 EF/s)
• 166 MILLION cores
• 680 MILLION FPUs
• 3.6 PB = 0.0036 bytes/flop
• 68 MW with aggressive assumptions

(A quick arithmetic check of these numbers follows below.)

Sizing done by “balancing” power budgets with achievable capabilities

Largely due to Bill Dally, Stanford
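The per-level numbers multiply out roughly as listed; a sketch of the peak-flops bookkeeping only (not the study's actual sizing spreadsheet):

```python
CLOCK_HZ = 1.5e9
flops_per_core   = 4 * CLOCK_HZ                    # 4 FPUs/core -> 6 GF/s
flops_per_chip   = 742 * flops_per_core            # ~4.45 TF/s
flops_per_group  = 12 * flops_per_chip             # ~53 TF/s
flops_per_rack   = 32 * flops_per_group            # ~1.7 PF/s
flops_per_system = 583 * flops_per_rack            # ~1.0 EF/s

nodes      = 583 * 384                             # 223,872 nodes
cores      = nodes * 742                           # ~166 million cores
dram_bytes = nodes * 16e9                          # 16 GB per node -> ~3.6 PB

print(f"system peak : {flops_per_system / 1e18:.2f} EF/s")
print(f"cores       : {cores / 1e6:.0f} million")
print(f"DRAM        : {dram_bytes / 1e15:.2f} PB "
      f"({dram_bytes / flops_per_system:.4f} bytes/flop)")
```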

Page 18

A Single Node (No Router)

[Chart: node power breakdown]

Characteristics:
• 742 cores; 4 FPUs/core
• 16 GB of "stacked" DRAM
• 290 Watts
• 1.08 TF peak @ 1.5 GHz
• ~3000 flops per cycle

Page 19

1 EFlop/s Aggressive Strawman Data Center Power Distribution

• 12 nodes per group
• 32 groups per rack
• 583 racks
• 1 EFlop/s / 3.6 PB
• 166 million cores
• 67 MWatts

Page 20

Data Center Performance Projections

[Chart: projected peak performance over time. Annotations: "Exascale", "But not at 20 MW!", "Heavyweight", "Lightweight"]

Page 21

Power Efficiency

• Exascale study goal: 1 EFlop/s @ 20 MW
• UHPC RFI goals: Module: 80 GF/W; System: 50 GF/W

[Chart: power efficiency trends, with the Aggressive Strawman marked]

Page 22

Energy Efficiency

Page 23

Data Center Total Concurrency

[Chart: projected total concurrency, rising from thousand-way through million-way to billion-way concurrency]

Page 24

Data Center Core Parallelism

170 million cores, AND we will need 10-100X more threading for latency management

Page 25

Key Take-Aways

• Developing Exascale systems is really tough
  – In any time frame, for any of the 3 classes

• Evolutionary progression is at best 2020-ish
  – With limited memory

• 4 key challenge areas:
  – Power
  – Concurrency
  – Memory capacity
  – Resiliency

• Requires coordinated, cross-disciplinary efforts

Page 26

Embedded Exa: A Deep Dive into Interconnect to Deliver Operands

Page 27

Tapers

• Bandwidth taper: how the effective bandwidth of operands being sent to a functional unit varies with the location of the operands in the memory hierarchy.
  – Units: GBytes/sec, bytes/clock, operands per flop time

• Energy taper: how the energy cost of transporting operands to a functional unit varies with the location of the operands in the memory hierarchy.
  – Units: GBytes/Joule, operands per Joule

• Ideal tapers: "flat" - it doesn't matter where the operands are.

• Real tapers: huge dropoffs
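As a concrete rendering of the bandwidth-taper idea (a sketch with made-up bandwidth numbers for a hypothetical node, not measured values):

```python
PEAK_FLOPS = 4.5e12       # peak flop/s of the node
OPERAND_BYTES = 8         # one 64-bit operand

levels_bytes_per_s = {    # assumed sustained bandwidth from each level to the FPUs
    "register file": 100e12,
    "L1":             30e12,
    "L2":             10e12,
    "local DRAM":      1.0e12,
    "remote node":     0.05e12,
}

for level, bw in levels_bytes_per_s.items():
    operands_per_flop = bw / OPERAND_BYTES / PEAK_FLOPS
    print(f"{level:>13}: {operands_per_flop:7.4f} operands per flop time")
# A "flat" (ideal) taper would keep this ratio roughly constant; real machines
# drop by roughly three orders of magnitude from registers to remote memory.
```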

Page 28

An Exa Single Node for Embedded

Page 29

The Access Path: Interconnect-Driven

[Block diagram: a MICROPROCESSOR with its Memory, connected through a ROUTER to More Routers and, ultimately, to some sort of memory structure]

Page 30

Sample Path – Off-Module Access
1. Check local L1 (miss)
2. Go thru TLB to remote L3 (miss)
3. Across chip to correct port (thru routing table RAM)
4. Off-chip to router chip
5. 3 times thru router and out
6. Across microprocessor chip to correct DRAM I/F
7. Off-chip to get to correct DRAM chip
8. Cross DRAM chip to correct array block
9. Access DRAM array
10. Return data to correct I/F
11. Off-chip to return data to microprocessor
12. Across chip to router table
13. Across microprocessor to correct I/O port
14. Off-chip to correct router chip
15. 3 times thru router and out
16. Across microprocessor to correct core
17. Save in L2, L1 as required
18. Into register file
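To see why this path matters energetically, here is a toy tally over those steps (all per-hop energies are invented placeholders, not the study's data):

```python
# Per-hop energy (pJ per 72-bit operand) along the off-module access path.
path_picojoules = [
    ("L1 + TLB + L3 lookups (misses)",                    10),
    ("cross-chip traversal to the correct port",          20),
    ("off-chip link to the router chip",                  60),
    ("3 router traversals, outbound",                     90),
    ("cross-chip to the DRAM interface + off-chip link",  80),
    ("DRAM array access",                                 15),
    ("return trip (links, 3 routers, chip crossings)",   230),
    ("cache fills and register-file write",               10),
]
total_pj = sum(pj for _, pj in path_picojoules)
dram_pj = dict(path_picojoules)["DRAM array access"]
print(f"total: ~{total_pj} pJ per remote operand; "
      f"the array access alone is only {dram_pj / total_pj:.0%} of it")
```

With numbers of this flavor, the transport hops dominate and the DRAM array access itself is a small fraction of the total.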

Page 31

Taper Data from Exascale Report

Page 32

Bandwidth Tapers

1,000X Decrease Across the System!!

Page 33

Energy Analysis: Possible Storage Components

[Table: candidate storage components, with energy estimates from Cacti 5.3 and extrapolation to the technology for a 2017 system]

Page 34

Summary Transport Options

1 Operand/word = 72 bits
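Since one transported operand/word is 72 bits, per-bit and per-operand costs convert as follows (the 1 pJ/bit example is an assumed figure, not one from the slides):

$$
E_{\text{operand}} = 72\,E_{\text{bit}}, \qquad
\text{operands/Joule} = \frac{1}{72\,E_{\text{bit}}};
\qquad E_{\text{bit}} = 1\ \text{pJ} \;\Rightarrow\; 72\ \text{pJ/operand} \approx 1.4\times 10^{10}\ \text{operands/J}.
$$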

Page 35

Energy Tapers vs Goals

Page 36

Reachable Memory vs Energy

Page 37

What Does This Tell Us?

• There are a lot more energy sinks than you think
  – And we have to take all of them into consideration
• Cost of interconnect dominates
• Must design for on-board or stacked DRAM, with DRAM blocks CLOSE
  – Take physical placement into account
• Reach vs. energy per access looks "linear"
• For 80 GF/W, cannot afford ANY memory references (quick check below)
• We NEED to consider the entire access path:
  – Alternative memory technologies: reduce access cost
  – Alternative packaging: reduce bit movement cost
  – Alternative transport protocols: reduce # of bits moved
  – Alternative execution models: reduce # of movements
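A quick check of the 80 GF/W point (the transport cost used here is an assumed illustrative figure, not the report's):

$$
\frac{1}{80\ \text{GF/W}} = \frac{1\ \text{J}}{8\times 10^{10}\ \text{flop}} = 12.5\ \text{pJ per flop},
$$

so if moving one 72-bit operand off-module costs even ~1 pJ/bit (about 72 pJ per operand), a single off-module operand per flop would exceed the entire energy budget several times over; operands must overwhelmingly come from registers and the closest levels of memory.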