TRANSCRIPT
2 www.parallel.illinois.edu
Parallel @ Illinois
Illiac IV
UPCRC
Cloud Computing Testbed
OpenSparc Center of Excellence
CUDA Center of Excellence
Extreme Scale Computing
3 www.parallel.illinois.edu
Blue Waters
• Sustained petaflop/s on complex applications (QCD, turbulence, molecular dynamics, …)
• > 200,000 cores
• > 800 TB memory
• > 10 PB disk
• > 500 PB tape
• 100-400 Gbps external BW
• IBM Power 7 technology
4 www.parallel.illinois.edu
POWER7: IBM’s Next Generation, Balanced POWER Server Chip
POWER7: Core Execution Units
2 fixed-point units, 2 load/store units, 4 double-precision floating-point units, 1 branch unit, 1 condition register unit, 1 vector unit, 1 decimal floating-point unit; 6-wide dispatch
Distributed recovery function; 1-, 2-, and 4-way SMT support; out-of-order execution; 32KB I-cache; 32KB D-cache; 256KB L2 tightly coupled to the core
[Core block diagram: IFU, CRU/BRU, ISU, DFU, FXU, VSX/FPU, LSU, 256KB L2]
Hot Chips IBM Presentation
5 www.parallel.illinois.edu
Power 7 Chip
POWER7: IBM’s Next Generation, Balanced POWER Server Chip
POWER7 Processor Chip
567 mm²; technology: 45nm lithography, Cu, SOI, eDRAM; 1.2B transistors (equivalent function of 2.7B thanks to eDRAM efficiency)
Eight processor cores; 12 execution units per core; 4-way SMT per core; 32 threads per chip; 256KB L2 per core
32MB on-chip eDRAM shared L3; dual DDR3 memory controllers; 100GB/s sustained memory bandwidth per chip
Scalability up to 32 sockets; 360GB/s SMP bandwidth per chip; 20,000 coherent operations in flight
Advanced data and instruction prefetching; binary compatibility with POWER6
* Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.
6 www.parallel.illinois.edu
Possible Power 7 Package
POWER7: IBM’s Next Generation, Balanced POWER Server Chip
POWER7 Design Principles:
Cores: 8-, 6-, and 4-core offerings with up to 32MB of L3 cache; dynamically turn cores on and off, reallocating energy; dynamically vary individual core frequencies, reallocating energy; dynamically enable and disable up to 4 threads per core
Memory subsystem: full 8-channel or reduced 4-channel configurations
System topologies: standard, half-width, and double-width SMP busses supported
Multiple system packages
Flexibility and Adaptability
• 2/4s Blades and Racks – Single Chip Organic: 1 memory controller, 4B local links
• High-End and Mid-Range – Single Chip Glass Ceramic: 2 memory controllers, 8B local links, 2 8B remote links
• Compute Intensive – Quad-chip MCM: 8 memory controllers, 16B local links (on MCM)
* Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.
7 www.parallel.illinois.edu
Performance grows 1,000-fold every 11 years
(Kogge)
Can we achieve the next jump?
8 www.parallel.illinois.edu
Moore’s Law Continues
Moore’s Law is Alive and Well
[Chart: transistors (in thousands), 1970–2010, log scale]
Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović. Will continue in the coming decade.
(Olukotun)
9 www.parallel.illinois.edu
Clock Frequency Stagnant
But Clock Frequency Scaling Replaced by Scaling Cores / Chip
[Chart: transistors (in thousands), frequency (MHz), and cores, 1970–2010, log scale]
Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović
15 years of exponential growth (~2×/year) has ended
(Olukotun)
10 www.parallel.illinois.edu
Performance Has Also Slowed, Along with Power
[Chart: transistors (in thousands), frequency (MHz), power (W), performance, and cores, 1970–2010, log scale]
Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović
Power is the root cause of all this
Future increases in performance will come only from increases in number of concurrent threads
End of the Single-Thread Era: little/no benefit from increased transistor count; decreasing benefit from frequency increases; power limits reached
11 www.parallel.illinois.edu
Number of Cores Increases Rapidly
This Has Also Impacted HPC System Concurrency
Exponential wave of increasing concurrency for the foreseeable future! 1M cores sooner than you think!
[Chart: sum of the # of cores in the top 15 systems (from top500.org)]
A million cores in a couple of years; a billion threads in a decade?
12 www.parallel.illinois.edu
Power Budget
[Chart (Borkar): energy and power budget for an Exaflop machine with today’s technology, broken down into compute, memory, comm, and disk]
GW for Exaflop/s with today’s technology (at 10^18 flop/s, every nanojoule per operation translates into a gigawatt); 100 MW in a decade?
(Borkar)
13 www.parallel.illinois.edu
Aggressive Power Scaling
The Power & Energy Challenge
[Chart (Borkar): power budget for an Exaflop machine with aggressive technology scaling, broken down into compute, memory, comm, and disk]
• ~1B threads
• Heterogeneous architecture
• Mostly nearest-neighbor communication
• Long “cache lines”
• High error rates
(Borkar)
20 MW for Exaflop/s with an aggressive, likely custom design, and a harder-to-program machine. Software to the rescue!
14 www.parallel.illinois.edu
Main Issues
• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization
15 www.parallel.illinois.edu
Managing 1B threads
• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization
16 www.parallel.illinois.edu
Scaling Applications
Weak scaling: use a more powerful machine to solve a larger problem – increase application size and keep running time constant; e.g., refine the grid
• Larger problem may not be of interest – identify problems that require petascale performance
• May want to scale time, not space (e.g., molecular dynamics) – study parallelism in the time domain
• Cannot scale space without scaling time (iterative methods): granularity decreases and communication increases
17 www.parallel.illinois.edu
Scaling Iterative Methods
• Assume that the number of cores (and compute power) increases by a factor of k
• Space and time scales are refined by a factor of k^(1/4)
• Mesh size increases by a factor of k^(3/4)
• Per-core cell volume decreases by a factor of k^(1/4)
• Per-core cell area decreases by a factor of k^((1/4)×(2/3)) = k^(1/6)
• Area-to-volume ratio (communication-to-computation ratio) increases by a factor of k^(1/4) / k^(1/6) = k^(1/12)
• Per-core computation is finer grained and needs relatively more communication
• (Per-chip computation is coarser grained and needs relatively less communication if most of the increase in # cores is per chip)
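The exponents above can be checked mechanically. Below is a minimal Python sketch (an illustration, not from the talk) that reproduces the factors for a few values of k, assuming three space dimensions plus time and surface-proportional communication.

```python
# A minimal sketch of the scaling arithmetic above: given a factor-k increase
# in core count, refine space and time by k^(1/4) and track how per-core work
# and communication change.

def iterative_scaling(k):
    refine = k ** 0.25                 # space and time scales refined by k^(1/4)
    mesh_growth = refine ** 3          # 3 spatial dimensions: mesh grows by k^(3/4)
    cells_per_core = mesh_growth / k   # per-core volume shrinks by k^(1/4)
    area_per_core = cells_per_core ** (2 / 3)      # surface ~ volume^(2/3): shrinks by k^(1/6)
    comm_to_comp = area_per_core / cells_per_core  # grows by k^(1/12)
    return mesh_growth, cells_per_core, comm_to_comp

for k in (16, 256, 4096):
    mesh, work, ratio = iterative_scaling(k)
    print(f"k={k:5d}: mesh x{mesh:7.1f}, per-core work x{work:0.3f}, comm/comp x{ratio:0.3f}")
```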
18 www.parallel.illinois.edu
Debugging and Tuning: Observing 1B Threads
• Scalable infrastructure to control and instrument 1B threads
• On-the-fly sensor data stream mining to identify “anomalies”
• Need the ability to express “normality” (global correctness and performance assertions)
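As an illustration of the stream-mining bullet, here is a minimal Python sketch; the metric and threshold are invented for the example, not taken from the talk.

```python
# A minimal sketch of flagging anomalous threads in a stream of per-thread
# metrics (e.g., seconds per iteration), using a robust median/MAD test.
import statistics

def find_anomalies(samples, k=5.0):
    """Return threads whose metric deviates from the median by more than k * MAD."""
    med = statistics.median(samples.values())
    mad = statistics.median(abs(v - med) for v in samples.values()) or 1e-12
    return {tid: v for tid, v in samples.items() if abs(v - med) > k * mad}

samples = {tid: 1.0 + 0.001 * (tid % 7) for tid in range(1000)}  # "normal" threads
samples[42] = 3.7                                                # one straggler
print(find_anomalies(samples))                                   # -> {42: 3.7}
```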
19 www.parallel.illinois.edu
Locality
• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization
20 www.parallel.illinois.edu
It’s the Memory, Stupid
• CPU performance is determined, within 10%-20%, by the trace of memory accesses [Snavely] ☛ Algorithm design should focus on data accesses, not operations
– Temporal locality: cluster accesses in time
– Spatial locality: match data storage to access order (not vice versa); use partially-constrained iterators
– Processor locality: cluster accesses in processor space
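A minimal Python sketch (not from the talk) of the spatial-locality point: with row-major storage, looping j innermost visits consecutive offsets, while looping i innermost strides by N between accesses.

```python
# Row-major storage of an N x N matrix: element (i, j) lives at offset i*N + j.
N = 1024
A = [0.0] * (N * N)

def sum_storage_order(A):
    # Access order matches storage order: unit stride, good spatial locality.
    return sum(A[i * N + j] for i in range(N) for j in range(N))

def sum_strided_order(A):
    # Access order fights storage order: stride-N jumps between accesses.
    return sum(A[i * N + j] for j in range(N) for i in range(N))
```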
21 www.parallel.illinois.edu
Theory Problem: Communication Complexity
• Results exist in combinatorial & limited algebraic models (sorting, FFT graph, n³ matrix product, …); need similar results for numerical algorithms
• E.g., what is the trade-off between communication and convergence rate in domain decomposition methods?
22 www.parallel.illinois.edu
Heterogeneity
• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization
23 www.parallel.illinois.edu
Hybrid Communication
• Multiple levels of caches and of cache sharing
• Different communication models intra- and inter-node
– Coherent shared memory inside the chip (node)
– rDMA (put/get/update) across nodes
• Communication architecture changes every HW generation
• Need to easily adjust the number of cores & replace inter-node communication with intra-node communication
• Easy to “downgrade” (use shared memory for message passing); hard to “upgrade”; hence the tendency to use the lowest common denominator (message passing)
• No good interoperability between shared memory (e.g., OpenMP) and message passing (MPI)
24 www.parallel.illinois.edu
Possible Directions
• Express cache-oblivious algorithms using recursive domain splitting (à la TBB) – see the sketch after this list
– Methods to (i) split the domain; (ii) execute sequentially, if the domain is “small”; and (iii) merge back
– Need adaptation for iterative methods to reuse the partition
– Leads, naturally, to algorithms where communication is less frequent at the tree root
– May provide two method extensions:
• Distributed-memory splitting/merging
• Shared-memory splitting/merging
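A minimal Python sketch (illustration only, not a TBB or distributed implementation) of the split / solve-sequentially-below-a-cutoff / merge pattern; in a real setting the two halves would run in parallel, with distributed-memory splitting near the root of the tree and shared-memory splitting near the leaves.

```python
# Recursive domain splitting: (i) split, (ii) solve sequentially if "small",
# (iii) merge. The "domain" here is an index range and the work is a sum.
def solve(domain, cutoff=1024):
    lo, hi = domain
    if hi - lo <= cutoff:                  # (ii) small enough: sequential leaf
        return sum(range(lo, hi))
    mid = (lo + hi) // 2                   # (i) split the domain in two
    left = solve((lo, mid), cutoff)        #     the halves are independent, so they
    right = solve((mid, hi), cutoff)       #     could run on different cores or nodes
    return left + right                    # (iii) merge the partial results

print(solve((0, 1_000_000)))               # 499999500000
```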
25 www.parallel.illinois.edu
Hybrid Computation
• Vector/SIMD instructions
• Different core types
• Accelerators
• Can significantly reduce energy per flop
• Require (now) different source code
• Easy to compile CUDA to multicore (downgrade); hard to compile general OpenMP code to GPU (upgrade)
26 www.parallel.illinois.edu
GPU as a Disruptive Technology
• Disruptive technology: a “good enough,” cheaper technology that replaces a better, more expensive one, starting with the low end and expanding upward (Christensen)
– Kills the better technology before it can really replace it at the very high end
– HPC is high-end
• GPU is a disruptive technology: it will either kill/swallow the CPU or be swallowed by it
• Probable long-term outcome: tightly coupled cores with homogeneous architecture but heterogeneous performance that are not normally used concurrently
• Warning: the arguments in favor of hybrid architectures have not changed in the last 30 years.
27 www.parallel.illinois.edu
Do You Trust Your Results?
• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization
28 www.parallel.illinois.edu
Resilience
• Transient errors are more frequent:
– More transistors
– Smaller transistors
– Lower voltage
– More manufacturing variance
• Error detection is expensive (e.g., nVidia vs. Power 7)
• Checkpoint/restart, as currently done, does not scale
• Need new, more scalable error recovery algorithms
• Supercomputers built of low-cost commodity components may suffer from (too) high a rate of undetected errors
– Will need software error detection or fault-tolerant algorithms
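To see why checkpoint/restart stops scaling, here is a back-of-the-envelope Python sketch using the standard first-order checkpointing model (Young's approximation); the per-node MTBF and checkpoint time are illustrative assumptions, not figures from the talk.

```python
# Periodic checkpoint/restart: system MTBF shrinks as 1/#nodes while the time
# to write a global checkpoint does not, so the overhead blows up at scale.
import math

node_mtbf_h = 5 * 365 * 24      # assumed per-node MTBF: 5 years
checkpoint_h = 15 / 60          # assumed time to write one global checkpoint: 15 min

for nodes in (1_000, 10_000, 100_000, 1_000_000):
    mtbf_h = node_mtbf_h / nodes                              # system-level MTBF
    tau_h = math.sqrt(2 * checkpoint_h * mtbf_h)              # Young's optimal interval
    overhead = checkpoint_h / tau_h + tau_h / (2 * mtbf_h)    # checkpoint + rework fraction
    print(f"{nodes:>9} nodes: system MTBF {mtbf_h:8.2f} h, overhead ~{overhead:.0%}")
```

Under these assumptions the overhead passes 100% somewhere around 10^5 nodes, i.e., the machine would spend essentially all of its time checkpointing and recovering rather than computing.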
29 www.parallel.illinois.edu
Plus ça change, moins c’est la même chose (the more things change, the less they stay the same)
• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization
30 www.parallel.illinois.edu
Bulk Synchronous
• Many parallel applications are written in a “bulk-synchronous style”: alternating stages of local computation and global communication
• The model implicitly assumes that all processes advance at the same compute speed
• The assumption breaks down for an increasingly large number of reasons
– Black swan effect
– OS jitter
– Application jitter
– HW jitter
31 www.parallel.illinois.edu
Jitter Causes
• Black swan effect (see the sketch after this list)
– If each thread is unavailable (busy) for 1 msec once a month, then most collective communications involving 1B threads take > 1 msec
• OS jitter
– Background OS activities (daemons, heartbeats, …)
• HW jitter
– Background error recovery activities (e.g., memory error correction, memory scrubbing, re-execution); power management; management of manufacturing variability; degraded operation modes
• Application jitter
– Input-dependent variability in computation intensity
• Need to move away from bulk synchronous model
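A back-of-the-envelope Python sketch of the black-swan arithmetic (an illustration under the stated 1 ms/month assumption, not a quote from the talk): the fraction of instantaneous barriers over N threads that catch at least one stalled thread grows quickly with N, and a collective that itself takes time, or a chain of dependent collectives, is hit even more often.

```python
# Probability that a barrier over N threads finds at least one thread inside
# its 1 ms/month stall window (instantaneous-barrier approximation).
p_busy = 1e-3 / (30 * 24 * 3600)        # fraction of time one thread is unavailable

for n_threads in (1_000_000, 100_000_000, 1_000_000_000):
    p_delayed = 1.0 - (1.0 - p_busy) ** n_threads
    print(f"{n_threads:>13,} threads: P(barrier delayed) ~ {p_delayed:.1%}")
```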
32 www.parallel.illinois.edu
Possible Approaches
• Eliminate unneeded synchronizations
– Code compilation/refactoring for added asynchrony
• Need better analysis tools to identify the critical path (“read/post early; use late” may not work)
– Dynamic scheduling (e.g., Dongarra’s latest LU codes)
– Virtualization (e.g., Charm++)
• Reduce needed producer-consumer synchronizations or stretch them in time
– Theory question: how do delayed updates affect the convergence rate of iterative solvers?
33 www.parallel.illinois.edu
Task Virtualization
• Multiple logical tasks are scheduled on each physical core; tasks are scheduled non-preemptively; task migration is supported
– Hides variance and communication latency
– Helps with scalability (decouples # tasks from # cores)
– Helps with resiliency
– Needed for modularity (multiphysics/multiscale codes – handling parallel coupling of modules)
– Improves performance (better locality)
– Scales (Charm++/AMPI)
– Can be implemented below MPI or PGAS languages
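A minimal Python sketch (an illustration of the idea, not Charm++/AMPI) of why over-decomposition helps: with many more tasks than cores and non-preemptive greedy scheduling, variance in task cost is absorbed, whereas one statically bound chunk per core is limited by its slowest core.

```python
# Over-decomposition: schedule many small tasks non-preemptively onto whichever
# core frees up first, and compare with one statically bound chunk per core.
import heapq, random

def makespan(task_costs, n_cores):
    cores = [(0.0, c) for c in range(n_cores)]      # (time the core becomes free, id)
    heapq.heapify(cores)
    for cost in task_costs:                         # non-preemptive: task runs to completion
        t_free, c = heapq.heappop(cores)
        heapq.heappush(cores, (t_free + cost, c))
    return max(t for t, _ in cores)

random.seed(0)
n_cores = 64
costs = [random.expovariate(1.0) for _ in range(n_cores * 16)]   # 16 tasks per core, variable cost
static_chunks = [sum(costs[i::n_cores]) for i in range(n_cores)] # 1 pre-bound chunk per core
print("1 chunk/core :", round(makespan(static_chunks, n_cores), 2))
print("16 tasks/core:", round(makespan(costs, n_cores), 2))
```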
34 www.parallel.illinois.edu
Task Virtualization Styles
• Varying, user-controlled number of tasks (Charm++)
– Locality achieved by the load balancer
• Implicit tasks: e.g., TBB
– Locality is achieved implicitly
35 www.parallel.illinois.edu
On the Need for Culture Change
• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization
36 www.parallel.illinois.edu
Big Systems Are Expensive
• 1% performance gain on a 4-week run = $100,000. Are we willing to invest a man-year to get it?
• Would we have our undergraduate students implement a major experiment at CERN?
• Major supercomputing application codes should be developed by professional teams that include specialized engineers – including a performance engineer and a SW architect
– Incentives should encourage this model
37 www.parallel.illinois.edu
Good Engineers Need Good Tools
Need integrated development environments
• Expert-friendly tools for good engineers – a steep learning curve is necessary (no easy way to learn brain surgery)
• Analysis, debugging, and performance tools fully integrated in the development environment at all levels of code creation/refactoring
– Correctness/performance information is presented in terms of the programmer’s interface
– Compiler analysis and performance information are available for refactoring
• Support a systematic methodology for performance debugging
– Requires a performance model
• Will not come from industry – no market – but can leverage industrial infrastructure
• Performance programming can be made easier, but will never be easy – we have not automated bridge building, either
38 www.parallel.illinois.edu
International Politics of Supercomputing
• An exascale system
– Will cost ~$1B
– Will consume 20-50 MW
– May use much less commodity technology than current supercomputers
– May not have any military application
• Should supercomputing be done by international consortia?
39 www.parallel.illinois.edu