TRANSCRIPT
2 www.parallel.illinois.edu
Parallel @ Illinois
Illiac IV
UPCRC
Cloud Computing Testbed
OpenSparc Center of Excellence
CUDA Center of Excellence
Extreme Scale Computing
3 www.parallel.illinois.edu
Blue Waters
• Sustained petaflop/s on complex applications (QCD, turbulence, molecular dynamics, …)
• > 200,000 cores
• > 800 TB memory
• > 10 PB disk
• > 500 PB tape
• 100-400 Gbps external BW
• IBM Power 7 technology
4 www.parallel.illinois.edu
POWER7: IBM’s Next Generation, Balanced POWER Server Chip
POWER7: Core Execution Units
2 fixed-point units, 2 load/store units, 4 double-precision floating-point units, 1 branch unit, 1 condition register unit, 1 vector unit, 1 decimal floating-point unit; 6-wide dispatch
Distributed recovery function; 1-, 2-, and 4-way SMT support; out-of-order execution; 32KB I-cache; 32KB D-cache; 256KB L2 tightly coupled to the core
[Core block diagram: IFU, CRU/BRU, ISU, DFU, FXU, VSX/FPU, LSU, 256KB L2]
Hot Chips IBM Presentation
5 www.parallel.illinois.edu
Power 7 Chip
POWER7: IBM’s Next Generation, Balanced POWER Server Chip
POWER7 Processor Chip
567 mm²; technology: 45nm lithography, Cu, SOI, eDRAM; 1.2B transistors (equivalent function of 2.7B thanks to eDRAM efficiency)
Eight processor cores; 12 execution units per core; 4-way SMT per core; 32 threads per chip; 256KB L2 per core
32MB on-chip eDRAM shared L3; dual DDR3 memory controllers; 100GB/s sustained memory bandwidth per chip
Scalability up to 32 sockets; 360GB/s SMP bandwidth per chip; 20,000 coherent operations in flight
Advanced data and instruction prefetching; binary compatibility with POWER6
* Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.
6 www.parallel.illinois.edu
Possible Power 7 Package
POWER7: IBM’s Next Generation, Balanced POWER Server Chip
POWER7 Design Principles:
Cores: 8-, 6-, and 4-core offerings with up to 32MB of L3 cache; dynamically turn cores on and off, reallocating energy; dynamically vary individual core frequencies, reallocating energy; dynamically enable and disable up to 4 threads per core
Memory subsystem: full 8-channel or reduced 4-channel configurations
System topologies: standard, half-width, and double-width SMP busses supported
Multiple system packages
Flexibility and Adaptability
• 2/4s Blades and Racks – Single Chip Organic: 1 memory controller, 4B local links
• High-End and Mid-Range – Single Chip Glass Ceramic: 2 memory controllers, 8B local links, 2 8B remote links
• Compute Intensive – Quad-chip MCM: 8 memory controllers, 16B local links (on MCM)
* Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.
7 www.parallel.illinois.edu
Performance grows 1,000-fold every 11 years
(Kogge)
Can we achieve the next jump?
8 www.parallel.illinois.edu
Moore’s Law Continues
Moore’s Law is Alive and Well
[Chart: transistors (in thousands), 1970–2010, log scale]
Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović. Will continue in the coming decade.
(Olukotun)
9 www.parallel.illinois.edu
Clock Frequency Stagnant
But Clock Frequency Scaling Replaced by Scaling Cores / Chip
[Chart: transistors (in thousands), frequency (MHz), and cores, 1970–2010, log scale]
Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović
15 years of exponential growth (~2×/year) has ended
(Olukotun)
10 www.parallel.illinois.edu
Performance Has Also Slowed, Along with Power
[Chart: transistors (in thousands), frequency (MHz), power (W), performance, and cores, 1970–2010, log scale]
Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović
Power is the root cause of all this
Future increases in performance will come only from increases in number of concurrent threads
End of the Single-Thread Era: little/no benefit from increased transistor count; decreasing benefit from frequency increases; power limits reached
11 www.parallel.illinois.edu
Number of Cores Increases Rapidly
This Has Also Impacted HPC System Concurrency
Exponential wave of increasing concurrency for the foreseeable future! 1M cores sooner than you think!
[Chart: sum of the # of cores in the top 15 systems (from top500.org)]
A million cores in a couple of years; a billion threads in a decade?
12 www.parallel.illinois.edu
Power Budget
[Chart (Borkar): energy and power budget for an Exaflop machine with today’s technology, broken down into compute, memory, comm, and disk]
GW for Exaflop/s with today’s technology (at 10^18 flop/s, every nanojoule per operation translates into a gigawatt); 100 MW in a decade?
(Borkar)
13 www.parallel.illinois.edu
Aggressive Power Scaling
The Power & Energy Challenge
[Chart (Borkar): power budget for an Exaflop machine with aggressive technology scaling, broken down into compute, memory, comm, and disk]
• ~1B threads
• Heterogeneous architecture
• Mostly nearest-neighbor communication
• Long “cache lines”
• High error rates
(Borkar)
20 MW for Exaflop/s with an aggressive, likely custom design, and a harder-to-program machine. Software to the rescue!
14 www.parallel.illinois.edu
Main Issues
• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization
15 www.parallel.illinois.edu
Managing 1B threads
• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization
16 www.parallel.illinois.edu
Scaling Applications
Weak scaling: use a more powerful machine to solve a larger problem – increase application size and keep running time constant; e.g., refine the grid
• Larger problem may not be of interest – identify problems that require petascale performance
• May want to scale time, not space (e.g., molecular dynamics) – study parallelism in the time domain
• Cannot scale space without scaling time (iterative methods): granularity decreases and communication increases
17 www.parallel.illinois.edu
Scaling Iterative Methods
• Assume that the number of cores (and compute power) increases by a factor of k
• Space and time scales are refined by a factor of k^(1/4)
• Mesh size increases by a factor of k^(3/4)
• Per-core cell volume decreases by a factor of k^(1/4)
• Per-core cell area decreases by a factor of k^((1/4)×(2/3)) = k^(1/6)
• Area-to-volume ratio (communication-to-computation ratio) increases by a factor of k^(1/4) / k^(1/6) = k^(1/12)
• Per-core computation is finer grained and needs relatively more communication
• (Per-chip computation is coarser grained and needs relatively less communication if most of the increase in # cores is per chip)
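The exponents above can be checked mechanically. Below is a minimal Python sketch (an illustration, not from the talk) that reproduces the factors for a few values of k, assuming three space dimensions plus time and surface-proportional communication.

```python
# A minimal sketch of the scaling arithmetic above: given a factor-k increase
# in core count, refine space and time by k^(1/4) and track how per-core work
# and communication change.

def iterative_scaling(k):
    refine = k ** 0.25                 # space and time scales refined by k^(1/4)
    mesh_growth = refine ** 3          # 3 spatial dimensions: mesh grows by k^(3/4)
    cells_per_core = mesh_growth / k   # per-core volume shrinks by k^(1/4)
    area_per_core = cells_per_core ** (2 / 3)      # surface ~ volume^(2/3): shrinks by k^(1/6)
    comm_to_comp = area_per_core / cells_per_core  # grows by k^(1/12)
    return mesh_growth, cells_per_core, comm_to_comp

for k in (16, 256, 4096):
    mesh, work, ratio = iterative_scaling(k)
    print(f"k={k:5d}: mesh x{mesh:7.1f}, per-core work x{work:0.3f}, comm/comp x{ratio:0.3f}")
```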
18 www.parallel.illinois.edu
Debugging and Tuning: Observing 1B Threads
• Scalable infrastructure to control and instrument 1B threads
• On-the-fly sensor data stream mining to identify “anomalies”
• Need the ability to express “normality” (global correctness and performance assertions)
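As an illustration of the stream-mining bullet, here is a minimal Python sketch; the metric and threshold are invented for the example, not taken from the talk.

```python
# A minimal sketch of flagging anomalous threads in a stream of per-thread
# metrics (e.g., seconds per iteration), using a robust median/MAD test.
import statistics

def find_anomalies(samples, k=5.0):
    """Return threads whose metric deviates from the median by more than k * MAD."""
    med = statistics.median(samples.values())
    mad = statistics.median(abs(v - med) for v in samples.values()) or 1e-12
    return {tid: v for tid, v in samples.items() if abs(v - med) > k * mad}

samples = {tid: 1.0 + 0.001 * (tid % 7) for tid in range(1000)}  # "normal" threads
samples[42] = 3.7                                                # one straggler
print(find_anomalies(samples))                                   # -> {42: 3.7}
```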
19 www.parallel.illinois.edu
Locality
• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization
20 www.parallel.illinois.edu
It’s the Memory, Stupid
• CPU performance is determined, within 10%-20%, by the trace of memory accesses [Snavely] ☛ Algorithm design should focus on data accesses, not operations
– Temporal locality: cluster accesses in time
– Spatial locality: match data storage to access order (not vice versa); use partially-constrained iterators
– Processor locality: cluster accesses in processor space
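A minimal Python sketch (not from the talk) of the spatial-locality point: with row-major storage, looping j innermost visits consecutive offsets, while looping i innermost strides by N between accesses.

```python
# Row-major storage of an N x N matrix: element (i, j) lives at offset i*N + j.
N = 1024
A = [0.0] * (N * N)

def sum_storage_order(A):
    # Access order matches storage order: unit stride, good spatial locality.
    return sum(A[i * N + j] for i in range(N) for j in range(N))

def sum_strided_order(A):
    # Access order fights storage order: stride-N jumps between accesses.
    return sum(A[i * N + j] for j in range(N) for i in range(N))
```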
21 www.parallel.illinois.edu
Theory Problem: Communication Complexity
• Results exist in combinatorial & limited algebraic models (sorting, FFT graph, n³ matrix product, …); need similar results for numerical algorithms
• E.g., what is the trade-off between communication and convergence rate in domain decomposition methods?
22 www.parallel.illinois.edu
Heterogeneity
• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization
23 www.parallel.illinois.edu
Hybrid Communication
• Multiple levels of caches and of cache sharing
• Different communication models intra- and inter-node
– Coherent shared memory inside the chip (node)
– rDMA (put/get/update) across nodes
• Communication architecture changes every HW generation
• Need to easily adjust the number of cores & replace inter-node communication with intra-node communication
• Easy to “downgrade” (use shared memory for message passing); hard to “upgrade”; hence the tendency to use the lowest common denominator (message passing)
• No good interoperability between shared memory (e.g., OpenMP) and message passing (MPI)
24 www.parallel.illinois.edu
Possible Directions
• Express cache-oblivious algorithms using recursive domain splitting (à la TBB) – see the sketch after this list
– Methods to (i) split the domain; (ii) execute sequentially, if the domain is “small”; and (iii) merge back
– Need adaptation for iterative methods to reuse the partition
– Leads, naturally, to algorithms where communication is less frequent at the tree root
– May provide two method extensions:
• Distributed-memory splitting/merging
• Shared-memory splitting/merging
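A minimal Python sketch (illustration only, not a TBB or distributed implementation) of the split / solve-sequentially-below-a-cutoff / merge pattern; in a real setting the two halves would run in parallel, with distributed-memory splitting near the root of the tree and shared-memory splitting near the leaves.

```python
# Recursive domain splitting: (i) split, (ii) solve sequentially if "small",
# (iii) merge. The "domain" here is an index range and the work is a sum.
def solve(domain, cutoff=1024):
    lo, hi = domain
    if hi - lo <= cutoff:                  # (ii) small enough: sequential leaf
        return sum(range(lo, hi))
    mid = (lo + hi) // 2                   # (i) split the domain in two
    left = solve((lo, mid), cutoff)        #     the halves are independent, so they
    right = solve((mid, hi), cutoff)       #     could run on different cores or nodes
    return left + right                    # (iii) merge the partial results

print(solve((0, 1_000_000)))               # 499999500000
```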
25 www.parallel.illinois.edu
Hybrid Computation
• Vector/SIMD instructions
• Different core types
• Accelerators
• Can significantly reduce energy per flop
• Require (now) different source code
• Easy to compile CUDA to multicore (downgrade); hard to compile general OpenMP code to GPU (upgrade)
26 www.parallel.illinois.edu
GPU as a Disruptive Technology
• Disruptive technology: a “good enough,” cheaper technology that replaces a better, more expensive one, starting with the low end and expanding upward (Christensen)
– Kills the better technology before it can really replace it at the very high end
– HPC is high-end
• GPU is a disruptive technology: it will either kill/swallow the CPU or be swallowed by it
• Probable long-term outcome: tightly coupled cores with homogeneous architecture but heterogeneous performance that are not normally used concurrently
• Warning: the arguments in favor of hybrid architectures have not changed in the last 30 years.
27 www.parallel.illinois.edu
Do You Trust Your Results?
• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization
28 www.parallel.illinois.edu
Resilience
• Transient errors are more frequent:
– More transistors
– Smaller transistors
– Lower voltage
– More manufacturing variance
• Error detection is expensive (e.g., nVidia vs. Power 7)
• Checkpoint/restart, as currently done, does not scale
• Need new, more scalable error recovery algorithms
• Supercomputers built of low-cost commodity components may suffer from (too) high a rate of undetected errors
– Will need software error detection or fault-tolerant algorithms
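To see why checkpoint/restart stops scaling, here is a back-of-the-envelope Python sketch using the standard first-order checkpointing model (Young's approximation); the per-node MTBF and checkpoint time are illustrative assumptions, not figures from the talk.

```python
# Periodic checkpoint/restart: system MTBF shrinks as 1/#nodes while the time
# to write a global checkpoint does not, so the overhead blows up at scale.
import math

node_mtbf_h = 5 * 365 * 24      # assumed per-node MTBF: 5 years
checkpoint_h = 15 / 60          # assumed time to write one global checkpoint: 15 min

for nodes in (1_000, 10_000, 100_000, 1_000_000):
    mtbf_h = node_mtbf_h / nodes                              # system-level MTBF
    tau_h = math.sqrt(2 * checkpoint_h * mtbf_h)              # Young's optimal interval
    overhead = checkpoint_h / tau_h + tau_h / (2 * mtbf_h)    # checkpoint + rework fraction
    print(f"{nodes:>9} nodes: system MTBF {mtbf_h:8.2f} h, overhead ~{overhead:.0%}")
```

Under these assumptions the overhead passes 100% somewhere around 10^5 nodes, i.e., the machine would spend essentially all of its time checkpointing and recovering rather than computing.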
29 www.parallel.illinois.edu
Plus ça change, moins c’est la même chose (the more things change, the less they stay the same)
• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization
30 www.parallel.illinois.edu
Bulk Synchronous
• Many parallel applications are written in a “bulk-synchronous style”: alternating stages of local computation and global communication
• The model implicitly assumes that all processes advance at the same compute speed
• The assumption breaks down for an increasingly large number of reasons
– Black swan effect
– OS jitter
– Application jitter
– HW jitter
31 www.parallel.illinois.edu
Jitter Causes
• Black swan effect (see the sketch after this list)
– If each thread is unavailable (busy) for 1 msec once a month, then most collective communications involving 1B threads take > 1 msec
• OS jitter
– Background OS activities (daemons, heartbeats, …)
• HW jitter
– Background error recovery activities (e.g., memory error correction, memory scrubbing, re-execution); power management; management of manufacturing variability; degraded operation modes
• Application jitter
– Input-dependent variability in computation intensity
• Need to move away from bulk synchronous model
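A back-of-the-envelope Python sketch of the black-swan arithmetic (an illustration under the stated 1 ms/month assumption, not a quote from the talk): the fraction of instantaneous barriers over N threads that catch at least one stalled thread grows quickly with N, and a collective that itself takes time, or a chain of dependent collectives, is hit even more often.

```python
# Probability that a barrier over N threads finds at least one thread inside
# its 1 ms/month stall window (instantaneous-barrier approximation).
p_busy = 1e-3 / (30 * 24 * 3600)        # fraction of time one thread is unavailable

for n_threads in (1_000_000, 100_000_000, 1_000_000_000):
    p_delayed = 1.0 - (1.0 - p_busy) ** n_threads
    print(f"{n_threads:>13,} threads: P(barrier delayed) ~ {p_delayed:.1%}")
```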
32 www.parallel.illinois.edu
Possible Approaches
• Eliminate unneeded synchronizations
– Code compilation/refactoring for added asynchrony
• Need better analysis tools to identify the critical path (“read/post early; use late” may not work)
– Dynamic scheduling (e.g., Dongarra’s latest LU codes)
– Virtualization (e.g., Charm++)
• Reduce needed producer-consumer synchronizations or stretch them in time
– Theory question: how do delayed updates affect the convergence rate of iterative solvers?
33 www.parallel.illinois.edu
Task Virtualization
• Multiple logical tasks are scheduled on each physical core; tasks are scheduled non-preemptively; task migration is supported
– Hides variance and communication latency
– Helps with scalability (decouples # tasks from # cores)
– Helps with resiliency
– Needed for modularity (multiphysics/multiscale codes – handling parallel coupling of modules)
– Improves performance (better locality)
– Scales (Charm++/AMPI)
– Can be implemented below MPI or PGAS languages
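A minimal Python sketch (an illustration of the idea, not Charm++/AMPI) of why over-decomposition helps: with many more tasks than cores and non-preemptive greedy scheduling, variance in task cost is absorbed, whereas one statically bound chunk per core is limited by its slowest core.

```python
# Over-decomposition: schedule many small tasks non-preemptively onto whichever
# core frees up first, and compare with one statically bound chunk per core.
import heapq, random

def makespan(task_costs, n_cores):
    cores = [(0.0, c) for c in range(n_cores)]      # (time the core becomes free, id)
    heapq.heapify(cores)
    for cost in task_costs:                         # non-preemptive: task runs to completion
        t_free, c = heapq.heappop(cores)
        heapq.heappush(cores, (t_free + cost, c))
    return max(t for t, _ in cores)

random.seed(0)
n_cores = 64
costs = [random.expovariate(1.0) for _ in range(n_cores * 16)]   # 16 tasks per core, variable cost
static_chunks = [sum(costs[i::n_cores]) for i in range(n_cores)] # 1 pre-bound chunk per core
print("1 chunk/core :", round(makespan(static_chunks, n_cores), 2))
print("16 tasks/core:", round(makespan(costs, n_cores), 2))
```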
34 www.parallel.illinois.edu
Task Virtualization Styles
• Varying, user-controlled number of tasks (Charm++)
– Locality achieved by the load balancer
• Implicit tasks: e.g., TBB
– Locality is achieved implicitly
35 www.parallel.illinois.edu
On the Need for Culture Change
• Increased parallelism
• Need for locality
• Heterogeneity
• Resilience
• Variability
• Virtualization
• Socialization
36 www.parallel.illinois.edu
Big Systems Are Expensive
• 1% performance gain on a 4-week run = $100,000. Are we willing to invest a man-year to get it?
• Would we have our undergraduate students implement a major experiment at CERN?
• Major supercomputing application codes should be developed by professional teams that include specialized engineers – including a performance engineer and a SW architect
– Incentives should encourage this model
37 www.parallel.illinois.edu
Good Engineers Need Good Tools
Need integrated development environments
• Expert-friendly tools for good engineers – a steep learning curve is necessary (no easy way to learn brain surgery)
• Analysis, debugging, and performance tools fully integrated in the development environment at all levels of code creation/refactoring
– Correctness/performance information is presented in terms of the programmer’s interface
– Compiler analysis and performance information are available for refactoring
• Support a systematic methodology for performance debugging
– Requires a performance model
• Will not come from industry – no market – but can leverage industrial infrastructure
• Performance programming can be made easier, but will never be easy – we have not automated bridge building, either
38 www.parallel.illinois.edu
International Politics of Supercomputing
• An exascale system
– Will cost ~$1B
– Will consume 20-50 MW
– May use much less commodity technology than current supercomputers
– May not have any military application
• Should supercomputing be done by international consortia?
39 www.parallel.illinois.edu