Sung Jong Lee ([email protected])
Dept. of Physics, University of Suwon
Challenges in Parallel Supercomputing
2011 1st KIAS Parallel Computation Workshop
February 22-23, 2011
Contents
• Brief History of Supercomputing (www.top500.org)
• Grand Challenge Problems
• Present Machines' Characteristics
• Challenges of Exascale Supercomputing
• Summary
• References
Automobile Crash Simulations at Audi
• A virtual car undergoes 100,000 crash simulations (48 months) before the first prototype is built. Then real crash tests are conducted.
• Audi's supercomputer ranks 260th among the Top 500 supercomputers (Nov. 2010)
Performance Measure
Flops = Floating-Point Operations / Second
Megaflops (MF/s) = 10^6 flops
Gigaflops (GF/s) = 10^9 flops
Teraflops (TF/s) = 10^12 flops
Petaflops (PF/s) = 10^15 flops
Exaflops (EF/s) = 10^18 flops
Zettaflops (ZF/s) = 10^21 flops
Yottaflops (YF/s) = 10^24 flops
Milestones in Supercomputing
• GigaFlops : M13, Scientific Research Institute of Computer Complexes, Moscow (1984)
• TeraFlops : ASCI Red, Sandia National Lab. (1996)
• PetaFlops : Roadrunner, Los Alamos National Lab.
(2008)
History
• Alan Turing (1912-1954)
* Turing-Welchman
Bombe (1938)
* Used for Breaking
German Enigma, etc
History
• Seymour Cray (1925-1996)
  – Developed CDC 1604, the first fully transistorized supercomputer (1958)
  – CDC 6600 (1965), 9 MFlops
  – Founded Cray Research in 1972
    • CRAY-1 (1976), 160 MFlops
    • CRAY-2 (1985)
    • CRAY-3 (1989)
Supercomputers in the USSR (Ukraine)
• M13, Scientific Research Institute of Computer Complexes, Moscow (1984)
• 2.4 Gigaflops
• Led by Mikhail A. Kartsev (1923~1983?), developer of supercomputers for space observation
Architectures: Shared vs. Distributed
• Shared memory
  – Easy programming: one global memory
  – Bottleneck: memory access
• Distributed memory
  – Message passing: Send/Recv
  – Scalable
  – Programming: not easy
Architectural Transitions
• Vector Processors (70s ~ 90s): Cray-1, Cray-2, CRAY-XMP, CRAY-YMP, SX-2, VP-200, etc.
• Massively Parallel Processors (90s ~ 2000): Cray-T3E, CM5, VPP-500, nCUBE, SP2, PARAGON
• Clusters (2000 ~ )
• Multicore Processors (2003? ~ )
Cray-1 (1976)
Installed at Los Alamos National Lab.
Cost: $8.8 Million
Performance: 160 Mflops
Main Memory: 8 MB
Present Architectural Trends
* Transition to simplicity and parallelism, driven by three trends:
1) Single-processor performance is no longer improving significantly
   - Explicit parallelism is the only way to increase performance.
2) Constant field scaling has come to an end
   - Threshold voltage cannot be reduced further (due to leakage current)
   - New processors are simpler (better performance per unit power)
3) Main memory latency continues to increase, and main memory bandwidth continues to decrease, relative to processor cycle time and execution rate. Memory bandwidth and latency become the performance-limiting factors!
Multi-Core Processors
• Three classes of multi-core die microarchitectures
Recent Multi-core CPUs
Tilera's TILE-Gx CPU:
• 100 cores
• Performance: 750 * 10^9 32-bit ops/s
• Power consumption: 10~55 W
• Memory bandwidth: 500 Gb/s
Intel's 48-core CPU:
• Performance:
• Power consumption: 25 W ~ 125 W
• On-die power management
• Clock speed: 1.66~1.83 GHz
• Memory bandwidth:
GPU
Nvidia Tesla M2050/70 GPU:
• 448 CUDA cores
• 3 GB / 6 GB GDDR5 memory
• Power consumption: 225 W
• Memory bandwidth: 148 GB/s
• Performance: 515 Gflops (double precision)
State of the Art Summary
* 50 years of reliance on the von Neumann model:
1) Split between memory and CPU, a sequential thread, a model of sequential execution
2) Memory Wall: the performance of memory has not kept up with the improvement in CPU clock rates, leading to multi-level caches and deep memory hierarchies.
   Complexity increases when multiple CPUs attempt to share the same memory.
CPU and Memory Cycle Time Trend
* DARPA report, 2008, p103
State of the Art Summary (2)
3) Power Wall: rise of power as a first-class constraint.
   Concomitant flattening of CPU clock rates → multi-cores.
   Already several hundred cores on a die; expect thousands of cores on a die.
   But more cores demand more memory bandwidth, which is not achievable due to power concerns.
4) Attempts to modify the von Neumann model by blurring the boundary between memory and processing logic.
TOP 500 Supercomputers (Nov 2010) (http://www.top500.org)
TOP Supercomputers (Nov 2010)
• 7 systems exceed 1 PFlop/s
• Entry to Top 10: 0.8 PFlop/s
• Entry to Top 100: 76 TFlop/s
• Entry to Top 500: 31.1 TFlop/s
Rmax Maximal LINPACK performance achieved
Rpeak Theoretical peak performance
• Top 1: Tianhe-1A (NSC, China)
  – Rmax = 2.57 Petaflops, 186,368 cores
  – Main memory = 229.4 TB
  – 14,336 Xeon X5670 processors, 7,168 Nvidia Tesla M2050 GPGPUs, and 2,048 NUDT FT1000 heterogeneous processors
• Top 2: Jaguar (Oak Ridge National Lab., USA)
  – Rmax = 1.76 Petaflops, 224,162 cores (Cray XT5-HE, Opteron 6-core 2.6 GHz)
  – Main memory = not available
Clock Rate in the Top 10 Supercomputers
Processor Parallelism in the Top 10 Supercomputers
How About Korea?
• Haedam (19th) & Haeon (20th) (Korea Meteorological Administration): 316.40 Tflops (45,120 cores)
• TachyonII (24th) (KISTI): 274.80 Tflops (26,232 cores), main memory = 157.392 TB
TachyonII and IBM p6
System Performance by Countries
Countries Share Over Time (1993~2010)
Architecture Share Over time (1993~2010)
Interconnect Family Share over Time (1993~2010)
Special-Purpose Supercomputer
• Anton (D. E. Shaw Research Group, 2008)
• 512 processing nodes with a 3D-torus topology
• Each node includes a special MD engine as a single ASIC
• Theoretical Maximum Performance = Flops
• Net Power Consumption = KW
KIAS Cluster Case
• 418 nodes (44,826 cores)
• Theoretical maximum performance = 67 TeraFlops
• Net power consumption = 154.8 KW
• This includes a GPU cluster with 24 nodes (43,008 cores), 49 TFlops
Grand Challenge Problems
* Astrophysics problems
* High-energy physics, nuclear physics
* Materials science: design of novel materials, quantum structure calculations
* Atmospheric science: weather forecasting, etc.
* Fusion research: magnetohydrodynamics of plasmas, etc.
* Macromolecular structure modeling and dynamics: protein structures and folding dynamics

Example: Protein Folding
How does a protein physically fold from a denatured state into its native conformation?
Computational Load of a Folding Simulation by Molecular Dynamics
• Suppose a protein + 1,000 water molecules: approximately 3,000 atoms
• Integration time step = 10^-15 s
• # of long-range force calculations at each time step ~ 1000 * 1000 = 10^6
• Then a one-millisecond (10^-3 s) simulation corresponds to 10^12 * 10^6 = 10^18 calculations!
This is an exascale problem!
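The arithmetic above can be checked with a short sketch, using only the numbers assumed on the slide (1 fs time step, ~10^6 force terms per step, one millisecond of simulated time):

```python
# Back-of-the-envelope workload of the folding simulation above.
dt = 1e-15                       # integration time step (s)
t_total = 1e-3                   # target simulated time (s): one millisecond
forces_per_step = 1_000 * 1_000  # long-range pair interactions per step

steps = t_total / dt                   # ~10^12 time steps
total_calcs = steps * forces_per_step  # ~10^18 force calculations

print(f"steps ~ {steps:.0e}, total calculations ~ {total_calcs:.0e}")
```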
Example 2a: WRF (Weather Research and Forecast) Model: Full-Scale Nature Run
At present:
(1) 5 km * 5 km horizontal resolution, 101 vertical levels on the hemisphere → 2*10^9 cells
(2) time step = ? milliseconds
--- 10 Teraflops sustained on 10,000 5-Gflops nodes (2007)
If the resolution is refined to ~1 km, then 5*10^10 cells.
If sustained at exascale, it would require 10 PB of main memory, with I/O requirements up to 1000 times today's.
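The 1 km cell count follows from the 5 km grid by pure scaling: refining the horizontal resolution by 5x multiplies the number of columns by (5/1)^2 = 25, with the vertical levels unchanged. A quick sketch:

```python
# Refining the WRF horizontal grid from 5 km to 1 km multiplies the
# number of cells by (5/1)^2 = 25; vertical levels stay the same.
cells_5km = 2e9              # cells at 5 km resolution (from the slide)
refinement = (5 / 1) ** 2    # 25x more horizontal columns at 1 km
cells_1km = cells_5km * refinement
print(f"cells at 1 km ~ {cells_1km:.0e}")  # 5e+10, as on the slide
```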
Types of Challenge Problems
* Parallelism :
(a) Embarrassingly Parallel Problems
(b) Coarse-Grained Problems
(c) Fine-Grained Problems
* Computation vs. Memory:
(a) CPU-intensive Problems: Molecular Dynamics
(b) Memory-Intensive Problems : Bioinformatics,
Data Analysis in Large Data Experiments
(High Energy Experiments)
DARPA Report on Exascale Computing Challenges (2008)
"ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems"
Peter Kogge, Editor & Study Lead
* Objective: understand the course of mainstream technology and determine the primary challenges to reaching a 1000x increase in computing capability by 2015.
Four Main Challenges for Exascale Supercomputing
(1) The Energy and Power Challenge
(2) The Memory and Storage Challenge
(3) The Concurrency and Locality Challenge
(4) The Resiliency Challenge
* DARPA Report 2008
Energy and Power Challenge
• Power / Performance (average over the Top 10 sites)
  = 2.67 KW / Teraflops = 2.67 nJ/flop = 2.67x10^-9 J/flop
• Simple extrapolation to exascale: around 1~2 Gigawatts needed for 1 Exaflops!
  ~ the capacity of a whole nuclear power plant!!
  e.g., the 21 nuclear power plants in KOREA produced ~19 GW of electricity in 2009
Power Consumption vs. Performance of the top 10 supercomputers (Nov. 2010)
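The gigawatt figure is just the Top-10 average energy cost per flop scaled to 10^18 flop/s; a one-line sketch reproduces it:

```python
# Naive extrapolation of today's energy cost per flop to an Exaflops machine.
energy_per_flop = 2.67e-9   # J/flop (= 2.67 kW per Teraflops, Top-10 average)
exaflops = 1e18             # target rate: 10^18 flop/s
power_watts = energy_per_flop * exaflops
print(f"required power ~ {power_watts / 1e9:.2f} GW")  # ~2.67 GW
```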
Energy and Power Challenge (continued)
• Consider the case of a recent GPU, the Tesla M2050/70:
  Power / Performance ≈ 229 W / 515 Gigaflops
  This means around 400 MW for 1 Exaflops, solely for the processing units alone!
The Most Energy-Efficient Supercomputers (Nov. 2010)
http://www.green500.org
The Memory and Storage Challenge
• (1) Main memory: assume 1 GB per chip; then 1 PB = a million chips.
  Realistic main memory sizes of 10 PB ~ 100 PB
  ---> 10M ~ 100M chips!!
  --> (a) Multiple power and resiliency issues (plus cost!)
      (b) Bandwidth challenge: how chips are organized and interfaced with other components.
  * Need to increase memory densities and bandwidths by orders of magnitude.
• (2) Secondary storage: need ~100 times the main memory size
  (a) Bandwidth challenge
  (b) Challenge of managing metadata (file descriptors, i-nodes, file control blocks, etc.)
• DARPA Report 2008, p213~214
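The chip counts follow directly from the 1 GB-per-chip assumption (decimal units throughout):

```python
# DRAM chip count at 1 GB (10^9 bytes) of capacity per chip.
chip_capacity = 1e9  # bytes per chip
for mem_pb in (1, 10, 100):
    chips = mem_pb * 1e15 / chip_capacity
    print(f"{mem_pb:>3} PB of main memory -> {chips:.0e} chips")
```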
The Concurrency Challenge
• Total concurrency ≡ the total # of operations (flops) that must be initiated on each and every cycle: billion-way concurrency needed!!
• DARPA Report 2008, p214~215

Processor Parallelism
• Parallelism ≡ the number of distinct threads that make up the execution of a program
• Present maximum ~ order of 100,000; need to go to 10^8: 100~1000 times the present value!
• DARPA Report 2008, p216
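The billion-way figure can be motivated with a one-line estimate; the ~1.5 GHz clock used here is an assumed representative value for the era, not a number from the report:

```python
# Operations that must be in flight each cycle to sustain 1 Eflop/s.
target_flops = 1e18   # exascale goal (flop/s)
clock_hz = 1.5e9      # assumed representative core clock (~1-2 GHz era)
ops_per_cycle = target_flops / clock_hz
print(f"~{ops_per_cycle:.1e} concurrent operations per cycle")
```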
The Resiliency Challenge
• Resiliency ≡ the property of a system to continue effective operation even in the presence of faults, either in hardware or software.
• More and different forms of faults and disruptions than in today's systems:
  * Huge number of components: 10^6 to 10^8 memory chips & 10^6 disk drives
  * High clock rates increase bit error rates (BER) on data transmission
  * Aging effects in the fault characteristics of devices
  * Smaller feature sizes increase the sensitivity of devices to SEUs (Single Event Upsets), e.g., cosmic rays, radiation
  * Low operating voltages with low power increase the effect of noise sources, such as the power supply
  * Increased levels of concurrency increase the potential for races, metastable states, and difficult timing problems
• DARPA Report 2008, p217
Aggressive Strawman Architecture
* DARPA Report 2008, p177
To achieve 1 Exaflops:
* 1 core = 4 FPUs + L1 cache memory
* 1 node = 742 cores on a 4.5 Tflops, 150 W (active power) processor chip
* 1 group = 12 nodes + routing
* 1 rack = 32 groups
* System = 583 racks
* Total # of nodes ~ 223,000
* Total # of cores ~ 223,000 * 742 ≈ 1.66 * 10^8
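Multiplying out the hierarchy above reproduces the totals (a sketch using the slide's numbers):

```python
# Totals for the aggressive strawman machine: node -> group -> rack -> system.
cores_per_node = 742
flops_per_node = 4.5e12       # 4.5 Tflops per processor chip (node)
nodes_per_group = 12
groups_per_rack = 32
racks = 583

nodes = nodes_per_group * groups_per_rack * racks
cores = nodes * cores_per_node
peak = nodes * flops_per_node
print(f"nodes = {nodes:,}, cores ~ {cores:.2e}, peak ~ {peak / 1e18:.2f} Eflops")
```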
System Interconnect
• DARPA Report 2008, p 128
Interconnect bandwidth requirements for an Exascale system
Characteristics of the Aggressive Strawman Architecture
* DARPA Report 2008, p176
Performance = 1 Exaflops
Total Memory = 3.6 PB
Total Power Consumption = 67.7 MW!
* Main memory and interconnect have important shares in the power consumption.
Power Distribution in Aggressive Strawman System
Characteristics of the Aggressive Strawman Architecture
* DARPA Report 2008, p188
(1) Performance = 1 Exaflops
    Total DRAM Memory = 3.6 PB
    Disk Storage = 3,600 PB = 3.6 EB
    Performance per Watt = 14.7 Gflops/Watt
    Total # of Cores = 1.66 * 10^8
    # of Microprocessor Chips = 223,872
    Total Power Consumption = 67.7 MW!
(2) If scaled down to 20 MW of power:
    Performance = 0.303 Exaflops = 303 Petaflops
    Total DRAM Memory = 1.0 PB
    Disk Storage = 1,080 PB = 1.08 EB
    Total # of Cores = 5.04 * 10^7
    # of Microprocessor Chips = 67,968
(Projected to the year 2015)
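The efficiency figure is simply performance divided by power; a quick check of the slide's numbers:

```python
# Performance per watt of the strawman system: 1 Eflop/s at 67.7 MW.
perf_flops = 1e18
power_w = 67.7e6
gflops_per_watt = perf_flops / power_w / 1e9
print(f"~{gflops_per_watt:.1f} Gflops/W")  # ~14.8, close to the slide's 14.7
```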
Possibilities for Exascale Hardware
* DARPA Report 2008
(a) Energy-efficient circuits and architecture in silicon
    * Communication circuits & memory circuits
(b) Alternative low-energy devices and circuits for logic and memory
    e.g., * Superconducting RSFQ (Rapid Single Flux Quantum) devices
          * Cross-bar architectures with novel bi-state devices
(c) Alternative low-energy systems for memory and storage
    * New levels in the memory hierarchy
    * Rearchitecting conventional DRAMs
(d) 3D interconnect, packaging, and cooling
(e) Photonic interconnect
Logic Devices and Memory Devices
* DARPA Report 2008, Ch. 6
(a) Alternative low-energy devices and circuits for logic and memory
    e.g., * Superconducting RSFQ (Rapid Single Flux Quantum) devices: extremely low power consumption
(b) Alternative memory types (non-volatile RAMs)
    * Phase-Change Memory (PCRAM) - two resistance states (crystalline vs. amorphous)
    * SONOS memory
    * Magnetic Random Access Memory (MRAM) - a fast non-volatile memory technology
    * FeRAM, resistive RAM (RRAM)
3D Packaging (A)
* DARPA Report 2008, p160
• Potential direction for 3D packaging (A)

3D Packaging (B), (C)
* DARPA Report 2008, p161
• Potential direction for 3D packaging (B)
• Potential direction for 3D packaging (C)
Possible Aggressive Packaging of a Single Node
* DARPA Report 2008
Each chip consists of 36 super-cores (6 by 6), each containing 21 cores (742 cores in total).
A Strawman Design with Optical Interconnects
* DARPA Report 2008, p191-198
Chip super-core organization and photonic interconnect:
* On-chip optical interconnect
* Off-chip optical interconnect
* Rack-to-rack optical interconnect
* Optically connected memory and storage system

Rack-to-Rack Optical System Interconnect
* DARPA Report 2008, p195
A Possible Optically Connected Memory Stack
* DARPA Report 2008, p197
Total memory power = 8.5 MW ~ 12 MW
Exascale Architectures and Programming Models
* DARPA Report 2008
(a) System architectures and programming models to reduce communication
    * Design in self-awareness of the status of energy usage at all levels, and the ability to maintain a specific power level
    * More explicit and dynamic program control over the contents of memory structures (so that minimal communication energy is expended)
    * Alternative execution and programming models
(b) Locality-aware architectures
    * Optimize data placement and movement
Exascale Algorithm and Application Development
* DARPA Report 2008
Presently O(10^5) processors; exascale needs O(10^8) processors and possibly O(10^10) threads.
(a) Power and resiliency models in application models
(b) Understanding and adapting old algorithms
(c) Inventing new algorithms
(d) Inventing new applications
(e) Making applications resiliency-aware
Resilient Exascale Systems
* DARPA Report 2008
(a) Energy-efficient error detection and correction architectures
(b) Fail-in-place and self-healing systems
(c) Checkpoint roll-back and recovery
(d) Algorithmic-level fault checking and fault resiliency
(e) Vertically-integrated resilient systems
Summary
• Exascale Supercomputing Requires New Technology
• Possibly expected around ~2020
• The Power Wall and Memory Wall must be overcome
• 3D packaging and optical interconnects should be pursued
• Alternative materials for memory and logic devices
  - e.g., superconducting devices, spintronics-based devices
• Different programming models
• Reversible logic and computing
References
• TOP 500 Supercomputers (http://www.top500.org)
• P. Kogge (ed.), ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems (DARPA Report, 2008)
Thank You