Preparing for Petascale and Beyond
Celso L. Mendes http://charm.cs.uiuc.edu/people/cmendes
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
Presentation Outline
• Present Status
– HPC Landscape, Petascale, Exascale
• Parallel Programming Lab
– Mission and approach
– Programming methodology
– Scalability results for S&E applications
– Other extensions and opportunities
– Some ongoing research directions
• Happening at Illinois
– Blue Waters, NCSA/IACAT
– Intel/Microsoft, NVIDIA, HP/Intel/Yahoo!, …
Current HPC Landscape
• Petascale era started!
– Roadrunner@LANL (#1 in Top500):
• Linpack: 1.026 Pflops, Peak: 1.375 Pflops
– Heterogeneous systems starting to spread (Cell, GPUs, …)
– Multicore processors widely used
– Current trends:
(Top500 performance trend chart; source: top500.org)
Current HPC Landscape (cont.)
• Processor counts:
– #1 Roadrunner@LANL: 122K
– #2 BG/L@LLNL: 212K
– #3 BG/P@ANL: 163K
• Exascale: sooner than we imagine…
– U.S. Dep. of Energy town hall meetings in 2007:
• LBNL (April), ORNL (May), ANL (August)
• Goals: discuss exascale possibilities and how to accelerate progress toward them
• Sections:
– Climate, Energy, Biology, Socioeconomic Modeling, Astrophysics, Math & Algorithms, Software, Hardware
• Report: http://www.er.doe.gov/ASCR/ProgramDocuments/TownHall.pdf
Current HPC Landscape (cont.)
• Current reality:
– Steady increase in processor counts
– Systems become multicore or heterogeneous
– “Memory wall” effects worsening
– MPI programming model still dominant
• Challenges (now and into the foreseeable future):
– How to exploit the power of new systems
– Capacity vs. capability: different problems
• Capacity is a concern for system managers
• Capability is a concern for users
– How to program in parallel effectively
• Both multicore (desktop) and million-core (supercomputers)
Parallel Programming Lab
Parallel Programming Lab - PPL
• http://charm.cs.uiuc.edu
• One of the largest research groups at Illinois
• Currently:
– 1 faculty, 3 research scientists, 4 research programmers
– 13 grad students, 1 undergrad student
– Open positions
(PPL group photo, April 2008)
PPL Mission and Approach
• To enhance Performance and Productivity in programming complex parallel applications
– Performance: scalable to thousands of processors
– Productivity: of human programmers
– Complex: irregular structure, dynamic variations
• Application-oriented yet CS-centered research
– Develop enabling technology for a wide collection of applications
– Develop, use and test it in the context of real applications
– Embody it into easy-to-use abstractions
– Implementation: Charm++
• Object-oriented runtime infrastructure
• Freely available for non-commercial use
Application-Oriented Parallel Abstractions
(Diagram: applications such as NAMD, ChaNGa, LeanCP, space-time meshing and rocket simulation exchange issues and techniques & libraries with Charm++ and other applications)
Synergy between Computer Science research and applications has been beneficial to both
Programming Methodology
Methodology: Migratable Objects
• Programmer: [over]decomposition into objects (“virtual processors” - VPs)
• Runtime: assigns VPs to real processors dynamically, during execution
(Diagram: user view of many VPs vs. system implementation on real processors)
• Enables adaptive runtime strategies
• Implementations: Charm++, AMPI

Benefits of Virtualization
• Software engineering
– Number of virtual processors can be independently controlled
– Separate VP sets for different modules in an application
• Message-driven execution
– Adaptive overlap of computation/communication
• Dynamic mapping
– Heterogeneous clusters: vacate, adjust to speed, share
– Automatic checkpointing
– Change set of processors used
– Automatic dynamic load balancing
– Communication optimization
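As a concrete, purely illustrative companion to the slide above (not code from the talk), the following minimal Charm++ sketch creates a 1D chare array with four objects ("virtual processors") per physical processor and lets the runtime place them; the module and class names (sketch, Main, Worker) are invented for this example.

    // --- sketch.ci (Charm++ interface file; all names are illustrative) ---
    mainmodule sketch {
      readonly CProxy_Main mainProxy;
      mainchare Main {
        entry Main(CkArgMsg *m);
        entry void done();
      };
      array [1D] Worker {
        entry Worker();
        entry void compute();
      };
    };

    // --- sketch.C ---
    #include "sketch.decl.h"
    /*readonly*/ CProxy_Main mainProxy;

    class Main : public CBase_Main {
      int remaining;                       // workers that still have to report back
     public:
      Main(CkArgMsg *m) {
        delete m;
        remaining = 4 * CkNumPes();        // over-decompose: 4 objects ("VPs") per processor
        mainProxy = thisProxy;
        CProxy_Worker workers = CProxy_Worker::ckNew(remaining);
        workers.compute();                 // broadcast an entry-method invocation to all elements
      }
      void done() { if (--remaining == 0) CkExit(); }
    };

    class Worker : public CBase_Worker {
     public:
      Worker() {}
      Worker(CkMigrateMessage *m) {}       // needed so the runtime can migrate this object
      void compute() {
        CkPrintf("VP %d running on PE %d\n", thisIndex, CkMyPe());
        mainProxy.done();                  // asynchronous message back to the main chare
      }
    };

    #include "sketch.def.h"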
Adaptive MPI (AMPI): MPI + Virtualization
• Each virtual process implemented as a user-level thread embedded in a Charm object
– Must properly handle globals and statics (analogous to what’s needed in OpenMP)
– But… thread context-switch is much faster than other techniques
(Diagram: MPI “processes” implemented as virtual processors (user-level migratable threads) mapped onto real processors)
Parallel Decomposition and Processors
• MPI-style:
– Encourages decomposition into P pieces, where P is the number of physical processors available
– If the natural decomposition is a cube, then the number of processors must be a cube
– Overlap of computation/communication is a user’s responsibility
• Charm++/AMPI style: “virtual processors”
– Decompose into natural objects of the application
– Let the runtime map them to physical processors
– Decouple decomposition from load balancing
Decomposition independent of numCores
• Rocket simulation example under traditional MPI vs. Charm++/AMPI framework
– Benefits: load balance, communication optimizations, modularity
(Diagram: under MPI, each of processors 1…P holds one Solid and one Fluid piece; under Charm++/AMPI, the solid domain is decomposed into objects Solid1…Solidn and the fluid domain into Fluid1…Fluidm, independently of P)
Dynamic Load Balancing
• Based on Principle of Persistence
– Computational loads and communication patterns tend to persist, even in dynamic computations
– Recent past is a good predictor of near future
• Implementation in Charm++:
– Computational entities (nodes, structured grid points, particles…) are partitioned into objects
– Load from objects may be measured during execution
– Objects are migrated across processors for balancing load
– Much smaller problem than repartitioning entire dataset
– Several available policies for load-balancing decisions
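The fragment below is a hedged sketch, not code from the talk, of the Charm++ hooks an application object typically uses to participate in this measurement-based load balancing; the class and method names (Cell, doTimestep) and the rebalancing frequency are illustrative, and the matching .ci interface file is omitted.

    #include "cell.decl.h"   // generated from the (omitted) cell.ci interface file
    #include <vector>

    class Cell : public CBase_Cell {
      std::vector<double> state;                // per-object data, moves with the object
      int step;
     public:
      Cell() : step(0) { usesAtSync = true; }   // opt in to AtSync() load balancing
      Cell(CkMigrateMessage *m) {}              // migration constructor
      void pup(PUP::er &p) {                    // serialize state so the runtime can migrate us
        p | state;  p | step;
      }
      void doTimestep() {
        // ... compute and exchange boundary data for this object ...
        step++;
        if (step % 100 == 0) AtSync();          // pause here; runtime measures load and may migrate
        else thisProxy[thisIndex].doTimestep(); // otherwise continue with the next step
      }
      void ResumeFromSync() {                   // called after (possible) migration
        thisProxy[thisIndex].doTimestep();
      }
    };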
Typical Load Balancing Phases
(Timeline diagram: regular timesteps, then instrumented timesteps, a detailed/aggressive load-balancing step, and later refinement load balancing)
Examples of Science & Engineering Charm++ Applications
NAMD: A Production MD Program
• Fully featured program
• NIH-funded development
• Distributed free of charge (~20,000 registered users)
• Binaries and source code
• Installed at NSF centers
• 20% cycles (NCSA, PSC)
• User training and support
• Large published simulations
• Gordon Bell award in 2002
• URL: www.ks.uiuc.edu/Research/namd
Spatial Decomposition Via Charm++
• Atoms distributed to cubes based on their location
• Size of each cube: just a bit larger than the cut-off radius
• Communicate only with neighbors
• Work: for each pair of neighboring objects
• Communication-to-computation ratio: O(1)
• However: load imbalance, limited parallelism
(Diagram: cells, cubes or “patches”)
Charm++ is useful to handle this
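As a small worked example of the sizing rule above (illustrative only, not NAMD code), one can choose the patch grid so that every cube edge comes out at least as large as the cut-off radius, which restricts within-cutoff interactions to neighboring cubes:

    #include <algorithm>
    #include <cmath>

    struct PatchGrid { int nx, ny, nz; double dx, dy, dz; };

    PatchGrid makePatchGrid(double lx, double ly, double lz, double cutoff) {
      PatchGrid g;
      g.nx = std::max(1, (int)std::floor(lx / cutoff));  // cubes per dimension
      g.ny = std::max(1, (int)std::floor(ly / cutoff));
      g.nz = std::max(1, (int)std::floor(lz / cutoff));
      g.dx = lx / g.nx;   // resulting edge lengths are "just a bit larger"
      g.dy = ly / g.ny;   // than (or equal to) the cut-off radius
      g.dz = lz / g.nz;
      return g;
    }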
Force Decomposition + Spatial Decomposition
• Now we have many objects over which to apply load balancing:
• Each diamond can be assigned to any processor
• Number of diamonds (3D):
– 14 × the number of patches
• 2-away variation:
– Half-size cubes, 5×5×5 interactions
• 3-away variation: 7×7×7 interactions
• Prototype NAMD versions created for Cell, GPUs
Object-based Parallelization for MD
Performance of NAMD: STMV
(Performance chart vs. number of cores; STMV benchmark, ~1 million atoms)
ChaNGa: Cosmological Simulations
• Collaborative project (NSF ITR)
– With Prof. Tom Quinn, Univ. of Washington
• Components: gravity (done), gas dynamics (almost)
• Barnes-Hut tree code
– Particles represented hierarchically in a tree according to their spatial position
– “Pieces” of the tree distributed across processors
– Gravity computation:
• “Nearby” particles: computed precisely
• “Distant” particles: approximated by the remote node’s center
• Software-caching mechanism, critical for performance
• Multi-timestepping: update frequently only the fastest particles (see Jetley et al., IPDPS’2008)
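The nearby/distant distinction above is the classic Barnes-Hut opening test; the sketch below (generic, not ChaNGa code) shows one common form of it, using an opening angle theta and treating a sufficiently distant node's center as a single body (G = 1 units; all names are illustrative).

    #include <cmath>

    struct TreeNode {
      double cx, cy, cz;      // node center (of mass)
      double mass, size;      // total mass and spatial extent of the node
      TreeNode *child[8];     // children (nullptr where absent)
      bool leaf;
    };

    void gravityWalk(const TreeNode *n, double px, double py, double pz,
                     double theta, double acc[3]) {
      double dx = n->cx - px, dy = n->cy - py, dz = n->cz - pz;
      double r2 = dx*dx + dy*dy + dz*dz + 1e-12;           // softened distance^2
      if (n->leaf || n->size * n->size < theta * theta * r2) {
        double f = n->mass / (r2 * std::sqrt(r2));          // "distant": use the node as one body
        acc[0] += f * dx;  acc[1] += f * dy;  acc[2] += f * dz;
      } else {
        for (int i = 0; i < 8; ++i)                         // "nearby": open the node, recurse
          if (n->child[i]) gravityWalk(n->child[i], px, py, pz, theta, acc);
      }
    }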
ChaNGa Performance
• Results obtained on BlueGene/L
• No multi-timestepping, simple load-balancers
Other Opportunities
MPI Extensions in AMPI
• Automatic load balancing
– MPI_Migrate(): collective operation, possible migration
• Asynchronous collective operations
– e.g. MPI_Ialltoall()
• Post the operation, test/wait for completion; do work in between
• Checkpointing support
– MPI_Checkpoint()
• Checkpoint to disk
– MPI_MemCheckpoint()
• Checkpoint in memory, with remote redundancy
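A hedged sketch of how these extensions might appear in an AMPI time-step loop is shown below; the signatures follow the slide (newer AMPI releases use AMPI_-prefixed names such as AMPI_Migrate), and the loop structure, frequencies and launch command are illustrative.

    // Typically launched with virtualization, e.g.
    //   ./charmrun +p16 ./a.out +vp64     (64 MPI "processes" on 16 processors)
    #include <mpi.h>

    void timestepLoop(int nsteps, double *sendbuf, double *recvbuf,
                      int count, MPI_Comm comm) {
      for (int step = 0; step < nsteps; ++step) {
        MPI_Request req;
        // asynchronous collective: post it, do independent work, then wait
        MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                      recvbuf, count, MPI_DOUBLE, comm, &req);
        // ... local computation that does not need the exchanged data ...
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        if (step % 100 == 0)
          MPI_Migrate();         // collective; the runtime may rebalance ranks here
        if (step % 1000 == 0)
          MPI_MemCheckpoint();   // in-memory checkpoint with remote redundancy
      }
    }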
Performance Tuning for Future Machines
• For example, Blue Waters will arrive in 2011
– But we need to prepare applications for it, starting now
• Even for extant machines:
– Full-size machine may not be available as often as needed for tuning runs
• A simulation-based approach is needed
• Our approach: BigSim
– Based on Charm++ virtualization approach
– Full-scale program emulation
– Trace-driven Simulation
– History: developed for BlueGene predictions
BigSim Simulation System
• General system organization
• Emulation:
– Run an existing, full-scale MPI, AMPI or Charm++ application
– Uses an emulation layer that pretends to be (say) 100k cores
• Target cores are emulated as Charm++ virtual processors
– Resulting traces (aka logs):
• Characteristics of SEBs (Sequential Execution Blocks)
• Dependences between SEBs and messages
BigSim Simulation System (cont.)
• Trace-driven parallel simulation
– Typically run on tens to hundreds of processors
– Multiple resolution simulation of sequential execution:
• from simple scaling factor to cycle-accurate modeling
– Multiple resolution simulation of the Network:
• from simple latency/bw model to detailed packet and switching port level modeling
– Generates timing traces just as a real application would on the full-scale machine
• Phase 3: Analyze performance
– Identify bottlenecks, even without predicting exact performance
– Carry out various “what-if” analyses
Projections: Performance Visualization
BigSim Validation: BG/L Predictions
(Chart: NAMD ApoA1 benchmark; actual vs. predicted execution time, in seconds, for 128 to 2,250 simulated processors)
Some Ongoing Research Directions
Load Balancing for Large Machines: I
• Centralized balancers achieve best balance
– Collect object-communication graph on one processor
– But won’t scale beyond tens of thousands of nodes
• Fully distributed load balancers
– Avoid the bottleneck, but achieve poor load balance
– Not adequately agile
• Hierarchical load balancers
– Careful control of what information goes up and down the hierarchy can lead to fast, high-quality balancers
Load Balancing for Large Machines: II
• Interconnection topology starts to matter again
– Was hidden due to wormhole routing etc.
– Latency variation is still small...
– But bandwidth occupancy (link contention) is a problem
• Topology-aware load balancers
– Some general heuristics have shown good performance
• But may require too much compute power
– Also, special-purpose heuristics work fine when applicable
– Preliminary results:
• see Bhatele & Kale’s paper, LSPP@IPDPS’2008
– Still, many open challenges
Major Challenges in Applications
• NAMD:
– Scalable PME (long range forces) – 3D FFT
• Specialized balancers for multi-resolution cases
– Ex: ChaNGa running highly-clustered cosmological datasets and multi-timestepping
(Processor-activity timelines, processor vs. time, black = processor activity: (a) single-stepping, (b) multi-timestepping, (c) multi-timestepping + special load balancing)
BigSim: Challenges
• BigSim’s simple diagram hides many complexities
• Emulation:
– Automatic out-of-core support for applications with large memory footprints
• Simulation:
– Accuracy vs. cost tradeoffs
– Interpolation mechanisms for prediction of serial performance
– Memory management optimizations
– I/O optimizations for handling (many) large trace files
• Performance analysis:
– Need scalable tools
• Active area of research
Fault Tolerance
• Automatic checkpointing
– Migrate objects to disk
– In-memory checkpointing as an option
– Both schemes above are available in Charm++
• Proactive fault handling
– Migrate objects to other processors upon detecting an imminent fault
– Adjust processor-level parallel data structures
– Rebalance load after migrations
– HiPC’07 paper: Chakravorty et al.
• Scalable fault tolerance
– When a processor out of 100,000 fails, the other 99,999 shouldn’t have to roll back to their checkpoints!
– Sender-side message logging
– Restart can be sped up by spreading out objects from the failed processor
– IPDPS’07 paper: Chakravorty & Kale
– Ongoing effort to minimize logging protocol overheads
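Continuing the earlier illustrative Main chare sketch, the fragment below shows how the two checkpointing modes above can be triggered with the Charm++ calls CkStartCheckpoint (to disk) and CkStartMemCheckpoint (in memory); the entry-method names and checkpoint directory are invented, and the corresponding .ci declarations are omitted.

    // Members of the (illustrative) Main chare; resumeAfterCheckpoint is assumed
    // to be declared as an entry method in the omitted interface file.
    void Main::requestCheckpoint(bool toDisk) {
      CkCallback cb(CkIndex_Main::resumeAfterCheckpoint(), mainProxy);
      if (toDisk)
        CkStartCheckpoint("ckpt", cb);     // objects are packed (via pup) and written to disk
      else
        CkStartMemCheckpoint(cb);          // checkpoint kept in memory, with a remote copy
    }

    void Main::resumeAfterCheckpoint() {
      // continue the application (also the resume point after a restart)
    }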
Higher Level Languages & Interoperability
HPC at Illinois
HPC at Illinois
• Many other exciting developments
– Microsoft/Intel parallel computing research center
– Parallel Programming Classes
• CS-420: Parallel Programming for Science and Engineering
• ECE-498: NVIDIA/ECE collaboration
– HP/Intel/Yahoo! Institute
– NCSA’s Blue Waters system approved for 2011
• see http://www.ncsa.uiuc.edu/BlueWaters/
– NCSA/IACAT new institute
• see http://www.iacat.uiuc.edu/
Microsoft/Intel UPCRC
• Universal Parallel Computing Research Center
• 5-year funding, 2 centers:
– Univ. of Illinois & Univ. of California, Berkeley
• Joint effort by Intel/Microsoft: $2M/year
• Mission:
– Conduct research to make parallel programming broadly accessible and “easy”
• Focus areas:
– Programming, Translation, Execution, Applications
• URL: http://www.upcrc.illinois.edu/
Parallel Programming Classes
• CS-420: Parallel Programming
– Introduction to fundamental issues in parallelism
– Students from both CS and other engineering areas
– Offered every semester, by CS Profs. Kale or Padua
• ECE-498: Programming Massively Parallel Processors
– Focus on GPU programming techniques
– ECE Prof. Wen-Mei Hwu
– NVIDIA’s Chief Scientist David Kirk
– URL: http://courses.ece.uiuc.edu/ece498/al1
HP/Intel/Yahoo! Initiative
• Cloud Computing Testbed, worldwide
• Goal:
– Study Internet-scale systems, focusing on data-intensive applications using distributed computational resources
• Areas of study:
– Networking, OS, virtual machines, distributed systems, data mining, Web search, network measurement, and multimedia
• Illinois/CS testbed site:
– 1,024-core HP system with 200 TB of disk space
– External access via an upcoming proposal selection process
• URL: http://www.hp.com/hpinfo/newsroom/press/2008/080729xa.html
Our Sponsors
PPL Funding Sources
• National Science Foundation
– BigSim, Cosmology, Languages
• Dep. of Energy
– Charm++ (load balancing, fault tolerance), Quantum Chemistry
• National Institutes of Health
– NAMD
• NCSA/NSF, NCSA/IACAT
– Blue Waters project, applications
• Dep. of Energy / UIUC Rocket Center
– AMPI, applications
• NASA
– Cosmology/Visualization
Obrigado! (Thank you!)