Preparing for Petascale and Beyond
Celso L. Mendes http://charm.cs.uiuc.edu/people/cmendes
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
Presentation Outline
• Present Status
– HPC Landscape, Petascale, Exascale
• Parallel Programming Lab
– Mission and approach
– Programming methodology
– Scalability results for S&E applications
– Other extensions and opportunities
– Some ongoing research directions
• Happening at Illinois
– Blue Waters, NCSA/IACAT
– Intel/Microsoft, NVIDIA, HP/Intel/Yahoo!, …
Current HPC Landscape
• Petascale era started!
– Roadrunner@LANL (#1 in Top500):
• Linpack: 1.026 Pflops, Peak: 1.375 Pflops
– Heterogeneous systems starting to spread (Cell, GPUs, …)
– Multicore processors widely used
– Current trends:
(Top500 performance trend chart; source: top500.org)
Current HPC Landscape (cont.)
• Processor counts:
– #1 Roadrunner@LANL: 122K
– #2 BG/L@LLNL: 212K
– #3 BG/P@ANL: 163K
• Exascale: sooner than we imagine…
– U.S. Dep. of Energy town hall meetings in 2007:
• LBNL (April), ORNL (May), ANL (August)
• Goals: discuss exascale possibilities and how to accelerate progress toward them
• Sections:
– Climate, Energy, Biology, Socioeconomic Modeling, Astrophysics, Math & Algorithms, Software, Hardware
• Report: http://www.er.doe.gov/ASCR/ProgramDocuments/TownHall.pdf
Current HPC Landscape (cont.)
• Current reality:
– Steady increase in processor counts
– Systems become multicore or heterogeneous
– “Memory wall” effects worsening
– MPI programming model still dominant
• Challenges (now and into the foreseeable future):
– How to exploit the power of new systems
– Capacity vs. capability: different problems
• Capacity is a concern for system managers
• Capability is a concern for users
– How to program in parallel effectively
• Both multicore (desktop) and million-core (supercomputers)
Parallel Programming Lab
Parallel Programming Lab - PPL
• http://charm.cs.uiuc.edu
• One of the largest research groups at Illinois
• Currently:
– 1 faculty, 3 research scientists, 4 research programmers
– 13 grad students, 1 undergrad student
– Open positions
(PPL group photo, April 2008)
PPL Mission and Approach
• To enhance Performance and Productivity in programming complex parallel applications
– Performance: scalable to thousands of processors
– Productivity: of human programmers
– Complex: irregular structure, dynamic variations
• Application-oriented yet CS-centered research
– Develop enabling technology for a wide collection of applications
– Develop, use and test it in the context of real applications
– Embody it into easy-to-use abstractions
– Implementation: Charm++
• Object-oriented runtime infrastructure
• Freely available for non-commercial use
Application-Oriented Parallel Abstractions
(Diagram: applications such as NAMD, ChaNGa, LeanCP, space-time meshing and rocket simulation exchange issues and techniques & libraries with Charm++ and other applications)
Synergy between Computer Science research and applications has been beneficial to both
Programming Methodology
Methodology: Migratable Objects
• Programmer: [over]decomposition into objects (“virtual processors” - VPs)
• Runtime: assigns VPs to real processors dynamically, during execution
(Diagram: user view of many VPs vs. system implementation on real processors)
• Enables adaptive runtime strategies
• Implementations: Charm++, AMPI

Benefits of Virtualization
• Software engineering
– Number of virtual processors can be independently controlled
– Separate VP sets for different modules in an application
• Message-driven execution
– Adaptive overlap of computation/communication
• Dynamic mapping
– Heterogeneous clusters: vacate, adjust to speed, share
– Automatic checkpointing
– Change set of processors used
– Automatic dynamic load balancing
– Communication optimization
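As a concrete, purely illustrative companion to the slide above (not code from the talk), the following minimal Charm++ sketch creates a 1D chare array with four objects ("virtual processors") per physical processor and lets the runtime place them; the module and class names (sketch, Main, Worker) are invented for this example.

    // --- sketch.ci (Charm++ interface file; all names are illustrative) ---
    mainmodule sketch {
      readonly CProxy_Main mainProxy;
      mainchare Main {
        entry Main(CkArgMsg *m);
        entry void done();
      };
      array [1D] Worker {
        entry Worker();
        entry void compute();
      };
    };

    // --- sketch.C ---
    #include "sketch.decl.h"
    /*readonly*/ CProxy_Main mainProxy;

    class Main : public CBase_Main {
      int remaining;                       // workers that still have to report back
     public:
      Main(CkArgMsg *m) {
        delete m;
        remaining = 4 * CkNumPes();        // over-decompose: 4 objects ("VPs") per processor
        mainProxy = thisProxy;
        CProxy_Worker workers = CProxy_Worker::ckNew(remaining);
        workers.compute();                 // broadcast an entry-method invocation to all elements
      }
      void done() { if (--remaining == 0) CkExit(); }
    };

    class Worker : public CBase_Worker {
     public:
      Worker() {}
      Worker(CkMigrateMessage *m) {}       // needed so the runtime can migrate this object
      void compute() {
        CkPrintf("VP %d running on PE %d\n", thisIndex, CkMyPe());
        mainProxy.done();                  // asynchronous message back to the main chare
      }
    };

    #include "sketch.def.h"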
Adaptive MPI (AMPI): MPI + Virtualization
• Each virtual process implemented as a user-level thread embedded in a Charm object
– Must properly handle globals and statics (analogous to what’s needed in OpenMP)
– But… thread context-switch is much faster than other techniques
(Diagram: MPI “processes” implemented as virtual processors (user-level migratable threads) mapped onto real processors)
Parallel Decomposition and Processors
• MPI-style:
– Encourages decomposition into P pieces, where P is the number of physical processors available
– If the natural decomposition is a cube, then the number of processors must be a cube
– Overlap of computation/communication is a user’s responsibility
• Charm++/AMPI style: “virtual processors”
– Decompose into natural objects of the application
– Let the runtime map them to physical processors
– Decouple decomposition from load balancing
Decomposition independent of numCores
• Rocket simulation example under traditional MPI vs. Charm++/AMPI framework
– Benefits: load balance, communication optimizations, modularity
(Diagram: under MPI, each of processors 1…P holds one Solid and one Fluid piece; under Charm++/AMPI, the solid domain is decomposed into objects Solid1…Solidn and the fluid domain into Fluid1…Fluidm, independently of P)
Dynamic Load Balancing
• Based on Principle of Persistence
– Computational loads and communication patterns tend to persist, even in dynamic computations
– Recent past is a good predictor of near future
• Implementation in Charm++:
– Computational entities (nodes, structured grid points, particles…) are partitioned into objects
– Load from objects may be measured during execution
– Objects are migrated across processors for balancing load
– Much smaller problem than repartitioning entire dataset
– Several available policies for load-balancing decisions
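The fragment below is a hedged sketch, not code from the talk, of the Charm++ hooks an application object typically uses to participate in this measurement-based load balancing; the class and method names (Cell, doTimestep) and the rebalancing frequency are illustrative, and the matching .ci interface file is omitted.

    #include "cell.decl.h"   // generated from the (omitted) cell.ci interface file
    #include <vector>

    class Cell : public CBase_Cell {
      std::vector<double> state;                // per-object data, moves with the object
      int step;
     public:
      Cell() : step(0) { usesAtSync = true; }   // opt in to AtSync() load balancing
      Cell(CkMigrateMessage *m) {}              // migration constructor
      void pup(PUP::er &p) {                    // serialize state so the runtime can migrate us
        p | state;  p | step;
      }
      void doTimestep() {
        // ... compute and exchange boundary data for this object ...
        step++;
        if (step % 100 == 0) AtSync();          // pause here; runtime measures load and may migrate
        else thisProxy[thisIndex].doTimestep(); // otherwise continue with the next step
      }
      void ResumeFromSync() {                   // called after (possible) migration
        thisProxy[thisIndex].doTimestep();
      }
    };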
Typical Load Balancing Phases
(Timeline diagram: regular timesteps, then instrumented timesteps, a detailed/aggressive load-balancing step, and later refinement load balancing)
Examples of Science & Engineering Charm++ Applications
NAMD: A Production MD Program
• Fully featured program
• NIH-funded development
• Distributed free of charge (~20,000 registered users)
• Binaries and source code
• Installed at NSF centers
• 20% cycles (NCSA, PSC)
• User training and support
• Large published simulations
• Gordon Bell award in 2002
• URL: www.ks.uiuc.edu/Research/namd
Spatial Decomposition Via Charm++
• Atoms distributed to cubes based on their location
• Size of each cube: just a bit larger than the cut-off radius
• Communicate only with neighbors
• Work: for each pair of neighboring objects
• Communication-to-computation ratio: O(1)
• However: load imbalance, limited parallelism
(Diagram: cells, cubes or “patches”)
Charm++ is useful to handle this
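As a small worked example of the sizing rule above (illustrative only, not NAMD code), one can choose the patch grid so that every cube edge comes out at least as large as the cut-off radius, which restricts within-cutoff interactions to neighboring cubes:

    #include <algorithm>
    #include <cmath>

    struct PatchGrid { int nx, ny, nz; double dx, dy, dz; };

    PatchGrid makePatchGrid(double lx, double ly, double lz, double cutoff) {
      PatchGrid g;
      g.nx = std::max(1, (int)std::floor(lx / cutoff));  // cubes per dimension
      g.ny = std::max(1, (int)std::floor(ly / cutoff));
      g.nz = std::max(1, (int)std::floor(lz / cutoff));
      g.dx = lx / g.nx;   // resulting edge lengths are "just a bit larger"
      g.dy = ly / g.ny;   // than (or equal to) the cut-off radius
      g.dz = lz / g.nz;
      return g;
    }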
Force Decomposition + Spatial Decomposition
• Now we have many objects over which to apply load balancing:
• Each diamond can be assigned to any processor
• Number of diamonds (3D):
– 14 × the number of patches
• 2-away variation:
– Half-size cubes, 5×5×5 interactions
• 3-away variation: 7×7×7 interactions
• Prototype NAMD versions created for Cell, GPUs
Object-based Parallelization for MD
Performance of NAMD: STMV
(Performance chart vs. number of cores; STMV benchmark, ~1 million atoms)
ChaNGa: Cosmological Simulations
• Collaborative project (NSF ITR)
– With Prof. Tom Quinn, Univ. of Washington
• Components: gravity (done), gas dynamics (almost)
• Barnes-Hut tree code
– Particles represented hierarchically in a tree according to their spatial position
– “Pieces” of the tree distributed across processors
– Gravity computation:
• “Nearby” particles: computed precisely
• “Distant” particles: approximated by the remote node’s center
• Software-caching mechanism, critical for performance
• Multi-timestepping: update frequently only the fastest particles (see Jetley et al., IPDPS’2008)
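The nearby/distant distinction above is the classic Barnes-Hut opening test; the sketch below (generic, not ChaNGa code) shows one common form of it, using an opening angle theta and treating a sufficiently distant node's center as a single body (G = 1 units; all names are illustrative).

    #include <cmath>

    struct TreeNode {
      double cx, cy, cz;      // node center (of mass)
      double mass, size;      // total mass and spatial extent of the node
      TreeNode *child[8];     // children (nullptr where absent)
      bool leaf;
    };

    void gravityWalk(const TreeNode *n, double px, double py, double pz,
                     double theta, double acc[3]) {
      double dx = n->cx - px, dy = n->cy - py, dz = n->cz - pz;
      double r2 = dx*dx + dy*dy + dz*dz + 1e-12;           // softened distance^2
      if (n->leaf || n->size * n->size < theta * theta * r2) {
        double f = n->mass / (r2 * std::sqrt(r2));          // "distant": use the node as one body
        acc[0] += f * dx;  acc[1] += f * dy;  acc[2] += f * dz;
      } else {
        for (int i = 0; i < 8; ++i)                         // "nearby": open the node, recurse
          if (n->child[i]) gravityWalk(n->child[i], px, py, pz, theta, acc);
      }
    }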
ChaNGa Performance
• Results obtained on BlueGene/L
• No multi-timestepping, simple load-balancers
Other Opportunities
MPI Extensions in AMPI
• Automatic load balancing
– MPI_Migrate(): collective operation, possible migration
• Asynchronous collective operations
– e.g. MPI_Ialltoall()
• Post the operation, test/wait for completion; do work in between
• Checkpointing support
– MPI_Checkpoint()
• Checkpoint to disk
– MPI_MemCheckpoint()
• Checkpoint in memory, with remote redundancy
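A hedged sketch of how these extensions might appear in an AMPI time-step loop is shown below; the signatures follow the slide (newer AMPI releases use AMPI_-prefixed names such as AMPI_Migrate), and the loop structure, frequencies and launch command are illustrative.

    // Typically launched with virtualization, e.g.
    //   ./charmrun +p16 ./a.out +vp64     (64 MPI "processes" on 16 processors)
    #include <mpi.h>

    void timestepLoop(int nsteps, double *sendbuf, double *recvbuf,
                      int count, MPI_Comm comm) {
      for (int step = 0; step < nsteps; ++step) {
        MPI_Request req;
        // asynchronous collective: post it, do independent work, then wait
        MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                      recvbuf, count, MPI_DOUBLE, comm, &req);
        // ... local computation that does not need the exchanged data ...
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        if (step % 100 == 0)
          MPI_Migrate();         // collective; the runtime may rebalance ranks here
        if (step % 1000 == 0)
          MPI_MemCheckpoint();   // in-memory checkpoint with remote redundancy
      }
    }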
Performance Tuning for Future Machines
• For example, Blue Waters will arrive in 2011
– But we need to prepare applications for it, starting now
• Even for extant machines:
– Full-size machine may not be available as often as needed for tuning runs
• A simulation-based approach is needed
• Our approach: BigSim
– Based on Charm++ virtualization approach
– Full-scale program emulation
– Trace-driven Simulation
– History: developed for BlueGene predictions
BigSim Simulation System
• General system organization
• Emulation:
– Run an existing, full-scale MPI, AMPI or Charm++ application
– Uses an emulation layer that pretends to be (say) 100k cores
• Target cores are emulated as Charm++ virtual processors
– Resulting traces (aka logs):
• Characteristics of SEBs (Sequential Execution Blocks)
• Dependences between SEBs and messages
BigSim Simulation System (cont.)
• Trace-driven parallel simulation
– Typically run on tens to hundreds of processors
– Multiple resolution simulation of sequential execution:
• from simple scaling factor to cycle-accurate modeling
– Multiple resolution simulation of the Network:
• from simple latency/bw model to detailed packet and switching port level modeling
– Generates timing traces just as a real application would on the full-scale machine
• Phase 3: Analyze performance
– Identify bottlenecks, even without predicting exact performance
– Carry out various “what-if” analyses
Projections: Performance Visualization
BigSim Validation: BG/L Predictions
(Chart: NAMD ApoA1 benchmark; actual vs. predicted execution time, in seconds, for 128 to 2,250 simulated processors)
Some Ongoing Research Directions
Load Balancing for Large Machines: I
• Centralized balancers achieve best balance
– Collect object-communication graph on one processor
– But won’t scale beyond tens of thousands of nodes
• Fully distributed load balancers
– Avoid the bottleneck, but achieve poor load balance
– Not adequately agile
• Hierarchical load balancers
– Careful control of what information goes up and down the hierarchy can lead to fast, high-quality balancers
Load Balancing for Large Machines: II
• Interconnection topology starts to matter again
– Was hidden due to wormhole routing etc.
– Latency variation is still small...
– But bandwidth occupancy (link contention) is a problem
• Topology-aware load balancers
– Some general heuristics have shown good performance
• But may require too much compute power
– Also, special-purpose heuristics work fine when applicable
– Preliminary results:
• see Bhatele & Kale’s paper, LSPP@IPDPS’2008
– Still, many open challenges
Major Challenges in Applications
• NAMD:
– Scalable PME (long range forces) – 3D FFT
• Specialized balancers for multi-resolution cases
– Ex: ChaNGa running highly-clustered cosmological datasets and multi-timestepping
(Processor-activity timelines, processor vs. time, black = processor activity: (a) single-stepping, (b) multi-timestepping, (c) multi-timestepping + special load balancing)
BigSim: Challenges
• BigSim’s simple diagram hides many complexities
• Emulation:
– Automatic out-of-core support for applications with large memory footprints
• Simulation:
– Accuracy vs. cost tradeoffs
– Interpolation mechanisms for prediction of serial performance
– Memory management optimizations
– I/O optimizations for handling (many) large trace files
• Performance analysis:
– Need scalable tools
• Active area of research
Fault Tolerance
• Automatic checkpointing
– Migrate objects to disk
– In-memory checkpointing as an option
– Both schemes above are available in Charm++
• Proactive fault handling
– Migrate objects to other processors upon detecting an imminent fault
– Adjust processor-level parallel data structures
– Rebalance load after migrations
– HiPC’07 paper: Chakravorty et al.
• Scalable fault tolerance
– When a processor out of 100,000 fails, the other 99,999 shouldn’t have to roll back to their checkpoints!
– Sender-side message logging
– Restart can be sped up by spreading out objects from the failed processor
– IPDPS’07 paper: Chakravorty & Kale
– Ongoing effort to minimize logging protocol overheads
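Continuing the earlier illustrative Main chare sketch, the fragment below shows how the two checkpointing modes above can be triggered with the Charm++ calls CkStartCheckpoint (to disk) and CkStartMemCheckpoint (in memory); the entry-method names and checkpoint directory are invented, and the corresponding .ci declarations are omitted.

    // Members of the (illustrative) Main chare; resumeAfterCheckpoint is assumed
    // to be declared as an entry method in the omitted interface file.
    void Main::requestCheckpoint(bool toDisk) {
      CkCallback cb(CkIndex_Main::resumeAfterCheckpoint(), mainProxy);
      if (toDisk)
        CkStartCheckpoint("ckpt", cb);     // objects are packed (via pup) and written to disk
      else
        CkStartMemCheckpoint(cb);          // checkpoint kept in memory, with a remote copy
    }

    void Main::resumeAfterCheckpoint() {
      // continue the application (also the resume point after a restart)
    }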
Higher Level Languages & Interoperability
HPC at Illinois
HPC at Illinois
• Many other exciting developments
– Microsoft/Intel parallel computing research center
– Parallel Programming Classes
• CS-420: Parallel Programming for Science and Engineering
• ECE-498: NVIDIA/ECE collaboration
– HP/Intel/Yahoo! Institute
– NCSA’s Blue Waters system approved for 2011
• see http://www.ncsa.uiuc.edu/BlueWaters/
– NCSA/IACAT new institute
• see http://www.iacat.uiuc.edu/
Microsoft/Intel UPCRC
• Universal Parallel Computing Research Center
• 5-year funding, 2 centers:
– Univ. of Illinois & Univ. of California, Berkeley
• Joint effort by Intel/Microsoft: $2M/year
• Mission:
– Conduct research to make parallel programming broadly accessible and “easy”
• Focus areas:
– Programming, Translation, Execution, Applications
• URL: http://www.upcrc.illinois.edu/
Parallel Programming Classes
• CS-420: Parallel Programming
– Introduction to fundamental issues in parallelism
– Students from both CS and other engineering areas
– Offered every semester, by CS Profs. Kale or Padua
• ECE-498: Programming Massively Parallel Processors
– Focus on GPU programming techniques
– ECE Prof. Wen-Mei Hwu
– NVIDIA’s Chief Scientist David Kirk
– URL: http://courses.ece.uiuc.edu/ece498/al1
HP/Intel/Yahoo! Initiative
• Cloud Computing Testbed, worldwide
• Goal:
– Study Internet-scale systems, focusing on data-intensive applications using distributed computational resources
• Areas of study:
– Networking, OS, virtual machines, distributed systems, data mining, Web search, network measurement, and multimedia
• Illinois/CS testbed site:
– 1,024-core HP system with 200 TB of disk space
– External access via an upcoming proposal selection process
• URL: http://www.hp.com/hpinfo/newsroom/press/2008/080729xa.html
Our Sponsors
PPL Funding Sources
• National Science Foundation
– BigSim, Cosmology, Languages
• Dep. of Energy
– Charm++ (load balancing, fault tolerance), Quantum Chemistry
• National Institutes of Health
– NAMD
• NCSA/NSF, NCSA/IACAT
– Blue Waters project, applications
• Dep. of Energy / UIUC Rocket Center
– AMPI, applications
• NASA
– Cosmology/Visualization
Obrigado! (Thank you!)