Preparing for Petascale and Beyond Celso L. Mendes http://charm.cs.uiuc.edu/people/cmendes Parallel Programming Laboratory Department of Computer Science University of Illinois at Urbana-Champaign


Page 1:

Preparing for Petascale and Beyond

Celso L. Mendes http://charm.cs.uiuc.edu/people/cmendes

Parallel Programming Laboratory

Department of Computer Science

University of Illinois at Urbana-Champaign

Page 2:

Presentation Outline

• Present Status

– HPC Landscape, Petascale, Exascale

• Parallel Programming Lab

– Mission and approach

– Programming methodology

– Scalability results for S&E applications

– Other extensions and opportunities

– Some ongoing research directions

• Happening at Illinois

– Blue Waters, NCSA/IACAT

– Intel/Microsoft, NVIDIA, HP/Intel/Yahoo!, …

Page 3:

Current HPC Landscape

• Petascale era started!

– Roadrunner@LANL (#1 in Top500):

• Linpack: 1.026 Pflops, Peak: 1.375 Pflops

– Heterogeneous systems starting to spread (Cell, GPUs, …)

– Multicore processors widely used

– Current trends: [Top500 performance-trend chart; source: top500.org]

Page 4:

Current HPC Landscape (cont.)

• Processor counts:

– #1 Roadrunner@LANL: 122K

– #2 BG/L@LLNL: 212K

– #3 BG/P@ANL: 163K

• Exascale: sooner than we imagine…

– U.S. Dep. of Energy town hall meetings in 2007:

• LBNL (April), ORNL (May), ANL (August)

• Goals: discuss exascale possibilities, how to accelerate it

• Sections: Climate, Energy, Biology, Socioeconomic Modeling, Astrophysics, Math & Algorithms, Software, Hardware

• Report: http://www.er.doe.gov/ASCR/ProgramDocuments/TownHall.pdf

Page 5:

Current HPC Landscape (cont.)

• Current reality:

– Steady increase in processor counts

– Systems become multicore or heterogeneous

– “Memory wall” effects worsening

– MPI programming model still dominant

• Challenges (now and into the foreseeable future):

– How to exploit new systems' power

– Capacity vs. capability: different problems

• Capacity is a concern for system managers

• Capability is a concern for users

– How to program in parallel effectively

• Both multicore (desktop) and million-core (supercomputers)

Page 6:

Parallel Programming Lab

Page 7:

Parallel Programming Lab (PPL)

• http://charm.cs.uiuc.edu

• One of the largest research groups at Illinois

• Currently:

– 1 faculty, 3 research scientists, 4 research programmers

– 13 grad students, 1 undergrad student

– Open positions

[Photo: PPL group, April 2008]

Page 8:

PPL Mission and Approach

• To enhance Performance and Productivity in programming complex parallel applications

– Performance: scalable to thousands of processors

– Productivity: of human programmers

– Complex: irregular structure, dynamic variations

• Application-oriented yet CS-centered research

– Develop enabling technology for a wide collection of apps.

– Develop, use and test it in the context of real applications

– Embody it into easy to use abstractions

– Implementation: Charm++

• Object-oriented runtime infrastructure

• Freely available for non-commercial use

Page 9:

Application-Oriented Parallel Abstractions

[Diagram: Charm++ at the center, exchanging issues and techniques & libraries with applications: NAMD, ChaNGa, LeanCP, space-time meshing, rocket simulation, and other applications]

Synergy between Computer Science research and applications has been beneficial to both.

Page 10:

Programming Methodology

Page 11:

Methodology: Migratable Objects

[Diagram: user view of many migratable objects vs. the system implementation mapping those objects onto real processors]

Programmer: [Over] decomposition into objects (“virtual processors” - VPs)

Runtime: Assigns VPs to real processors dynamically, during execution

Enables adaptive runtime strategies

Implementations: Charm++, AMPI

Benefits of Virtualization

• Software engineering

– Number of virtual processors can be independently controlled

– Separate VP sets for different modules in an application

• Message-driven execution

– Adaptive overlap of computation/communication

• Dynamic mapping

– Heterogeneous clusters

• Vacate, adjust to speed, share

– Automatic checkpointing

– Change set of processors used

– Automatic dynamic load balancing

– Communication optimization
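To make the over-decomposition idea concrete, here is a minimal Charm++ sketch under assumed names (module "hello", chare array "Worker"; the .ci interface declarations, normally compiled by charmc, appear as a comment). It only illustrates the pattern described above and is not code from the presentation.

```cpp
// hello.ci (compiled by charmc into hello.decl.h / hello.def.h):
//   mainmodule hello {
//     mainchare Main    { entry Main(CkArgMsg *m); };
//     array [1D] Worker { entry Worker(); entry void work(); };
//   };
#include "hello.decl.h"

class Main : public CBase_Main {
public:
  Main(CkArgMsg *m) {
    int numVPs = 8 * CkNumPes();              // over-decompose: many more objects than cores
    CProxy_Worker workers = CProxy_Worker::ckNew(numVPs);
    workers.work();                           // broadcast an entry method to all elements
    CkExitAfterQuiescence();                  // exit once all pending messages are processed
    delete m;
  }
};

class Worker : public CBase_Worker {
  double state;                               // per-object state, carried along on migration
public:
  Worker() : state(0.0) {}
  Worker(CkMigrateMessage *m) {}              // constructor used when the runtime migrates the object
  void work() {
    CkPrintf("VP %d running on PE %d\n", thisIndex, CkMyPe());
  }
  void pup(PUP::er &p) { p | state; }         // serialization hook used for migration/checkpointing
};

#include "hello.def.h"
```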

Page 12:

Adaptive MPI (AMPI): MPI + Virtualization

• Each virtual process is implemented as a user-level thread embedded in a Charm++ object

– Must properly handle globals and statics (analogous to what's needed in OpenMP)

– But… thread context-switch is much faster than other techniques

[Diagram: MPI "processes" implemented as virtual processors (user-level migratable threads) mapped onto the real processors]
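A plain MPI program runs on AMPI essentially unchanged; the sketch below (file and executable names are made up) shows the idea, with assumed build/run lines in comments: the +vp option asks the runtime for more MPI ranks than physical cores, each rank being a migratable user-level thread.

```cpp
// Build with an AMPI compiler wrapper, e.g.:   ampicxx -o app app.cpp
// Run 32 virtual ranks on 4 physical cores:    ./charmrun +p4 ./app +vp32
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // with +vp32, size == 32 regardless of core count
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // ... the usual per-rank computation and MPI communication ...
  std::printf("virtual rank %d of %d\n", rank, size);

  // Note: as the slide says, global/static variables must be privatized
  // (e.g. moved into per-rank structures), since several ranks share one process.
  MPI_Finalize();
  return 0;
}
```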

Page 13:

Parallel Decomposition and Processors

• MPI-style:

– Encourages decomposition into P pieces, where P is the number of physical processors available

– If the natural decomposition is a cube, then the number of processors must be a cube

– Overlap of comput./communication is a user’s responsibility

• Charm++/AMPI style: "virtual processors"

– Decompose into natural objects of the application

– Let the runtime map them to physical processors

– Decouple decomposition from load balancing

Page 14:

Decomposition independent of numCores

• Rocket simulation example under traditional MPI vs. Charm++/AMPI framework

– Benefits: load balance, communication optimizations, modularity

[Diagram: with traditional MPI, each of the P processors holds one Solid and one Fluid piece; with Charm++/AMPI, Solid1..Solidn and Fluid1..Fluidm objects are created independently of the processor count]

Page 15:

Dynamic Load Balancing

• Based on the Principle of Persistence

– Computational loads and communication patterns tend to persist, even in dynamic computations

– Recent past is a good predictor of near future

• Implementation in Charm++:

– Computational entities (nodes, structured grid points, particles…) are partitioned into objects

– Load from objects may be measured during execution

– Objects are migrated across processors for balancing load

– Much smaller problem than repartitioning entire dataset

– Several available policies for load-balancing decisions
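As an illustration of how an application plugs into this machinery, the hedged Charm++ sketch below uses the common AtSync pattern (class and entry names are invented): the object opts into measurement-based balancing, calls AtSync() at iteration boundaries, and resumes after the runtime has migrated objects; the policy is chosen at launch time (e.g. +balancer GreedyLB).

```cpp
// block.ci would declare something like:
//   array [1D] Block { entry Block(); entry void iterate(); entry void ResumeFromSync(); };
#include "block.decl.h"

class Block : public CBase_Block {
  int step;
public:
  Block() : step(0) { usesAtSync = true; }   // opt in to measurement-based load balancing
  Block(CkMigrateMessage *m) {}
  void pup(PUP::er &p) { p | step; }         // the object's state migrates with it

  void iterate() {
    // ... one timestep of work; the runtime measures its cost automatically ...
    if (++step % 20 == 0) AtSync();          // every 20 steps, let the balancer migrate objects
    else thisProxy[thisIndex].iterate();     // otherwise continue with the next step
  }
  void ResumeFromSync() {                    // invoked after migration decisions are applied
    thisProxy[thisIndex].iterate();
  }
};

#include "block.def.h"
// A main chare (omitted) creates the array and broadcasts iterate().
// Launch while selecting a policy:  ./charmrun +p64 ./app +balancer GreedyLB
```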

Page 16:

Typical Load Balancing Phases

[Timeline: regular timesteps, then instrumented timesteps, a detailed/aggressive load-balancing step, and later refinement load-balancing steps]

Page 17:

Examples of Science & Engineering Charm++ Applications

Page 18:

NAMD: A Production MD Program

• Fully featured program

• NIH-funded development

• Distributed free of charge (~20,000 registered users)

• Binaries and source code

• Installed at NSF centers

• 20% of cycles (NCSA, PSC)

• User training and support

• Large published simulations

• Gordon Bell Award in 2002

• URL: www.ks.uiuc.edu/Research/namd

Page 19:

Spatial Decomposition Via Charm++

• Atoms distributed to cubes based on their location

• Size of each cube:

– Just a bit larger than the cut-off radius

– Communicate only with neighbors

• Work: for each pair of neighboring objects

• Communication-to-computation ratio: O(1)

• However:

– Load imbalance

– Limited parallelism

Cells, Cubes or “Patches”

Charm++ is useful to handle this
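"Just a bit larger than the cut-off radius" translates into a simple patch-grid computation; the sketch below is only illustrative (box size and cut-off are made-up numbers), not NAMD's actual code.

```cpp
#include <cmath>
#include <cstdio>

// Pick a patch grid so that each patch edge is >= the cut-off radius.
// Then every atom pair within the cut-off lies in the same patch or in
// neighboring patches, so each patch talks only to its (up to) 26 neighbors.
int main() {
  const double box[3] = {120.0, 120.0, 120.0};   // simulation box (Angstroms), illustrative
  const double cutoff = 12.0;                    // cut-off radius, illustrative

  int dim[3];
  double edge[3];
  for (int i = 0; i < 3; ++i) {
    dim[i] = (int)std::floor(box[i] / cutoff);   // as many patches as fit along this axis...
    if (dim[i] < 1) dim[i] = 1;
    edge[i] = box[i] / dim[i];                   // ...so each edge is >= cutoff ("a bit larger")
  }
  std::printf("patch grid %d x %d x %d, patch edges %.2f x %.2f x %.2f\n",
              dim[0], dim[1], dim[2], edge[0], edge[1], edge[2]);
  return 0;
}
```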

Page 20:

Force Decomposition + Spatial Decomposition

• Now we have many objects over which to apply load balancing:

• Each diamond can be assigned to any processor

• Number of diamonds (3D):

– 14 × number of patches (each patch computes with itself plus half of its 26 neighbors: 1 + 13 = 14)

• 2-away variation:

– Half-size cubes, 5x5x5 interactions

• 3-away: 7x7x7 interactions

• Prototype NAMD versions created for Cell, GPUs

Object-based Parallelization for MD

Page 21:

Performance of NAMD: STMV

[Chart: NAMD performance vs. number of cores for STMV (~1 million atoms)]

Page 22:

Page 23:

ChaNGa: Cosmological Simulations

• Collaborative project (NSF ITR)

– With Prof. Tom Quinn, Univ. of Washington

• Components: gravity (done), gas dynamics (almost)

• Barnes-Hut tree code

– Particles represented hierarchically in a tree according to their spatial position

– "Pieces" of the tree distributed across processors

– Gravity computation (see the sketch after this list):

• "Nearby" particles: computed precisely

• "Distant" particles: approximated by the remote node's center

• Software-caching mechanism, critical for performance

• Multi-timestepping: update frequently only the fastest particles (see Jetley et al., IPDPS'2008)
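The "nearby vs. distant" rule above is the classic Barnes-Hut opening test. The sketch below shows it in generic C++ (the types and the theta parameter are illustrative, not ChaNGa's actual data structures): a node that looks small from the particle's position is approximated by its center of mass; otherwise the walk descends into its children.

```cpp
#include <cmath>

struct Vec { double x, y, z; };

struct TreeNode {
  Vec    center;          // center of mass of the particles under this node
  double mass;            // total mass under this node
  double size;            // linear extent of the node's bounding box
  TreeNode *child[8];     // octree children (nullptr at leaves)
  // ... leaf particle list omitted ...
};

// Classic opening criterion: if the node looks "small" from position p
// (size/distance < theta), treat it as a single far-away mass.
inline bool farEnough(const Vec &p, const TreeNode &n, double theta) {
  const double dx = n.center.x - p.x, dy = n.center.y - p.y, dz = n.center.z - p.z;
  const double dist = std::sqrt(dx*dx + dy*dy + dz*dz);
  return n.size < theta * dist;
}

Vec gravityFrom(const Vec &p, const TreeNode &n, double theta) {
  Vec acc{0, 0, 0};
  if (farEnough(p, n, theta)) {
    // "Distant": approximate the whole subtree by its center (of mass)
    const double dx = n.center.x - p.x, dy = n.center.y - p.y, dz = n.center.z - p.z;
    const double r2 = dx*dx + dy*dy + dz*dz + 1e-12;      // softening to avoid r = 0
    const double f  = n.mass / (r2 * std::sqrt(r2));      // gravitational constant omitted
    acc.x = f * dx; acc.y = f * dy; acc.z = f * dz;
  } else {
    // "Nearby": descend into children (leaves would sum particle-particle forces exactly)
    for (TreeNode *c : n.child) {
      if (!c) continue;
      Vec a = gravityFrom(p, *c, theta);
      acc.x += a.x; acc.y += a.y; acc.z += a.z;
    }
  }
  return acc;
}
```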

Page 24:

ChaNGa Performance

• Results obtained on BlueGene/L

• No multi-timestepping, simple load-balancers

Page 25:

Other Opportunities

Page 26:

MPI Extensions in AMPI

• Automatic load balancing

– MPI_Migrate(): collective operation, possible migration point

• Asynchronous collective operations

– e.g. MPI_Ialltoall()

• Post the operation, test/wait for completion; do work in between (usage sketch below)

• Checkpointing support

– MPI_Checkpoint()

• Checkpoint to disk

– MPI_MemCheckpoint()

• Checkpoint in memory, with remote redundancy
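The extensions above are called like ordinary MPI routines. The hedged sketch below shows the intended usage pattern only: exact AMPI prototypes (argument lists of MPI_Ialltoall, MPI_Checkpoint, etc.) may differ between AMPI versions, and the buffers and directory name are placeholders.

```cpp
#include <mpi.h>   // built with AMPI's compiler wrappers (e.g. ampicxx)

void ampi_extension_examples(double *sendbuf, double *recvbuf, int n) {
  // 1) Automatic load balancing: collective call marking a safe point where
  //    the runtime may migrate this (virtualized) rank to another processor.
  MPI_Migrate();

  // 2) Asynchronous collective: post it, overlap independent work, then wait.
  MPI_Request req;
  MPI_Ialltoall(sendbuf, n, MPI_DOUBLE, recvbuf, n, MPI_DOUBLE, MPI_COMM_WORLD, &req);
  // ... computation that does not touch recvbuf ...
  MPI_Wait(&req, MPI_STATUS_IGNORE);

  // 3) Checkpointing: to disk, or to the memory of a remote "buddy" rank.
  MPI_Checkpoint((char *)"ckpt_dir");   // directory name illustrative; prototype may vary
  MPI_MemCheckpoint();
}
```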

Page 27:

Performance Tuning for Future Machines

• For example, Blue Waters will arrive in 2011

– But we need to prepare applications for it, starting now

• Even for existing machines:

– The full-size machine may not be available as often as needed for tuning runs

• A simulation-based approach is needed

• Our approach: BigSim

– Based on the Charm++ virtualization approach

– Full-scale program emulation

– Trace-driven simulation

– History: developed for BlueGene predictions

Page 28:

BigSim Simulation System

• General system organization

• Emulation:

– Run an existing, full-scale MPI, AMPI or Charm++ application

– Uses an emulation layer that pretends to be (say) 100k cores

• Target cores are emulated as Charm++ virtual processors

– Resulting traces (aka logs):

• Characteristics of SEBs (Sequential Execution Blocks)

• Dependences between SEBs and messages

Page 29:

BigSim Simulation System (cont.)

• Trace-driven parallel simulation

– Typically run on tens to hundreds of processors

– Multiple resolution simulation of sequential execution:

• from simple scaling factor to cycle-accurate modeling

– Multiple resolution simulation of the Network:

• from simple latency/bw model to detailed packet and switching port level modeling

– Generates timing traces just as a real application would on the full-scale machine

• Phase 3: Analyze performance

– Identify bottlenecks, even without predicting exact performance

– Carry out various "what-if" analyses

Page 30:

Projections: Performance Visualization

Page 31:

BigSim Validation: BG/L Predictions

[Chart: NAMD ApoA1, actual vs. predicted execution time (seconds) for 128 to 2250 simulated BG/L processors]

Page 32:

Some Ongoing Research Directions

Page 33:

Load Balancing for Large Machines: I

• Centralized balancers achieve the best balance

– Collect object-communication graph on one processor

– But won’t scale beyond tens of thousands of nodes

• Fully distributed load balancers

– Avoid the bottleneck, but achieve poor load balance

– Not adequately agile

• Hierarchical load balancers

– Careful control of what information goes up and down the hierarchy can lead to fast, high-quality balancers

Page 34:

Load Balancing for Large Machines: II

• Interconnection topology starts to matter again

– Was hidden due to wormhole routing etc.

– Latency variation is still small...

– But bandwidth occupancy (link contention) is a problem

• Topology-aware load balancers

– Some general heuristics have shown good performance

• But may require too much compute power

– Also, special-purpose heuristics work fine when applicable

– Preliminary results:

• see Bhatele & Kale’s paper, LSPP@IPDPS’2008

– Still, many open challenges

Page 35:

Major Challenges in Applications

• NAMD:

– Scalable PME (long-range forces) – 3D FFT

• Specialized balancers for multi-resolution cases

– Ex: ChaNGa running highly-clustered cosmological datasets with multi-timestepping

[Timelines (processors × time, black = processor activity): (a) single-stepping, (b) multi-timestepping, (c) multi-timestepping + special load balancing]

Page 36:

BigSim: Challenges

• BigSim's simple diagram hides many complexities

• Emulation:

– Automatic out-of-core support for applications with large memory footprints

• Simulation:

– Accuracy vs. cost tradeoffs

– Interpolation mechanisms for prediction of serial performance

– Memory management optimizations

– I/O optimizations for handling (many) large trace files

• Performance analysis:

– Need scalable tools

• Active area of research

Page 37:

Fault Tolerance

• Automatic checkpointing

– Migrate objects to disk

– In-memory checkpointing as an option

– Both schemes above are available in Charm++

• Proactive fault handling

– Migrate objects to other processors upon detecting an imminent fault

– Adjust processor-level parallel data structures

– Rebalance load after migrations

– HiPC’07 paper: Chakravorty et al

• Scalable fault tolerance

– When one processor out of 100,000 fails, the other 99,999 shouldn't have to roll back to their checkpoints!

– Sender-side message logging

– Restart can be sped up by spreading out objects from the failed processor

– IPDPS’07 paper: Chakravorty & Kale

– Ongoing effort to minimize logging protocol overheads
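For reference, a minimal Charm++ sketch of how an application triggers the two checkpointing schemes mentioned above (the Main chare, its checkpointDone entry and the directory name are assumed names, not code from the presentation):

```cpp
// The .ci file would declare the mainchare with an extra entry:  entry void checkpointDone();
#include "main.decl.h"

class Main : public CBase_Main {
public:
  Main(CkArgMsg *m) { delete m; }

  void periodicCheckpoint() {
    CkCallback done(CkIndex_Main::checkpointDone(), thisProxy);

    // Disk checkpoint: every object is serialized through its pup() routine and
    // written under the given directory; restart later with "./app +restart ckpt_dir".
    CkStartCheckpoint("ckpt_dir", done);

    // Alternative: in-memory (double) checkpoint, where each object's state is
    // also copied to a buddy processor, avoiding the disk entirely.
    // CkStartMemCheckpoint(done);
  }

  void checkpointDone() { CkPrintf("checkpoint complete\n"); }
};

#include "main.def.h"
```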

Page 38:

Higher Level Languages & Interoperability

Page 39:

HPC at Illinois

Page 40:

HPC at Illinois

• Many other exciting developments

– Microsoft/Intel parallel computing research center

– Parallel Programming Classes

• CS-420: Parallel Programming for Sci. and Engineering

• ECE-498: NVIDIA/ECE collaboration

– HP/Intel/Yahoo! Institute

– NCSA’s Blue Waters system approved for 2011

• see http://www.ncsa.uiuc.edu/BlueWaters/

– NCSA/IACAT new institute

• see http://www.iacat.uiuc.edu/

Page 41:

Microsoft/Intel UPCRC

• Universal Parallel Computing Research Center

• 5-year funding, 2 centers:

– Univ. of Illinois and Univ. of California, Berkeley

• Joint effort by Intel/Microsoft: $2M/year

• Mission:

– Conduct research to make parallel programming broadly accessible and "easy"

• Focus areas:

– Programming, Translation, Execution, Applications

• URL: http://www.upcrc.illinois.edu/

Page 42:

Parallel Programming Classes

• CS-420: Parallel Programming

– Introduction to fundamental issues in parallelism

– Students from both CS and other engineering areas

– Offered every semester by CS Profs. Kale or Padua

• ECE-498: Programming Massively Parallel Processors

– Focus on GPU programming techniques

– ECE Prof. Wen-Mei Hwu

– NVIDIA's Chief Scientist David Kirk

– URL: http://courses.ece.uiuc.edu/ece498/al1

Page 43:

HP/Intel/Yahoo! Initiative

• Cloud Computing Testbed, worldwide

• Goal:

– Study Internet-scale systems, focusing on data-intensive applications using distributed computational resources

• Areas of study:

– Networking, OS, virtual machines, distributed systems, data mining, Web search, network measurement, and multimedia

• Illinois/CS testbed site:

– 1,024-core HP system with 200 TB of disk space

– External access via an upcoming proposal selection process

• URL: http://www.hp.com/hpinfo/newsroom/press/2008/080729xa.html

Page 44:

Our Sponsors

Page 45:

PPL Funding Sources

• National Science Foundation

– BigSim, Cosmology, Languages

• Dep. of Energy

– Charm++ (load balancing, fault tolerance), quantum chemistry

• National Institutes of Health

– NAMD

• NCSA/NSF, NCSA/IACAT

– Blue Waters project, applications

• Dep. of Energy / UIUC Rocket Center

– AMPI, applications

• NASA

– Cosmology/visualization

Page 46:

Thank you! (Obrigado!)
