
Preparing for Petascale and Beyond

Celso L. Mendes http://charm.cs.uiuc.edu/people/cmendes

Parallel Programming Laboratory

Department of Computer Science

University of Illinois at Urbana-Champaign


Presentation Outline
• Present Status
– HPC Landscape, Petascale, Exascale
• Parallel Programming Lab
– Mission and approach
– Programming methodology
– Scalability results for S&E applications
– Other extensions and opportunities
– Some ongoing research directions
• Happening at Illinois
– Blue Waters, NCSA/IACAT
– Intel/Microsoft, NVIDIA, HP/Intel/Yahoo!, …



Current HPC Landscape
• Petascale era started!
– Roadrunner@LANL (#1 in Top500):
• Linpack: 1.026 Pflops, Peak: 1.375 Pflops
– Heterogeneous systems starting to spread (Cell, GPUs, …)
– Multicore processors widely used
– Current trends: Top500 performance growth chart (source: top500.org)



Current HPC Landscape (cont.)
• Processor counts:
– #1 Roadrunner@LANL: 122K
– #2 BG/L@LLNL: 212K
– #3 BG/P@ANL: 163K
• Exascale: sooner than we imagine…
– U.S. Dep. of Energy town hall meetings in 2007:
• LBNL (April), ORNL (May), ANL (August)
• Goals: discuss exascale possibilities and how to accelerate the path to it
• Sections: Climate, Energy, Biology, Socioeconomic Modeling, Astrophysics, Math & Algorithms, Software, Hardware
• Report: http://www.er.doe.gov/ASCR/ProgramDocuments/TownHall.pdf



Current HPC Landscape (cont.)
• Current reality:
– Steady increase in processor counts
– Systems becoming multicore or heterogeneous
– "Memory wall" effects worsening
– MPI programming model still dominant
• Challenges (now and into the foreseeable future):
– How to exploit the new systems' power
– Capacity vs. Capability: different problems
• Capacity is a concern for system managers
• Capability is a concern for users
– How to program in parallel effectively
• Both multicore (desktop) and million-core (supercomputers)

Parallel Programming Lab


Parallel Programming Lab - PPL
• http://charm.cs.uiuc.edu
• One of the largest research groups at Illinois
• Currently:
– 1 faculty member, 3 research scientists, 4 research programmers
– 13 grad students, 1 undergrad student
– Open positions

PPL, April 2008


PPL Mission and Approach
• To enhance Performance and Productivity in programming complex parallel applications
– Performance: scalable to thousands of processors
– Productivity: of human programmers
– Complex: irregular structure, dynamic variations
• Application-oriented yet CS-centered research
– Develop enabling technology for a wide collection of apps.
– Develop, use and test it in the context of real applications
– Embody it in easy-to-use abstractions
– Implementation: Charm++
• Object-oriented runtime infrastructure
• Freely available for non-commercial use

Application-Oriented Parallel Abstractions
Diagram: applications (NAMD, ChaNGa, LeanCP, space-time meshing, rocket simulation, and others) feed issues into Charm++, which feeds techniques & libraries back to them.
Synergy between Computer Science research and applications has been beneficial to both.


Programming Methodology


Methodology: Migratable Objects
Diagram: user view (a collection of objects) vs. system implementation (objects mapped onto physical processors by the runtime).
• Programmer: [over] decomposition into objects ("virtual processors" - VPs)
• Runtime: assigns VPs to real processors dynamically, during execution
• Enables adaptive runtime strategies
• Implementations: Charm++, AMPI

Benefits of Virtualization
• Software engineering
– Number of virtual processors can be independently controlled
– Separate VP sets for different modules in an application
• Message-driven execution
– Adaptive overlap of computation/communication
• Dynamic mapping
– Heterogeneous clusters
• Vacate, adjust to speed, share
– Automatic checkpointing
– Change set of processors used
– Automatic dynamic load balancing
– Communication optimization
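To make the over-decomposition idea concrete, here is a minimal sketch of this programming style in Charm++; the module, class, and entry-method names are illustrative (not from the talk), and the element count is deliberately larger than the number of cores:

```cpp
// hello.ci (assumed interface file):
// mainmodule hello {
//   readonly CProxy_Main mainProxy;
//   mainchare Main { entry Main(CkArgMsg *m); entry void done(); };
//   array [1D] Hello { entry Hello(); entry void sayHi(); };
// };
#include "hello.decl.h"

/*readonly*/ CProxy_Main mainProxy;

class Main : public CBase_Main {
  int pending;
public:
  Main(CkArgMsg *m) {
    delete m;
    const int numElements = 64;              // over-decomposition: many more objects than cores
    pending = numElements;
    mainProxy = thisProxy;
    CProxy_Hello arr = CProxy_Hello::ckNew(numElements);
    arr.sayHi();                             // broadcast an entry-method invocation to all elements
  }
  void done() { if (--pending == 0) CkExit(); }
};

class Hello : public CBase_Hello {
public:
  Hello() {}
  Hello(CkMigrateMessage *m) {}
  void sayHi() {
    CkPrintf("Element %d running on PE %d\n", thisIndex, CkMyPe());
    mainProxy.done();
  }
};

#include "hello.def.h"
```

The key point is that the programmer names `numElements` from the problem, not from the machine; the runtime decides which PE each element runs on and may move elements later.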


Adaptive MPI (AMPI): MPI + Virtualization
• Each virtual process is implemented as a user-level thread embedded in a Charm++ object
– Must properly handle globals and statics (analogous to what's needed in OpenMP)
– But… thread context-switch is much faster than with other techniques
Diagram: MPI "processes" implemented as virtual processes (user-level migratable threads) mapped onto real processors.
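One common way to satisfy the "handle globals and statics" requirement is to move mutable global state into a per-rank struct that is created in main() and passed down explicitly; a minimal sketch with hypothetical names (not AMPI's own mechanism, just the code pattern it expects):

```cpp
#include <mpi.h>
#include <vector>

// Per-rank state kept in a struct instead of file-scope globals, so several
// AMPI virtual processes can safely share one OS process.
struct RankState {
  int rank = 0;
  std::vector<double> field;   // formerly a global array
};

static void compute_step(RankState &st) {
  for (double &x : st.field) x *= 0.5;   // placeholder work
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  RankState st;
  MPI_Comm_rank(MPI_COMM_WORLD, &st.rank);
  st.field.assign(1000, 1.0);
  for (int step = 0; step < 10; ++step)
    compute_step(st);
  MPI_Finalize();
  return 0;
}
```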


Parallel Decomposition and Processors
• MPI-style:
– Encourages decomposition into P pieces, where P is the number of physical processors available
– If the natural decomposition is a cube, then the number of processors must be a cube
– Overlap of computation/communication is the user's responsibility
• Charm++/AMPI style: "virtual processors"
– Decompose into the natural objects of the application
– Let the runtime map them to physical processors
– Decouple decomposition from load balancing


Decomposition independent of numCores
• Rocket simulation example under traditional MPI vs. the Charm++/AMPI framework
– Benefits: load balance, communication optimizations, modularity
Diagram: under MPI, each of the P processors holds one Solid/Fluid pair; under Charm++/AMPI, n Solid objects and m Fluid objects are decomposed independently and mapped to processors by the runtime.


Dynamic Load Balancing
• Based on the Principle of Persistence
– Computational loads and communication patterns tend to persist, even in dynamic computations
– The recent past is a good predictor of the near future
• Implementation in Charm++:
– Computational entities (nodes, structured grid points, particles…) are partitioned into objects
– Load from objects may be measured during execution
– Objects are migrated across processors to balance load
– Much smaller problem than repartitioning the entire dataset
– Several available policies for load-balancing decisions (a usage sketch follows below)
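In Charm++ terms, an object opts into this measurement-based scheme by setting usesAtSync and periodically calling AtSync(); the runtime migrates objects using their pup routines and resumes them via ResumeFromSync(). A minimal sketch, assuming a 1D chare array Worker whose entry methods are declared in a matching (hypothetical) worker.ci file:

```cpp
#include "worker.decl.h"   // generated from the assumed worker.ci interface file
#include <vector>

class Worker : public CBase_Worker {
  std::vector<double> data;   // object-local state; migrates with the object
  int iter;
public:
  Worker() : data(100000, 1.0), iter(0) { usesAtSync = true; }
  Worker(CkMigrateMessage *m) {}

  // Serialize state so the runtime can migrate this object between processors.
  void pup(PUP::er &p) {
    CBase_Worker::pup(p);
    p | data;
    p | iter;
  }

  void iterate() {                            // entry method (declared in worker.ci)
    for (double &x : data) x *= 0.999;        // placeholder computation
    ++iter;
    if (iter % 50 == 0) AtSync();             // hand control to the load balancer
    else thisProxy[thisIndex].iterate();      // otherwise keep going
  }

  void ResumeFromSync() {                     // invoked by the runtime after balancing
    thisProxy[thisIndex].iterate();
  }
};

#include "worker.def.h"
```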


Typical Load Balancing Phases
Timeline: regular timesteps, then instrumented timesteps, then a detailed, aggressive load balancing step, followed by periodic refinement load balancing.


Examples of Science & Engineering Charm++ Applications


NAMD: A Production MD Program
• Fully featured program
• NIH-funded development
• Distributed free of charge (~20,000 registered users)
• Binaries and source code
• Installed at NSF centers
• 20% of cycles (NCSA, PSC)
• User training and support
• Large published simulations
• Gordon Bell award in 2002
• URL: www.ks.uiuc.edu/Research/namd


Spatial Decomposition Via Charm++
• Atoms are distributed to cubes (cells, or "patches") based on their location
• Size of each cube: just a bit larger than the cut-off radius
• Communicate only with neighbors
• Work: one compute object for each pair of neighboring patches
• Communication-to-computation ratio: O(1)
• However:
– Load imbalance
– Limited parallelism
Charm++ is useful to handle this.
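As a concrete illustration of the geometric rule above (not NAMD's actual code), the sketch below bins atoms into cubic patches whose edge is slightly larger than the cutoff, so every interaction within the cutoff involves only a patch and its immediate neighbors:

```cpp
#include <array>
#include <cmath>
#include <map>
#include <vector>

struct Atom { double x, y, z; };
using PatchIndex = std::array<int, 3>;   // integer 3D grid coordinates of a patch

// Assign atoms to cubic "patches" whose edge is just a bit larger than the
// cutoff radius; each patch then only interacts with its 26 neighbors.
std::map<PatchIndex, std::vector<Atom>>
decompose(const std::vector<Atom> &atoms, double cutoff, double margin = 0.5) {
  const double side = cutoff + margin;   // patch edge length
  std::map<PatchIndex, std::vector<Atom>> patches;
  for (const Atom &a : atoms) {
    PatchIndex idx = { static_cast<int>(std::floor(a.x / side)),
                       static_cast<int>(std::floor(a.y / side)),
                       static_cast<int>(std::floor(a.z / side)) };
    patches[idx].push_back(a);
  }
  return patches;
}
```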


Force Decomposition + Spatial Decomposition
• Now we have many objects over which to balance load:
– Each "diamond" (patch-pair compute object) can be assigned to any processor
– Number of diamonds (3D): 14 * number of patches (see the counting sketch below)
• 2-away variation: half-size cubes, 5x5x5 interactions
• 3-away interactions: 7x7x7
• Prototype NAMD versions created for Cell, GPUs
Figure: object-based parallelization for MD (patches plus pairwise compute objects).
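The "14 * number of patches" count comes from giving each patch one compute object for itself plus one per unique neighbor pair (26 neighbors, each pair counted once): 1 + 26/2 = 14. A tiny self-contained check of that arithmetic:

```cpp
#include <cstdio>

// Count unique 1-away patch-pair compute objects per patch in 3D: keep only
// one ordering of each neighbor offset (lexicographically non-negative).
int main() {
  int computesPerPatch = 0;
  for (int dx = -1; dx <= 1; ++dx)
    for (int dy = -1; dy <= 1; ++dy)
      for (int dz = -1; dz <= 1; ++dz)
        if (dx > 0 || (dx == 0 && (dy > 0 || (dy == 0 && dz >= 0))))
          ++computesPerPatch;
  std::printf("computes per patch: %d\n", computesPerPatch);  // prints 14
  return 0;
}
```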


Performance of NAMD: STMV
Chart: NAMD scaling with number of cores for the STMV system (~1 million atoms).


ChaNGa: Cosmological Simulations
• Collaborative project (NSF ITR) with Prof. Tom Quinn, Univ. of Washington
• Components: gravity (done), gas dynamics (almost done)
• Barnes-Hut tree code
– Particles are represented hierarchically in a tree according to their spatial position
– "Pieces" of the tree are distributed across processors
– Gravity computation:
• "Nearby" particles: computed precisely
• "Distant" particles: approximated by the remote node's center of mass
• Software-caching mechanism, critical for performance
• Multi-timestepping: update frequently only the fastest particles (see Jetley et al., IPDPS 2008)
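The precise-vs-approximate split is the classic Barnes-Hut opening criterion. A generic sketch of that decision (illustrative only, not ChaNGa's implementation; names and the opening parameter theta are assumptions):

```cpp
#include <cmath>
#include <memory>
#include <vector>

// A tree node: either a leaf or an internal node with children; all carry a
// center of mass, total mass, and the edge length of their bounding cube.
struct Node {
  double cx, cy, cz;     // center of mass
  double mass;
  double size;           // edge length of bounding cube
  std::vector<std::unique_ptr<Node>> children;   // empty => leaf
};

// If the node subtends a small angle (size/distance < theta), use its center
// of mass; otherwise "open" it and recurse into its children.
static void accumulateForce(const Node &n, double px, double py, double pz,
                            double theta, double &fx, double &fy, double &fz) {
  const double dx = n.cx - px, dy = n.cy - py, dz = n.cz - pz;
  const double dist = std::sqrt(dx * dx + dy * dy + dz * dz) + 1e-12;  // softened
  if (n.children.empty() || n.size / dist < theta) {
    const double f = n.mass / (dist * dist * dist);   // G omitted for simplicity
    fx += f * dx; fy += f * dy; fz += f * dz;
  } else {
    for (const auto &c : n.children)
      accumulateForce(*c, px, py, pz, theta, fx, fy, fz);
  }
}
```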


ChaNGa Performance
• Results obtained on BlueGene/L (scaling chart)
• No multi-timestepping, simple load balancers


Other Opportunities


MPI Extensions in AMPI
• Automatic load balancing
– MPI_Migrate(): collective operation, possible migration point
• Asynchronous collective operations
– e.g. MPI_Ialltoall()
– Post the operation, test/wait for completion; do work in between
• Checkpointing support
– MPI_Checkpoint(): checkpoint to disk
– MPI_MemCheckpoint(): checkpoint in memory, with remote redundancy
(A combined usage sketch follows below.)
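A sketch of how these extensions fit into a timestep loop. The extension calls (MPI_Migrate, MPI_MemCheckpoint) are AMPI-specific and are used here with the names and zero-argument forms given on the slide; newer AMPI releases expose some of them under an AMPI_ prefix, and MPI_Ialltoall later became standard in MPI-3 with the signature used below, so treat the exact spellings as assumptions:

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  std::vector<double> sendbuf(size, rank), recvbuf(size);

  for (int step = 0; step < 100; ++step) {
    // Asynchronous collective: post it, overlap local work, then wait.
    MPI_Request req;
    MPI_Ialltoall(sendbuf.data(), 1, MPI_DOUBLE,
                  recvbuf.data(), 1, MPI_DOUBLE, MPI_COMM_WORLD, &req);
    for (double &x : sendbuf) x += 1.0;          // work overlapped with the collective
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (step % 20 == 0) MPI_Migrate();           // AMPI: allow the runtime to rebalance
    if (step % 50 == 0) MPI_MemCheckpoint();     // AMPI: in-memory checkpoint
  }
  MPI_Finalize();
  return 0;
}
```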


Performance Tuning for Future Machines
• For example, Blue Waters will arrive in 2011
– But we need to prepare applications for it, starting now
• Even for extant machines:
– The full-size machine may not be available as often as needed for tuning runs
• A simulation-based approach is needed
• Our approach: BigSim
– Based on the Charm++ virtualization approach
– Full-scale program emulation
– Trace-driven simulation
– History: developed for Blue Gene predictions


BigSim Simulation System
• General system organization (see diagram)
• Emulation:
– Run an existing, full-scale MPI, AMPI or Charm++ application
– Uses an emulation layer that pretends to be (say) 100K cores
• Target cores are emulated as Charm++ virtual processors
– Resulting traces (aka logs) capture:
• Characteristics of SEBs (Sequential Execution Blocks)
• Dependences between SEBs and messages

BigSim Simulation System (cont.)
• Trace-driven parallel simulation
– Typically run on tens to hundreds of processors
– Multiple-resolution simulation of sequential execution:
• from a simple scaling factor to cycle-accurate modeling
– Multiple-resolution simulation of the network:
• from a simple latency/bandwidth model to detailed packet- and switching-port-level modeling
– Generates timing traces just as a real application on the full-scale machine would
• Phase 3: Analyze performance
– Identify bottlenecks, even without predicting exact performance
– Carry out various "what-if" analyses


Projections: Performance Visualization


BigSim Validation: BG/L Predictions
Chart: NAMD ApoA1, actual vs. predicted execution time (seconds) for 128 to 2250 simulated processors.


Some Ongoing Research Directions


Load Balancing for Large Machines: I
• Centralized balancers achieve the best balance
– Collect the object-communication graph on one processor
– But they won't scale beyond tens of thousands of nodes
• Fully distributed load balancers
– Avoid the bottleneck, but achieve poor load balance
– Not adequately agile
• Hierarchical load balancers
– Careful control of what information goes up and down the hierarchy can lead to fast, high-quality balancers


Load Balancing for Large Machines: II
• Interconnection topology starts to matter again
– It had been hidden by wormhole routing, etc.
– Latency variation is still small…
– But bandwidth occupancy (link contention) is a problem
• Topology-aware load balancers
– Some general heuristics have shown good performance
• But they may require too much compute power
– Also, special-purpose heuristics work fine when applicable
– Preliminary results: see Bhatele & Kale's paper, LSPP@IPDPS'2008
– Still, many open challenges


Major Challenges in Applications
• NAMD:
– Scalable PME (long-range forces) – 3D FFT
• Specialized balancers for multi-resolution cases
– Ex: ChaNGa running highly-clustered cosmological datasets with multi-timestepping
Figure (processor activity over time, black = busy): (a) single-stepping, (b) multi-timestepping, (c) multi-timestepping + special load balancing.


BigSim: Challenges
• BigSim's simple diagram hides many complexities
• Emulation:
– Automatic out-of-core support for applications with large memory footprints
• Simulation:
– Accuracy vs. cost tradeoffs
– Interpolation mechanisms for prediction of serial performance
– Memory management optimizations
– I/O optimizations for handling (many) large trace files
• Performance analysis:
– Need scalable tools
• Active area of research

Fault Tolerance
• Automatic checkpointing
– Migrate objects to disk
– In-memory checkpointing as an option
– Both schemes are available in Charm++ (see the sketch below)
• Proactive fault handling
– Migrate objects to other processors upon detecting an imminent fault
– Adjust processor-level parallel data structures
– Rebalance load after migrations
– HiPC'07 paper: Chakravorty et al.
• Scalable fault tolerance
– When one processor out of 100,000 fails, the other 99,999 shouldn't have to roll back to their checkpoints!
– Sender-side message logging
– Restart can be sped up by spreading out objects from the failed processor
– IPDPS'07 paper: Chakravorty & Kale
– Ongoing effort to minimize logging protocol overheads
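A minimal sketch of driving the two checkpoint schemes from application code, assuming the Charm++ checkpoint API (CkStartCheckpoint for disk, CkStartMemCheckpoint for in-memory); the module name, entry methods, and checkpoint intervals are hypothetical:

```cpp
#include "jobmain.decl.h"   // assumes a mainmodule "jobmain" declaring Main,
                            // nextStep() and resumed() as entry methods

class Main : public CBase_Main {
  int step;
public:
  Main(CkArgMsg *m) : step(0) { delete m; thisProxy.nextStep(); }

  void nextStep() {
    if (step >= 1000) { CkExit(); return; }
    ++step;
    // ... drive one application step here ...
    if (step % 100 == 0) {
      // Disk checkpoint: objects are packed (via pup) and written to the
      // given directory; the callback resumes execution afterwards.
      CkCallback cb(CkIndex_Main::resumed(), thisProxy);
      CkStartCheckpoint("ckpt_dir", cb);
    } else if (step % 10 == 0) {
      // In-memory checkpoint, with a redundant copy kept on a buddy node.
      CkCallback cb(CkIndex_Main::resumed(), thisProxy);
      CkStartMemCheckpoint(cb);
    } else {
      thisProxy.nextStep();
    }
  }

  void resumed() { thisProxy.nextStep(); }
};

#include "jobmain.def.h"
```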


Higher Level Languages & Interoperability


HPC at Illinois


HPC at Illinois
• Many other exciting developments
– Microsoft/Intel parallel computing research center
– Parallel programming classes
• CS-420: Parallel Programming for Science and Engineering
• ECE-498: NVIDIA/ECE collaboration
– HP/Intel/Yahoo! Institute
– NCSA's Blue Waters system approved for 2011
• see http://www.ncsa.uiuc.edu/BlueWaters/
– New NCSA/IACAT institute
• see http://www.iacat.uiuc.edu/


Microsoft/Intel UPCRC
• Universal Parallel Computing Research Center
• 5-year funding, 2 centers:
– Univ. of Illinois & Univ. of California, Berkeley
• Joint effort by Intel/Microsoft: $2M/year
• Mission:
– Conduct research to make parallel programming broadly accessible and "easy"
• Focus areas:
– Programming, Translation, Execution, Applications
• URL: http://www.upcrc.illinois.edu/


Parallel Programming Classes
• CS-420: Parallel Programming
– Introduction to fundamental issues in parallelism
– Students from both CS and other engineering areas
– Offered every semester, by CS Profs. Kale or Padua
• ECE-498: Programming Massively Parallel Processors
– Focus on GPU programming techniques
– Taught by ECE Prof. Wen-Mei Hwu with NVIDIA's Chief Scientist David Kirk
– URL: http://courses.ece.uiuc.edu/ece498/al1


HP/Intel/Yahoo! Initiative
• Cloud Computing Testbed - worldwide
• Goal:
– Study Internet-scale systems, focusing on data-intensive applications using distributed computational resources
• Areas of study:
– Networking, OS, virtual machines, distributed systems, data mining, Web search, network measurement, and multimedia
• Illinois/CS testbed site:
– 1,024-core HP system with 200 TB of disk space
– External access via an upcoming proposal selection process
• URL: http://www.hp.com/hpinfo/newsroom/press/2008/080729xa.html


Our Sponsors


PPL Funding Sources
• National Science Foundation
– BigSim, Cosmology, Languages
• Dep. of Energy
– Charm++ (Load Balancing, Fault Tolerance), Quantum Chemistry
• National Institutes of Health
– NAMD
• NCSA/NSF, NCSA/IACAT
– Blue Waters project, applications
• Dep. of Energy / UIUC Rocket Center
– AMPI, applications
• NASA
– Cosmology/Visualization


Obrigado! (Thank you!)
