
Page 1: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

1

Petascale Programming with Virtual Processors:

Charm++, AMPI, and domain-specific frameworks

Laxmikant Kale
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Dept. of Computer Science
University of Illinois at Urbana-Champaign

Page 2: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

2

Outline

• Challenges and opportunities: character of the new machines

• Charm++ and AMPI
  – Basics
  – Capabilities
• Programming techniques
  – Dice them fine: VPs to the rescue
  – Juggling for overlap
  – Load balancing: scenarios and strategies
• Case studies
  – Classical molecular dynamics
  – Car-Parrinello ab initio MD / quantum chemistry
  – Rocket simulation
• Raising the level of abstraction
  – Higher-level compiler-supported notations
  – Domain-specific "frameworks"
  – Example: unstructured mesh (FEM) framework

Page 3: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

3

Machines: current, planned and future

• Current:
  – Lemieux: 3000 processors, 750 nodes, full-bandwidth fat-tree network
  – ASCI Q: similar architecture
  – System X: InfiniBand
  – Tungsten: Myrinet
  – Thunder
  – Earth Simulator
• Planned:
  – IBM's Blue Gene/L: 65k nodes, 3D-torus topology
  – Red Storm (10k procs)
• Future?
  – BG/L is an example: 1M processors! 0.5 MB per processor
  – HPCS: 3 architectural plans

Page 4: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

4

Some Trends: Communication

• Bisection bandwidth:
  – Can't scale as well with the number of processors without becoming expensive
  – Wire-length delays: even on Lemieux, messages going through the highest-level switches take longer
• Two possibilities:
  – Grid topologies with near-neighbor connections: high link speed, low bisection bandwidth
  – Expensive, full-bandwidth networks

Page 5: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

5

Trends: Memory

• Memory latency is roughly 100 times the processor cycle time, and this will get worse
• A solution: put more processors in
  – To increase the bandwidth between processors and memory
  – On-chip DRAM
  – In other words: a low memory-to-processor ratio
  – But this can be handled with programming style
• Application viewpoint, for physical modeling:
  – Given a fixed amount of run time (4 hours or 10 days)
  – Doubling the spatial resolution increases CPU needs more than 2-fold (smaller time-steps)

Page 6: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

6

Application Complexity is increasing

• Why? With more FLOPS, we need better algorithms
  – Not enough to just do more of the same
  – Example: dendritic growth in materials
  – Better algorithms lead to complex structure
  – Example: gravitational force calculation
    • Direct all-pairs: O(N²), but easy to parallelize
    • Barnes-Hut: O(N log N), but more complex
  – Multiple modules, dual time-stepping
  – Adaptive and dynamic refinements
• Ambitious projects
  – Projects with new objectives lead to dynamic behavior and multiple components

Page 7: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

7

Specific Programming Challenges

• Explicit management of resources:
  – This data on that processor
  – This work on that processor
• Analogy: memory management
  – We declare arrays and malloc dynamic memory chunks as needed
  – We do not specify memory addresses
• As usual, indirection is the key
  – Programmer: this data, partitioned into these pieces; this work, divided that way
  – System: map data and work to processors

Page 8: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

8

Virtualization: Object-based Parallelization

[Figure: the user view shows only interacting objects; the system implementation maps those objects onto processors. The user is concerned only with the interaction between objects.]

• Idea: divide the computation into a large number of objects
  – Let the system map objects to processors

Page 9: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

9

Virtualization: Charm++ and AMPI
• These systems seek an optimal division of labor between the "system" and the programmer:
  – Decomposition done by the programmer
  – Everything else automated

[Figure: a spectrum from specialization to abstraction over the levels decomposition, mapping, scheduling, and expression, with MPI, Charm++, and HPF placed according to how much of this the system automates.]
Page 10: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

10

Charm++ and Adaptive MPI
• Charm++: parallel C++
  – Asynchronous methods
  – Object arrays
  – In development for over a decade
  – Basis of several parallel applications
  – Runs on all popular parallel machines and clusters
• AMPI: a migration path for legacy MPI codes
  – Gives them the dynamic load-balancing capabilities of Charm++
  – Uses Charm++ object arrays
  – Minimal modifications to convert existing MPI programs; automated via AMPizer (collaboration with David Padua)
  – Bindings for C, C++, and Fortran 90

Both available from http://charm.cs.uiuc.edu

Page 11: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

11

Parallel Objects,

Adaptive Runtime System

Libraries and Tools

The enabling CS technology of parallel objects and intelligent Runtime systems has led to several collaborative applications in CSE

Molecular Dynamics

Crack Propagation

Space-time meshes

Computational Cosmology

Rocket Simulation

Protein Folding

Dendritic Growth

Quantum Chemistry (QM/MM)

Page 12: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

12

Message From This Talk

• Virtualization, and the associated techniques we have been exploring for the past decade, is ready and powerful enough to meet the needs of high-end parallel computing and of tomorrow's complex, dynamic applications and machines
• These techniques are embodied in:
  – Charm++
  – AMPI
  – Frameworks (structured grids, unstructured grids, particles)
  – Virtualization of other coordination languages (UPC, GA, ...)

Page 13: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

13

Acknowledgements

• Graduate students including:– Gengbin Zheng

– Orion Lawlor

– Milind Bhandarkar

– Terry Wilmarth

– Sameer Kumar

– Jay deSouza

– Chao Huang

– Chee Wai Lee

• Recent Funding:– NSF (NGS: Frederica Darema)

– DOE (ASCI : Rocket Center)

– NIH (Molecular Dynamics)

Page 14: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

14

Charm++ : Object Arrays

• A collection of data-driven objects (aka chares), – With a single global name for the collection, and

– Each member addressed by an index

– Mapping of element objects to processors handled by the system

A[0] A[1] A[2] A[3] A[..]

User’s view

Page 15: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

15

Charm++ : Object Arrays

• A collection of chares, – with a single global name for the collection, and

– each member addressed by an index

– Mapping of element objects to processors handled by the system

A[0] A[1] A[2] A[3] A[..]

A[3]A[0]

User’s view

System view

Page 16: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

16

Chare Arrays

• Elements are data-driven objects
• Elements are indexed by a user-defined data type: [sparse] 1D, 2D, 3D, tree, ...
• Send messages to an index; receive messages at an element. Reductions and broadcasts across the array
• Dynamic insertion, deletion, migration: and everything still has to work!

Page 17: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

17

Charm++ Remote Method Calls

• To call a method on a remote C++ object foo, use the local “proxy” C++ object CProxy_foo generated from the interface file:

Interface (.ci) file:

  array [1D] foo {
    entry foo(int problemNo);
    entry void bar(int x);
  };

In a .C file (CProxy_foo is the generated proxy class; someFoo[i].bar(17) names the i'th object, the method, and its parameters):

  CProxy_foo someFoo = ...;
  someFoo[i].bar(17);

This results in a network message, and eventually in a call to the real object's method, in another .C file:

  void foo::bar(int x) {
    ...
  }

Page 18: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

18

Charm++ Startup Process: Main

Interface (.ci) file:

  mainmodule myModule {
    array [1D] foo {
      entry foo(int problemNo);
      entry void bar(int x);
    };
    mainchare myMain {
      entry myMain(int argc, char **argv);
    };
  };

In a .C file (myMain is the special startup object; its constructor is called at startup, and CBase_myMain is a generated class):

  #include "myModule.decl.h"
  class myMain : public CBase_myMain {
  public:
    myMain(int argc, char **argv) {
      int nElements = 7, i = nElements/2;
      CProxy_foo f = CProxy_foo::ckNew(2, nElements);
      f[i].bar(3);
    }
  };
  #include "myModule.def.h"

Page 19: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

19

Other Features

• Broadcasts and reductions
• Runtime creation and deletion
• nD and sparse array indexing
• Library support ("modules")
• Groups: per-processor objects
• Node groups: per-node objects
• Priorities: control ordering

Page 20: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

20

AMPI: "Adaptive" MPI
• MPI interface, for C and Fortran, implemented on Charm++
• Multiple "virtual processors" per physical processor
  – Implemented as user-level threads
  – Very fast context switching: ~1 µs
  – E.g., MPI_Recv blocks only the virtual processor, not the physical one
• Supports migration (and hence load balancing) via extensions to MPI

Page 21: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

21

AMPI: 7 MPI processes

Page 22: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

22

AMPI:

Real Processors

7 MPI “processes”

Implemented as virtual processors (user-level migratable threads)

Page 23: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

23

How to Write an AMPI Program
• Write your normal MPI program, and then...
• Link and run with Charm++
  – Compile and link with charmc
    • charmc -o hello hello.c -language ampi
    • charmc -o hello2 hello.f90 -language ampif
  – Run with charmrun
    • charmrun hello
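As a minimal illustration (not from the slides): an ordinary MPI program such as the one below can be built unchanged with the charmc line above; the file name hello.c is the one the compile command assumes.

  #include <stdio.h>
  #include <mpi.h>

  /* A plain MPI "hello" program; under AMPI each rank is a
     virtual processor (user-level thread), not an OS process. */
  int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from virtual processor %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
  }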

Page 24: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

24

How to Run an AMPI program

• Charmrun
  – A portable parallel job execution script
  – Specify the number of physical processors: +pN
  – Specify the number of virtual MPI processes: +vpN
  – Special "nodelist" file for net-* versions
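For example (an illustrative command line; the processor and VP counts are arbitrary):

  ./charmrun +p4 ./hello +vp16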

Page 25: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

25

AMPI MPI Extensions

• Process migration
• Asynchronous collectives
• Checkpoint/restart

Page 26: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

26

How to Migrate a Virtual Processor?
• Move all application state to the new processor
• Stack data
  – Subroutine variables and calls
  – Managed by the compiler
• Heap data
  – Allocated with malloc/free
  – Managed by the user
• Global variables

Page 27: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

27

Stack Data
• The stack is used by the compiler to track function calls and provide temporary storage
  – Local variables
  – Subroutine parameters
  – C "alloca" storage
• Most of the variables in a typical application are stack data

Page 28: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

28

Migrate Stack Data
• Without compiler support, we cannot change the stack's address
  – Because we can't change the stack's interior pointers (return frame pointer, function arguments, etc.)
• Solution: "isomalloc" addresses
  – Reserve address space on every processor for every thread stack
  – Use mmap to scatter stacks in virtual memory efficiently
  – Idea comes from PM2

Page 29: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

29

Migrate Stack Data

[Figure: Processor A's memory and Processor B's memory, each spanning 0x00000000 to 0xFFFFFFFF and holding code, globals, heap, and thread stacks; thread stacks 1-4 live on Processor A, and thread 3 is to be migrated to Processor B.]

Page 30: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

30

Migrate Stack Data

[Figure: after migration, thread 3's stack occupies the same reserved address range on Processor B; threads 1, 2, and 4 remain on Processor A.]

Page 31: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

31

Migrate Stack Data
• Isomalloc is a completely automatic solution
  – No changes needed in the application or compilers
  – Just like a software shared-memory system, but with proactive paging
• But it has a few limitations
  – Depends on having large quantities of virtual address space (best on 64-bit)
    • 32-bit machines can only have a few GB of isomalloc stacks across the whole machine
  – Depends on unportable mmap
    • Which addresses are safe? (We must guess!)
    • What about Windows? Blue Gene?

Page 32: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

32

Heap Data
• Heap data is any dynamically allocated data
  – C "malloc" and "free"
  – C++ "new" and "delete"
  – F90 "ALLOCATE" and "DEALLOCATE"
• Arrays and linked data structures are almost always heap data

Page 33: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

33

Migrate Heap Data
• Automatic solution: isomalloc all heap data, just like stacks!
  – "-memory isomalloc" link option
  – Overrides malloc/free
  – No new application code needed
  – Same limitations as isomalloc stacks
• Manual solution: the application moves its heap data
  – Need to be able to size the message buffer, pack data into the message, and unpack it on the other side
  – The "pup" abstraction does all three (see the sketch below)
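A rough sketch (not from the slides) of what the manual route looks like in Charm++ terms: a single pup routine handles sizing, packing, and unpacking, depending on the mode of the PUP::er it is given. The class and field names here are invented for illustration.

  #include "pup.h"

  // Hypothetical migratable object holding a heap-allocated array.
  class Chunk {
    int n;          // number of elements
    double *data;   // heap data owned by this object
  public:
    Chunk() : n(0), data(0) {}
    ~Chunk() { delete [] data; }

    // One routine sizes, packs, or unpacks, depending on the PUP::er's mode.
    void pup(PUP::er &p) {
      p | n;                       // size/pack/unpack the scalar
      if (p.isUnpacking())         // on the destination processor,
        data = new double[n];      // re-allocate the heap buffer first
      PUParray(p, data, n);        // then size/pack/unpack its contents
    }
  };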

Page 34: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

34

Problem setup: 3D stencil calculation of size 240³, run on Lemieux.
AMPI runs on any number of PEs (e.g., 19, 33, 105). Native MPI needs a cube number of PEs.

[Figure: execution time in seconds (log scale) vs. number of processors (10-1000) for native MPI and AMPI.]

Comparison with Native MPI
• Performance
  – Slightly worse without optimization
  – Being improved
• Flexibility
  – Works when only a small number of PEs is available
  – Or when the algorithm places special requirements on the processor count

Page 35: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

35

Benefits of Virtualization

• Software engineering
  – Number of virtual processors can be independently controlled
  – Separate VPs for different modules
• Message-driven execution
  – Adaptive overlap of communication
  – Modularity
  – Predictability: automatic out-of-core; asynchronous reductions
• Dynamic mapping
  – Heterogeneous clusters: vacate, adjust to speed, share
  – Automatic checkpointing
  – Change the set of processors used
• Principle of persistence
  – Enables runtime optimizations
  – Automatic dynamic load balancing
  – Communication optimizations
  – Other runtime optimizations

More info:

http://charm.cs.uiuc.edu

Page 36: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

36

Data driven execution

[Figure: two processors, each with a scheduler that picks the next message from its message queue and delivers it to the target object on that processor.]

Page 37: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

37

Adaptive Overlap of Communication

• With virtualization, you get data-driven execution
  – There are multiple entities (objects, threads) on each processor
    • No single object or thread holds up the processor
    • Each one is "continued" when its data arrives
  – No need to guess which is likely to arrive first
  – So: achieves automatic and adaptive overlap of computation and communication
• This kind of data-driven idea can be used in MPI as well, using wild-card receives
  – But as the program gets more complex, it gets harder to keep track of all pending communication in all the places that are doing a receive
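To make the data-driven model concrete, here is a highly simplified, invented sketch of the scheduler loop shown in the data-driven execution diagram above; nothing here is Charm++'s actual runtime code.

  #include <queue>

  // Invented types for illustration only.
  struct Message { int objectId; /* entry method id, payload ... */ };
  struct Object  { void invoke(const Message &m) { /* run the entry method */ } };

  // One scheduler instance runs per processor.
  void schedulerLoop(std::queue<Message> &msgQ, Object *objects) {
    while (true) {
      if (msgQ.empty()) continue;        // in reality: poll/block on the network
      Message m = msgQ.front(); msgQ.pop();
      objects[m.objectId].invoke(m);     // whichever object's data arrived runs;
                                         // no single object holds up the processor
    }
  }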

Page 38: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

38

Why Message-Driven Modules ?

SPMD and Message-Driven Modules (From A. Gursoy, Simplified expression of message-driven programs and quantification of their impact on performance, Ph.D Thesis, Apr 1994.)

Page 39: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

39

Checkpoint/Restart
• Any long-running application must be able to save its state
• When you checkpoint an application, it uses the pup routine to store the state of all objects
• State information is saved in a directory of your choosing
• Restore also uses pup, so no additional application code is needed (pup is all you need)

Page 40: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

40

Checkpointing a Job
• In AMPI, use MPI_Checkpoint(<dir>);
  – Collective call; returns when the checkpoint is complete (example below)
• In Charm++, use CkCheckpoint(<dir>, <resume>);
  – Called on one processor; calls <resume> when the checkpoint is complete
• Restarting:
  – The charmrun option ++restart <dir> is used to restart
  – The number of processors need not be the same
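A hedged sketch of how the AMPI extension named on this slide might be used inside a time-stepping loop; the checkpoint interval, directory, and application routine are made up for illustration.

  #include <mpi.h>
  void doTimestep(int step);            // assumed application routine

  void runWithCheckpoints(int nSteps) {
    for (int step = 0; step < nSteps; step++) {
      doTimestep(step);
      if (step % 500 == 0)                 // checkpoint interval: invented
        MPI_Checkpoint("/scratch/ckpt");   // AMPI extension; collective, returns when done
    }
  }
  // Later, restart (possibly on a different processor count) with:
  //   ./charmrun +p64 ./app ++restart /scratch/ckpt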

Page 41: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

41

AMPI’s Collective Communication Support

• Communication operations in which all (or a large subset of) processes participate
  – For example, broadcast
• Often a performance impediment
• All-to-all communication
  – All-to-all personalized communication (AAPC)
  – All-to-all multicast (AAM)

Page 42: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

42

Communication Optimization

Organize processors in a 2D (virtual) mesh.
A message from (x1,y1) to (x2,y2) goes via (x1,y2).
Each processor sends about 2(√P − 1) messages instead of P − 1.
But each byte travels twice on the network.
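A small illustrative sketch (not from the talk) of the two-phase routing rule above, assuming processors are numbered row-major on a √P × √P virtual mesh:

  // A message from processor s to processor d is first sent to the
  // intermediate processor that shares s's row and d's column.
  int intermediate(int s, int d, int side /* side = sqrt(P) */) {
    int sRow = s / side;          // (x1, y1) = (sRow, sCol)
    int dCol = d % side;          // (x2, y2) = (dRow, dCol)
    return sRow * side + dCol;    // route via (x1, y2)
  }

In phase one each processor combines messages headed for the same column (at most √P − 1 sends along its row); in phase two the intermediates forward along columns, giving the 2(√P − 1) message count quoted above.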

Page 43: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

43

Performance Benchmark

[Figures: AAPC completion time (ms) vs. message size (100 B to 8 KB) for the Mesh strategy and Direct all-to-all, labeled "A mystery?"; and radix sort completion time (s) vs. message size for Mesh and Direct.]

Page 44: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

44

CPU time vs Elapsed Time

[Figure: time (ms) vs. message size (76 B to 8076 B) for the Mesh all-to-all, showing total elapsed time ("Mesh") and the computation portion ("Mesh Compute").]

Time breakdown of an all-to-all operation using Mesh library

• Computation is only a small proportion of the elapsed time

• A number of optimization techniques have been developed to improve collective communication performance

Page 45: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

45

Asynchronous Collectives

Time breakdown of 2D FFT benchmark [ms]

• VPs implemented as threads

• Overlapping computation with waiting time of collective operations

• Total completion time reduced

[Figure: stacked time breakdown (1D FFT computation, communication, wait) for AMPI vs. native MPI on 4, 8, and 16 processors.]

Page 46: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

46

Shrink/Expand
• Problem: the availability of the computing platform may change
• Fit applications onto the platform by object migration

Time per step for the million-row CG solver on a 16-node cluster

Additional 16 nodes available at step 600

Page 47: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

47

Page 48: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

48

Projections
• Projections is designed for use with a virtualized model like Charm++ or AMPI
• Instrumentation is built into the runtime system
• Post-mortem tool with highly detailed traces as well as summary formats
• Java-based visualization tool for presenting performance information

Page 49: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

49

Trace Generation (Detailed)
• Link-time option "-tracemode projections"
  – In log mode, each event is recorded in full detail (including a timestamp) in an internal buffer
  – Memory footprint is controlled by limiting the number of log entries
  – I/O perturbation can be reduced by increasing the number of log entries
  – Generates a <name>.<pe>.log file for each processor and a <name>.sts file for the entire application
• Commonly used run-time options:
  +traceroot DIR
  +logsize NUM
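Putting the link-time and run-time options from this slide together into one illustrative command sequence (program name, processor counts, and paths are invented):

  charmc -o jacobi jacobi.c -language ampi -tracemode projections
  ./charmrun +p8 ./jacobi +vp32 +traceroot /tmp/traces +logsize 1000000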

Page 50: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

50

Visualization Main Window

Page 51: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

51

Post-mortem analysis: views
• Utilization graph
  – Mainly useful as a plot of processor utilization against time, and of time spent in specific parallel methods
• Profile: stacked graphs
  – For a given period, a breakdown of the time on each processor
  – Includes idle time, and message sending and receiving times
• Timeline
  – upshot-like, but more detailed
  – Pop-up views of method execution, message arrows, user-level events

Page 52: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

52

Page 53: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

53

Projections Views: continued

• Histogram of method execution times– How many method-execution instances had a time of 0-1 ms? 1-2

ms? ..

Page 54: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

54

Projections Views: continued

• Overview
  – A fast utilization chart for the entire machine across the entire time period

Page 55: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

55

Page 56: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

56

Projections Conclusions
• Instrumentation built into the runtime
• Easy to include in a Charm++ or AMPI program
• Working on:
  – Automated analysis
  – Scaling to tens of thousands of processors
  – Integration with hardware performance counters

Page 57: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

57

Multi-run analysis: in progress

• Collect performance data from different runs – On varying number of processors:

– See which functions increase in computation time:

• Algorithmic overhead

– See how the communication costs scale up

• per processor and total

Page 58: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

58

Page 59: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

59

Load balancing scenarios

• Dynamic creation of tasks
  – Initial vs. continuous
  – Coarse-grained vs. fine-grained tasks
  – Master-slave
  – Tree structured
  – Use "seed balancers" in Charm++/AMPI
• Iterative computations
  – When there is a strong correlation across iterations: measurement-based load balancers
  – When the correlation is weak
  – When there is no correlation: use seed balancers

Page 60: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

60

Measurement Based Load Balancing

• Principle of persistence
  – Object communication patterns and computational loads tend to persist over time
  – In spite of dynamic behavior
    • Abrupt but infrequent changes
    • Slow and small changes
• Runtime instrumentation
  – Measures communication volume and computation time
• Measurement-based load balancers
  – Use the instrumented database periodically to make new decisions
  – Many alternative strategies can use the database

Page 61: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

61

Periodic Load Balancing Strategies
• Stop the computation?
• Centralized strategies:
  – The Charm RTS collects data (on one processor) about the computational load and the communication for each pair of objects
  – If you are not using AMPI/Charm, you can do the same instrumentation and data collection yourself
  – Partition the graph of objects across processors
• Take communication into account
  – Point-to-point, as well as multicast over a subset
  – As you map an object, add to the load on both the sending and the receiving processor
  – (In the figure, the red communication is free if it is a multicast.)

Page 62: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

62

Object partitioning strategies
• You can use graph partitioners like METIS or K-R
  – BUT: the graphs are smaller, and the optimization criteria are different
• Greedy strategies (see the sketch after this slide)
  – If communication costs are low: use a simple greedy strategy
    • Sort objects by decreasing load
    • Maintain processors in a heap (keyed by assigned load)
    • In each step: assign the heaviest remaining object to the least-loaded processor
  – With small-to-moderate communication cost:
    • Same strategy, but add communication costs as you add an object to a processor
  – Always add a refinement step at the end:
    • Swap work from the most heavily loaded processor to "some other processor"
    • Repeat a few times or until no improvement
  – Refinement-only strategies
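A minimal sketch (invented here, not the Charm++ balancer code) of the simple greedy strategy just described: objects sorted by decreasing load, processors kept in a min-heap keyed by assigned load; communication costs are ignored.

  #include <vector>
  #include <queue>
  #include <algorithm>
  #include <functional>

  // obj[i] is the measured load of object i; returns the processor for each object.
  std::vector<int> greedyAssign(const std::vector<double> &obj, int nProcs) {
    std::vector<int> order(obj.size()), assign(obj.size());
    for (size_t i = 0; i < order.size(); i++) order[i] = (int)i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return obj[a] > obj[b]; });   // heaviest first

    typedef std::pair<double,int> Entry;                        // (assigned load, proc)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > procs;
    for (int p = 0; p < nProcs; p++) procs.push(Entry(0.0, p));

    for (int i : order) {
      Entry least = procs.top(); procs.pop();   // least-loaded processor
      assign[i] = least.second;
      least.first += obj[i];                    // add the object's load to it
      procs.push(least);
    }
    return assign;
  }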

Page 63: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

63

Object partitioning strategies

• When communication cost is significant:
  – Still use the greedy strategy, but:
    • At each assignment step, choose between assigning O to the least-loaded processor and to the processor that already has the objects that communicate most with O
      – Based on the degree of difference in the two metrics
      – Two-stage assignments: in early stages, consider communication costs as long as the processors are in the same (broad) load "class"; in later stages, decide based on load
• Branch-and-bound
  – Searches for the optimum, but can be stopped after a fixed time

Page 64: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

64

Crack Propagation

Decomposition into 16 chunks (left) and 128 chunks, 8 for each PE (right). The middle area contains cohesive elements. Both decompositions obtained using Metis. Pictures: S. Breitenfeld, and P. Geubelle

As computation progresses, crack propagates, and new elements are added, leading to more complex computations in some chunks

Page 65: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

65

Load balancer in action

[Figure: "Automatic Load Balancing in Crack Propagation": number of iterations per second vs. iteration number (1-91), annotated with (1) elements added, (2) load balancer invoked, (3) chunks migrated.]

Page 66: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

66

Distributed Load balancing

• Centralized strategies
  – Still OK at 3000 processors for NAMD
• Distributed balancing is needed when:
  – The number of processors is large, and/or
  – Load variation is rapid
• Large machines:
  – Need to handle locality of communication
    • Topology-sensitive placement
  – Need to work with scant global information
    • Approximate or aggregated global information (average/max load)
    • Incomplete global information (only a "neighborhood")
    • Work diffusion strategies (1980s work by the author and others!)
  – Achieving global effects by local action...

Page 67: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

67

Other features

• Client-server interface (CCS)
• Live visualization support
• Libraries:
  – Communication optimization libraries
  – 2D and 3D FFTs, CG, ...
• Debugger: freeze/thaw
• ...

Page 68: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

68

Scaling to PetaFLOPS machines: Advice

• Dice them fine:
  – Use a fine-grained decomposition
  – Just enough to amortize the overhead
• Juggle as much as you can
  – Keep communication operations in flight for latency tolerance
• Avoid synchronizations as much as possible
  – Use asynchronous reductions, and asynchronous collectives in general

Page 69: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

69

Grainsize control

• A simple definition of grainsize:
  – Amount of computation per message
  – Problem: this ignores whether the message is short or long
• More realistic:
  – Computation-to-communication ratio

Page 70: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

70

Grainsize Control Wisdom

• One may think that:
  – One should choose the largest grainsize that still generates sufficient parallelism
• In fact:
  – One should select the smallest grainsize that amortizes the overhead
• Total CPU time T:
  – T = Tseq + (Tseq/g) · Toverhead
  – where g is the grainsize and Toverhead is the overhead incurred per piece of work
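Spelling out the formula on this slide in LaTeX (the closing inequality is the natural reading of "amortize the overhead", stated here as an assumption rather than a quote from the talk):

  \[
    T \;=\; T_{\mathrm{seq}} \;+\; \frac{T_{\mathrm{seq}}}{g}\,T_{\mathrm{overhead}}
        \;=\; T_{\mathrm{seq}}\left(1 + \frac{T_{\mathrm{overhead}}}{g}\right),
  \]
  so the relative overhead is \(T_{\mathrm{overhead}}/g\); any \(g \gg T_{\mathrm{overhead}}\)
  (say \(g \ge 10\,T_{\mathrm{overhead}}\)) keeps \(T\) within a few percent of \(T_{\mathrm{seq}}\),
  which is why the smallest grainsize that amortizes the overhead suffices.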

Page 71: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

71

How to avoid Barriers/Reductions

• Sometimes they can be eliminated
  – With careful reasoning
  – Somewhat complex programming
• When they cannot be avoided, one can often render them harmless
  – Use an asynchronous reduction (not normal MPI)
  – E.g., in NAMD, energies need to be computed via a reduction and output
    • They are not used for anything except output
    • So use an asynchronous reduction, working in the background
    • When it reports to an object at the root, output it

Page 72: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

72

Asynchronous reductions: Jacobi

• Convergence check
  – At the end of each Jacobi iteration, we do a convergence check
  – Via a scalar reduction (on maxError)
• But note: each processor can maintain old data for one iteration
• So, use the result of the reduction one iteration later! (See the sketch below.)
  – The deposit of the reduction is separated from its result
  – MPI_Ireduce(..) returns a handle (like MPI_Irecv)
    • And later, MPI_Wait(handle) blocks only if the result is not yet available when you actually need it
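A hedged sketch of the pattern just described. The slide names AMPI's MPI_Ireduce extension; the call below uses the standard non-blocking allreduce of the same flavor (so that every rank can test convergence), and the Jacobi details are placeholders.

  #include <mpi.h>
  double doJacobiSweep();   // assumed application routine returning the local error

  void jacobiLoop(int maxIters, double tolerance) {
    double localError, maxError = 0.0;
    MPI_Request req = MPI_REQUEST_NULL;

    for (int iter = 0; iter < maxIters; iter++) {
      localError = doJacobiSweep();                 // application work (assumed)

      if (iter > 0) {                               // previous iteration's reduction
        MPI_Wait(&req, MPI_STATUS_IGNORE);          // blocks only if not yet complete
        if (maxError < tolerance) break;            // decision is one iteration late
      }
      // Deposit this iteration's contribution; it completes in the background.
      MPI_Iallreduce(&localError, &maxError, 1, MPI_DOUBLE, MPI_MAX,
                     MPI_COMM_WORLD, &req);
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);              // drain any outstanding reduction
  }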

Page 73: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

73

Asynchronous reductions in Jacobi

[Figure: two processor timelines for Jacobi. With a synchronous reduction, each compute phase is followed by an idle gap until the reduction completes; with an asynchronous reduction, the reduction proceeds in the background and the gap is avoided.]

Page 74: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

74

Asynchronous or Split-phase interfaces

• Notify/wait syncs in CAF

Page 75: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

75

Case Studies: Examples of Scalability
• A series of examples
  – Where we attained scalability
  – What techniques were useful
  – What lessons we learned
• Molecular dynamics: NAMD
• Rocket simulation

Page 76: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

76

Object Based Parallelization for MD:

Force Decomposition + Spatial Decomposition
• Now, we have many objects to load balance:
  – Each diamond can be assigned to any processor
  – Number of diamonds (3D): 14 · (number of patches)

Page 77: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

77

Bond Forces

• Multiple types of forces:
  – Bonds (2 atoms), angles (3), dihedrals (4), ...
  – Luckily, each involves atoms in neighboring patches only
• Straightforward implementation:
  – Send a message to all neighbors, receive forces from them
  – 26 × 2 messages per patch!
• Instead, we do:
  – Send to the (7) upstream neighbors
  – Each force is calculated at exactly one patch

[Figure: patches labeled A, B, C illustrating upstream neighbors.]

Page 78: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

78

700 VPs

192 + 144 VPs

30,000 VPs

Virtualized Approach to implementation: using Charm++

These 30,000+ virtual processors (VPs) are mapped to real processors by the Charm++ runtime system

Page 79: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

79

Page 80: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

80

Case Study: NAMD (Molecular Dynamics)

NAMD: Biomolecular Simulation on Thousands of Processors

J. C. Phillips, G. Zheng, S. Kumar, and L. V. Kale Proc. Of Supercomputing 2002

Gordon Bell Award

Unprecedented performance for this application.

[Figure: NAMD speedup vs. number of processors (up to 3000+) for cut-off and PME simulations of ATP synthase, reaching 1.02 TeraFLOPS.]

Page 81: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

81

Scaling to 64K/128K processors of BG/L

• What issues will arise?
  – Communication
    • Bandwidth use is more important than processor overhead
    • Locality
  – Global synchronizations
    • Costly, but not because they take longer
    • Rather, small "jitters" have a large impact
    • Sum of max vs. max of sum
  – Load imbalance is important, but low grainsize is crucial
  – Critical paths gain importance

Page 82: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

82

Electronic Structures using CP
• Car-Parrinello method
• Based on PINY MD
  – Glenn Martyna, Mark Tuckerman
• Data structures:
  – A bunch of states (say 128), represented as
    • 3D arrays of coefficients in g-space, and
    • also 3D arrays in real space
  – Real-space probability density
  – S-matrix: one number for each pair of states (for orthonormalization)
  – Nuclei
• Computationally:
  – Transformation from g-space to real space
    • Uses multiple parallel 3D FFTs
  – Sums up real-space densities
  – Computes energies from the density
  – Computes forces
  – Normalizes the g-space wave function

Page 83: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

83

One Iteration

Page 84: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

84

Parallel Implementation

Page 85: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

85

Page 86: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

86

Orthonormalization
• At the end of every iteration, after updating the electron configuration
• Need to compute (from the states):
  – a "correlation" matrix S, where S[i,j] depends on the entire data of states i and j
  – its transform T
  – then update the values
• The computation of S has to be distributed
  – Compute S[i,j,p], where p is the plane number
  – Sum over p to get S[i,j]
• The actual conversion from S to T is sequential
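In symbols (a restatement of the bullet above, with the plane index written explicitly):

  \[
    S_{ij} \;=\; \sum_{p} S_{ij,p},
  \]
  where \(S_{ij,p}\) is the per-plane partial contribution, so only these small partial results need to be combined across processors.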

Page 87: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

87

Orthonormalization

Page 88: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

88

Computation/Communication Overlap

Page 89: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

89

[Figure: the objects in the parallel CP implementation: g-space planes (integration, 1D FFT/IFFT), real-space planes (2D FFT/IFFT), rho (density) real-space planes, computation of forces on/by the nuclei, and pair-calculators.]

Page 90: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

90

Page 91: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

91

Page 92: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

92

Page 93: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

93

Page 94: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

94

Rocket Simulation
• Dynamic, coupled physics simulation in 3D
• Finite-element solids on an unstructured tet mesh
• Finite-volume fluids on a structured hex mesh
• Coupling every timestep via a least-squares data transfer
• Challenges:
  – Multiple modules
  – Dynamic behavior: burning surface, mesh adaptation

Robert Fielder, Center for Simulation of Advanced Rockets
Collaboration with M. Heath, P. Geubelle, and others

Page 95: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

95

Application Example: GEN2

[Figure: the GEN2 integrated simulation code, built on MPI/AMPI, and its components: Roccom, Rocman, Rocpanda, Rocblas, Rocface, Rocflo-MP, Rocflu-MP, Rocsolid, Rocfrac, and Rocburn2D (with its ZN, APN, and PY models), plus the mesh generation and partitioning tools Truegrid, Tetmesh, Metis, Gridgen, and Makeflo.]

Page 96: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

96

Rocket simulation via virtual processors

• Scalability challenges:
  – Multiple independently developed modules, possibly executing concurrently
  – An evolving simulation
    • Changes the balance between fluid and solid work
  – Adaptive refinements
  – Dynamic insertion of sub-scale simulation components
    • Crack-driven fluid flow and combustion
  – Heterogeneous (speed-wise) clusters

Page 97: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

97

Rocket simulation via virtual processors

[Figure: many Rocflo, Rocface, and Rocsolid virtual processors; each module is decomposed into its own set of VPs.]

Page 98: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

98

AMPI and Roc*: Communication

[Figure: Rocflo, Rocface, and Rocsolid virtual processors spread across the physical processors.]

By separating independent modules into separate sets of virtual processors, we gained the flexibility to deal with alternate formulations:
• Fluids and solids executing concurrently OR one after the other
• Changes in the pattern of load distribution within or across modules

Page 99: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

99

Page 100: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

100

Performance Prediction on Large Machines

• Problem:

– How to develop a parallel application for a non-existent machine?

– How to predict performance of applications on future machines?

– How to do performance tuning without continuous access to a large machine?

• Solution:

– Leverage virtualization

– Develop a machine emulator

– Simulator: accurate time modeling

– Run a program on “100,000 processors” using only hundreds of processors

Originally targeted to BlueGene/Cyclops

Now generalized (and used for BG/L)

Page 101: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

101

Why Emulate?

• Allows development of parallel software
  – Exposes scalability limitations in data structures, e.g.:
    • O(P) arrays are OK if P is not a million
  – Software is ready before the machine is

Page 102: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

102

How to emulate 1M processor apps

• Leverage processor virtualization
  – Let each virtual processor of Charm++ stand for a real processor of the emulated machine
  – Adequate if you want to emulate MPI apps on 1M processors!

Page 103: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

103

Emulation on a Parallel Machine

[Figure: each simulating (host) processor holds many simulated multi-processor nodes and simulated processors; 8M threads were emulated on 96 ASCI Red processors.]

Page 104: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

104

How to emulate 1M processor apps

• A twist: what if you want to emulate a Charm++ app?
  – E.g., 8M object VPs on a 1M-processor target machine?
  – A little runtime trickery:
  – The target processors are modeled as data structures, while the VPs remain VPs!

[Figure: several VPs attached to each modeled processor.]

Page 105: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

105

Memory Limit?

• Some applications have low memory use
  – Molecular dynamics
• [Some] large machines may have low memory per processor
  – E.g., BG/L: 256 MB for a 2-processor node
  – A BG/C design: 16-32 MB for a 32-processor node
• A more general solution is still needed:
  – Provided by the out-of-core execution capability of Charm++

Page 106: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

106

Message Driven Execution and Out-of-Core Execution

[Figure: the same per-processor scheduler and message queue as in the data-driven execution diagram.]

Virtualization leads to message-driven execution. So we can prefetch data accurately, which enables automatic out-of-core execution.

Page 107: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

107

A Success Story

• An emulation-based implementation of the lower layers, as well as Charm++ and AMPI, was completed last year
• As a result:
  – The BG/L port of Charm++/AMPI was accomplished in 1-2 days
  – Actually, 1-2 hours for the basic port
  – 1-2 days to fix an OS-level "bug" that prevented user-level multi-threading

Page 108: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

108

Emulator Performance

• Scalable

• Emulating a real-world MD application on a 200K processor BG machine

simulation time per step

[Figure: simulation time per step (seconds) vs. the number of host processors (4 to 64).]

Gengbin Zheng, Arun Singla, Joshua Unger, Laxmikant V. Kalé, ``A Parallel-Object Programming Model for PetaFLOPS Machines and Blue Gene/Cyclops'' in NGS Program Workshop, IPDPS02

Page 109: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

109

Performance Prediction
• How to predict component performance?
  – At multiple resolution levels
  – Sequential component:
    • user-supplied expression; timers; performance counters; instruction-level simulation
  – Communication component:
    • simple latency-based network model; contention-based network simulation
• Parallel Discrete Event Simulation (PDES)
  – A logical processor (LP) has a virtual clock
  – Events are time-stamped
  – The state of an LP changes when an event arrives at it
  – Protocols: conservative vs. optimistic
    • Conservative (examples: DaSSF, MPISIM)
    • Optimistic (examples: Time Warp, SPEEDES)

Page 110: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

110

Why not use existing PDES?

• Major synchronization overheads
  – Checkpointing overhead
  – Rollback overhead
• We can do better
  – Exploit the inherent determinacy of the parallel application
  – Most parallel programs are written to be deterministic

Page 111: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

111

Categories of Applications
• Linear-order applications
  – No wildcard receives
  – Strong determinacy; no timestamp correction necessary
• Reactive applications (atomic)
  – Message-driven objects
  – Methods execute as the corresponding messages arrive
• Multi-dependent applications
  – Irecvs with WaitAll
  – Uses structured dagger to capture dependencies

Gengbin Zheng, Gunavardhan Kakulapati, Laxmikant V. Kalé, ``BigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines '', in IPDPS 2004

Page 112: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

112

Architecture of the BigSim Simulator

[Figure: Charm++ and MPI applications run on the BigSim emulator, which sits on the Charm++ runtime with an online PDES engine and a load-balancing module; timing estimates come from an instruction simulator (RSim, IBM, ...), performance counters, or a simple network model; the simulation writes output trace logs, which are viewed with the Projections performance visualization.]

Page 113: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

113

Architecture of the BigSim Simulator (continued)

[Figure: the same architecture, with the output trace logs additionally processed by the BigNetSim network simulator (built on POSE, an offline PDES) before performance visualization with Projections.]

Page 114: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

114

Big Network Simulation

• Simulates network behavior: packetization, routing, contention, etc.
• Incorporated via post-mortem timestamp correction using POSE
• Currently models: torus (BG/L), fat-tree (QsNet)

[Figure: the BigSim emulator writes BG log files (tasks and dependencies); POSE timestamp correction produces timestamp-corrected tasks for BigNetSim.]

Page 115: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

115

BigSim Validation on Lemieux

Jacobi 3D MPI

[Figure: Jacobi 3D MPI: actual execution time and BigSim-predicted time (seconds) for 64, 128, 256, and 512 simulated processors.]

Page 116: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

116

Performance of BigSim

[Figure: speedup of the simulator itself as a function of the number of real processors used (PSC Lemieux).]

Page 117: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

117

FEM simulation
• Simple 2D structural simulation in AMPI
• 5-million-element mesh
• 16k BG processors simulated
• Running on only 32 PSC Lemieux processors

Page 118: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

118

Case Study: LeanMD
• Molecular dynamics simulation designed for large machines
• K-away cutoff parallelization
• Benchmark ER-GRE with 3-away
  – 36,573 atoms
  – 1.6 million objects, vs. 6000 in 1-away
  – 8-step simulation
  – 32k-processor BG machine, running on 400 PSC Lemieux processors
• Performance visualization tools

Page 119: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

119

Load Imbalance

Histogram

Page 120: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

120

Performance visualization

Page 121: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

121

Page 122: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

122

Component Frameworks

• Motivation
  – Reduce the tedium of parallel programming for commonly used paradigms
  – Encapsulate the required parallel data structures and algorithms
  – Provide an easy-to-use interface
    • Sequential programming style preserved
    • No alienating invasive constructs
  – Use the adaptive load balancing framework
• Component frameworks
  – FEM
  – Multiblock
  – AMR

Page 123: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

123

FEM framework

• Present a clean, "almost serial" interface:
  – Hide the parallel implementation in the runtime system
  – Leave physics and time integration to the user
  – Users write code similar to sequential code, or easily modify sequential code
• Input:
  – connectivity file (mesh), boundary data and initial data
• Framework:
  – Partitions the data, and
  – Starts a driver for each chunk in a separate thread
  – Automates communication, once the user registers the fields to be communicated
  – Automatic dynamic load balancing

Page 124: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

124

Why use the FEM Framework?
• Makes parallelizing a serial code faster and easier
  – Handles mesh partitioning
  – Handles communication
  – Handles load balancing (via Charm)
• Allows extra features
  – IFEM matrix library
  – NetFEM visualizer
  – Collision detection library

Page 125: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

125

Serial FEM Mesh

  Element   Surrounding Nodes
  E1        N1  N3  N4
  E2        N1  N2  N4
  E3        N2  N4  N5
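As an illustration of how such a connectivity table is typically held in code (generic C++, not the FEM framework's own data structure):

  // Element-to-node connectivity for the 5-node, 3-element mesh above.
  // conn[e][k] is the k'th node of element e (0-based: N1 -> 0, etc.).
  const int nElems = 3, nodesPerElem = 3;
  const int conn[nElems][nodesPerElem] = {
    {0, 2, 3},   // E1: N1 N3 N4
    {0, 1, 3},   // E2: N1 N2 N4
    {1, 3, 4},   // E3: N2 N4 N5
  };

  // Typical use: accumulate per-node contributions element by element.
  double nodeForce[5] = {0};
  for (int e = 0; e < nElems; e++)
    for (int k = 0; k < nodesPerElem; k++)
      nodeForce[conn[e][k]] += 1.0;   // placeholder for a real element force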

Page 126: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

126

Partitioned Mesh

  Chunk A
  Element   Surrounding Nodes
  E1        N1  N3  N4
  E2        N1  N2  N3

  Chunk B
  Element   Surrounding Nodes
  E1        N1  N2  N3

  Shared nodes
  A    B
  N2   N1
  N4   N3

Page 127: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

127

FEM Mesh: Node Communication

Summing forces from other processors only takes one call:

FEM_Update_field

Similar call for updating ghost regions

Page 128: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

128

FEM Framework Users: CSAR
• Rocflu fluids solver, a part of GENx
• Finite-volume fluid dynamics code
• Uses FEM ghost elements
• Author: Andreas Haselbacher

Robert Fielder, Center for Simulation of Advanced Rockets

Page 129: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

129

FEM Experience

• Previous:
  – 3-D volumetric/cohesive crack propagation code (P. Geubelle, S. Breitenfeld, et al.)
  – 3-D dendritic growth fluid solidification code (J. Dantzig, J. Jeong)
  – Adaptive insertion of cohesive elements (Mario Zaczek, Philippe Geubelle); performance data follows
• Multi-grain contact (in progress)
  – Spandan Maiti, S. Breitenfield, O. Lawlor, P. Geubelle
  – Using the FEM framework and collision detection
  – NSF-funded project
• Space-time meshes

Did the initial parallelization in 4 days

Page 130: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

130

Performance data: ASCI Red

Mesh with 3.1 million elements.

Speedup of 1155 on 1024 processors.

Page 131: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

131

Dendritic Growth
• Studies the evolution of solidification microstructures using a phase-field model computed on an adaptive finite element grid
• Adaptive refinement and coarsening of the grid involves re-partitioning

Jon Dantzig et al with O. Lawlor and Others from PPL

Page 132: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

132

“Overhead” of Multipartitioning

[Figure: time (seconds) per iteration vs. the number of chunks per processor (1 to 2048).]

Conclusion: the overhead of virtualization is small, and in fact it can help, by creating automatic adaptive overlap of communication and computation.

Page 133: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

133

Parallel Collision Detection
• Detect collisions (intersections) between objects scattered across processors
• Approach, based on Charm++ arrays:
  – Overlay a regular, sparse 3D grid of voxels (boxes)
  – Send objects to all voxels they touch
  – Collide objects within each voxel independently and collect the results
  – Leave collision response to user code
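A generic sketch (not the library's code) of the first step: computing which voxels of a regular grid an object's axis-aligned bounding box touches. The Box type and voxel size are assumptions; the object would then be sent to the chare-array element owning each of these voxels.

  #include <cmath>
  #include <vector>
  #include <array>

  struct Box { double lo[3], hi[3]; };   // axis-aligned bounding box (assumed)

  // Integer voxel coordinates overlapped by 'b' on a grid of cubic voxels.
  std::vector<std::array<int,3>> touchedVoxels(const Box &b, double voxelSize) {
    std::vector<std::array<int,3>> out;
    int lo[3], hi[3];
    for (int d = 0; d < 3; d++) {
      lo[d] = (int)std::floor(b.lo[d] / voxelSize);
      hi[d] = (int)std::floor(b.hi[d] / voxelSize);
    }
    for (int x = lo[0]; x <= hi[0]; x++)
      for (int y = lo[1]; y <= hi[1]; y++)
        for (int z = lo[2]; z <= hi[2]; z++)
          out.push_back({x, y, z});
    return out;
  }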

Page 134: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

134

Parallel Collision Detection Results
• About 2 µs per polygon; good speedups to 1000s of processors
• ASCI Red, 65,000 polygons per processor (scaled problem), up to 100 million polygons
• This was a significant improvement over the state of the art
• Made possible by virtualization, and:
  – Asynchronous, as-needed creation of voxels
  – Localization of communication: a voxel is often on the same processor as the contributing polygon

Page 135: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

135

Summary

• Processor virtualization is a powerful technique
• Charm++/AMPI are production-quality systems
  – with bells and whistles
  – Can scale to petaFLOPS-class machines
• Domain-specific frameworks
  – Can raise the level of abstraction and promote reuse
  – Unstructured mesh framework
• Next: compiler support, new coordination mechanisms
• Software available from http://charm.cs.uiuc.edu

Page 136: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

136

Optimizing for Communication Patterns

• The parallel-objects runtime system can observe, instrument, and measure communication patterns
  – Communication is from/to objects, not processors
  – Load balancers can use this to optimize object placement
  – Communication libraries can optimize
    • By substituting the most suitable algorithm for each operation
    • Learning at runtime

V. Krishnan, MS Thesis, 1996

Page 137: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

137

Molecular Dynamics: Benefits of avoiding barrier

• In NAMD:
  – The energy reductions were made asynchronous
  – No other global barriers are used in cut-off simulations
• This came in handy when:
  – Running on Pittsburgh's Lemieux (3000 processors)
  – The machine (plus our way of using the communication layer) produced unpredictable, random delays in communication
    • A send call would remain stuck for 20 ms, for example
• How did the system handle it?
  – See the timeline plots

Page 138: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

138

Golden Rule of Load Balancing

Golden rule: it is OK if a few processors idle, but avoid having processors that are overloaded with work.

Finish time = max over i { time on the i'th processor }
(excepting data-dependence and communication-overhead issues)

Example: 50,000 tasks of equal size, 500 processors:
  A: all processors get 99 tasks, except the last 5, which get 100 + 99 = 199
  OR
  B: all processors have 101 tasks, except the last 5, which get 1

Fallacy: the objective of load balancing is to minimize the variance in load across processors.
The two situations have identical variance, but situation A is much worse!
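The arithmetic behind the example (both distributions account for all 50,000 unit tasks):

  \[
    \text{A: } 495\cdot 99 + 5\cdot 199 = 50{,}000, \qquad
    \text{B: } 495\cdot 101 + 5\cdot 1 = 50{,}000,
  \]
  \[
    \text{finish}_A = \max_i t_i = 199, \qquad
    \text{finish}_B = \max_i t_i = 101,
  \]
  so A takes nearly twice as long even though the two load distributions have the same variance.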

Page 139: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

139

Amdahl's Law and grainsize
• Before we get to load balancing:
• Original "law":
  – If a program has a K% sequential section, then speedup is limited to 100/K
    • even if the rest of the program is parallelized completely
• Grainsize corollary:
  – If any individual piece of work takes more than K time units, and the sequential program takes Tseq, then speedup is limited to Tseq/K
• So:
  – Examine the performance data via histograms to find the sizes of the remappable work units
  – If some are too big, change the decomposition method to make smaller units

Page 140: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

140

Grainsize: LeanMD for Blue Gene/L

• BG/L is a planned IBM machine with 128k processors
• Here, we need even more objects:
  – Generalize the hybrid decomposition scheme from 1-away to k-away
  – 2-away: the cubes are half the size

Page 141: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

141

5000 vps

76,000 vps

256,000 vps

Page 142: 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale  Parallel Programming

142

New strategy