AMPI: Adaptive MPI
Gengbin Zheng
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign
Outline
• Motivation
• AMPI: Overview
• Benefits of AMPI
• Converting MPI Codes to AMPI
• Handling Global/Static Variables
• Running AMPI Programs
• AMPI Status
• AMPI References, Conclusion
AMPI: Motivation
Challenges:
• New-generation parallel applications are:
  - Dynamically varying: load shifting during execution
  - Adaptively refined
  - Composed of multi-physics modules
• Typical MPI implementations:
  - Not naturally suitable for dynamic applications
  - Available processor set may not match the algorithm
• Alternative: Adaptive MPI (AMPI)
  - MPI & Charm++ virtualization: VP (“Virtual Processors”)
AMPI: Overview
Virtualization: MPI ranks → Charm++ threads

[Figure: MPI “tasks” mapped onto real processors, implemented as user-level migratable threads (VPs: virtual processors)]
AMPI: Overview (cont.)
AMPI Execution Model:
• Multiple user-level threads per process
• Typically, one process per physical processor
• Charm++ Scheduler coordinates execution
• Threads (VPs) can migrate across processors
• Virtualization ratio: R = #VP / #P (over-decomposition)
[Figure: Charm++ Scheduler coordinating 4 VPs on one process (P=1, VP=4)]
AMPI: Overview (cont.)
AMPI’s Over-Decomposition in Practice

[Figure: MPI with P=4, ranks=4 vs. AMPI with P=4, VP=ranks=16]
Benefits of AMPI
Overlap between Computation/Communication
• Automatically achieved
• When one thread blocks for a message, another thread on the same processor can execute
• The Charm++ Scheduler picks the next thread among those that are ready to run
Benefits of AMPI (cont.)
Potentially Better Cache Utilization
• Gains occur when a subdomain is accessed repeatedly (e.g. by multiple functions called in sequence)

[Figure: smaller subdomains may fit in cache where a larger one does not]
Benefits of AMPI (cont.)
Thread Migration for Load Balancing

[Figure: migration of thread 13 to another processor]
Benefits of AMPI (cont.)
Load Balancing in AMPI: MPI_Migrate()
• Collective operation informing the load balancer that the threads can be migrated, if needed, for balancing load
• Easy to insert in the code of iterative applications (see the sketch below)
• Leverages the Load-Balancing framework of Charm++
• Balancing decisions can be based on:
  - Measured parameters: computation load, communication pattern
  - Application-provided information
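A minimal sketch of such a migration point in an iterative code; the per-iteration work timestep() and the 20-iteration interval are illustrative assumptions, not part of the original slide:

    #include <mpi.h>

    void timestep(void);  /* hypothetical per-iteration work */

    void solve(int nsteps) {
        int i;
        for (i = 0; i < nsteps; i++) {
            timestep();
            if (i % 20 == 0)
                MPI_Migrate();  /* AMPI extension: offer a migration point */
        }
    }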
Benefits of AMPI (cont.)
Decoupling of Physical/Virtual Processors
• Problem setup: 3D stencil calculation of size 240³, run on Lemieux
• AMPI runs on any number of PEs (e.g. 19, 33, 105); native MPI needs P = K³
[Plot: execution time (sec, log scale) vs. number of processors (10–1000), comparing Native MPI and AMPI]
Benefits of AMPI (cont.)
Asynchronous Implementation of Collectives
• Collective operation is posted and returns immediately
• Test/wait for its completion; meanwhile, do useful work:

      MPI_Ialltoall( … , &req);
      /* other computation */
      MPI_Wait(&req, MPI_STATUS_IGNORE);

• Other operations available: MPI_Iallreduce, MPI_Iallgather
Example: 2D FFT benchmark

[Bar chart: time (ms, 0–100) on 4, 8, and 16 processors for AMPI vs. Native MPI, broken into 1D FFT, All-to-all, and Wait phases]
Motivation for Collective Communication Optimization

[Plot: time (ms, 0–900) vs. message size (76–8076 bytes), comparing Mesh and Mesh Compute]
• Time breakdown of an all-to-all operation using the Mesh library
• Computation is only a small proportion of the elapsed time
• A number of optimization techniques have been developed to improve collective communication performance
Benefits of AMPI (cont.)
Fault Tolerance via Checkpoint/Restart
• State of the application is checkpointed to disk or memory
• Capable of restarting on a different number of physical processors!
• Synchronous checkpoint, collective call:
  - In-disk: MPI_Checkpoint(DIRNAME)
  - In-memory: MPI_MemCheckpoint(void)
• Restart:
  - In-disk: charmrun +p4 prog +restart DIRNAME
  - In-memory: automatic restart upon failure detection
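A minimal sketch of periodic checkpointing in an iterative code, using the collective calls named above; the interval, directory name, and timestep() helper are illustrative assumptions:

    #include <mpi.h>

    void timestep(void);  /* hypothetical per-iteration work */

    void run(int nsteps) {
        int i;
        for (i = 0; i < nsteps; i++) {
            timestep();
            if (i % 100 == 0)
                MPI_Checkpoint("ckpt");  /* in-disk; use MPI_MemCheckpoint()
                                            for the in-memory variant */
        }
    }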
Converting MPI Codes to AMPI
• AMPI needs its own initialization, before user code
• Fortran program entry-point: MPI_Main

      program pgm                subroutine MPI_Main
        ...              →         ...
      end program                end subroutine

• C program entry-point is handled automatically, via mpi.h – include it in the same file as main() if absent
If the code has no global/static variables, this is all that is needed to convert!
Handling Global/Static Variables
• Global and static variables are a problem in multi-threaded programs (similar problem in OpenMP):
  - Globals/statics have a single instance per process
  - They become shared by all threads in the process
• Example:

    time    Thread 1          Thread 2
     |      var = myid (1)
     |      MPI_Recv()
     |      (block...)
     |                        var = myid (2)
     |                        MPI_Recv()
     |                        (block...)
     v      b = var

  If var is a global/static, the incorrect value (2) is read!
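A sketch of the same hazard in C; the message payload and tag are illustrative:

    #include <mpi.h>

    int var;  /* one instance per process, shared by every AMPI thread in it */

    int worker(int myid) {
        int msg;
        var = myid;  /* thread 1 stores 1 ... */
        MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... and blocks; the scheduler runs thread 2, which sets var = 2 */
        return var;  /* thread 1 resumes and reads 2, not 1 */
    }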
Handling Global/Static Variables (cont.)
• General Solution: privatize variables in each thread
• Approaches:
a) Swap global variables
b) Source-to-source transformation via Photran
c) Use TLS scheme (in development)
Specific approach to use must be decided on a case-by-case basis
Handling Global/Static Variables (cont.)
First Approach: Swap Global Variables
• Leverages ELF – Executable & Linking Format (e.g. Linux)
• ELF maintains a Global Offset Table (GOT) for globals
• Switch GOT contents at thread context-switch
• Implemented in AMPI via build flag –swapglobals
+ No source code changes needed
+ Works with any language (C, C++, Fortran, etc.)
- Does not handle static variables
- Context-switch overhead grows with the number of variables
Handling Global/Static Variables (cont.)
Second Approach: Source-to-Source Transformation
• Move globals/statics to an object, then pass it around
• Automatic solution for Fortran codes: Photran
• Similar idea can be applied to C/C++ codes
+ Totally portable across systems/compilers
+ May improve locality and cache utilization
+ No extra overhead at context-switch
- Requires a new implementation for each language
Handling Global/Static Variables (cont.)
Example of Transformation: C Program
Original code vs. transformed code (a sketch follows below):
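A minimal illustrative sketch of this transformation; variable and type names are hypothetical, not the slide’s original listing:

    /* Original code (illustrative): */
    #include <stdio.h>

    int myrank;                        /* global: one copy per process   */

    void work(void) {
        printf("rank %d\n", myrank);
    }

    /* Transformed code (illustrative): */
    #include <stdio.h>

    typedef struct {                   /* globals gathered into a struct */
        int myrank;
    } Globals;

    void work(Globals *g) {            /* the struct is allocated per    */
        printf("rank %d\n", g->myrank); /* rank and passed explicitly    */
    }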
Handling Global/Static Variables (cont.)
Example of Photran Transformation: Fortran Program

[Side-by-side original and transformed Fortran code: globals are moved into a derived type that is passed between subroutines, analogous to the C example above]
Handling Global/Static Variables (cont.)
Photran Transformation Tool
• Eclipse-based IDE, implemented in Java
• Incorporates automatic refactorings for Fortran codes
• Operates on “pure” Fortran 90 programs
• Code transformation infrastructure:
  - Constructs rewritable ASTs
  - ASTs are augmented with binding information
Source: Stas Negara & Ralph Johnson
http://www.eclipse.org/photran/
Handling Global/Static Variables (cont.)

[Screenshot: Photran-AMPI GUI]
NAS Benchmark
[Performance charts]

FLASH Results
[Performance charts]
Handling Global/Static Variables (cont.)
Third Approach: TLS Scheme (Thread-Local Storage)
• Originally employed for kernel threads
• In C code, variables are annotated with __thread (see the sketch below)
• A modified/adapted gfortran compiler is available
+ Handles both globals and statics uniformly
+ No extra overhead at context-switch
- Although popular, not yet a standard for compilers
- Current Charm++ support only for x86 platforms
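A minimal sketch of the annotation; variable names are illustrative:

    /* Each thread gets its own copy, instead of one per process: */
    __thread int myrank;
    static __thread int call_count;   /* statics can be annotated too */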
Handling Global/Static Variables (cont.)
Summary of Current Privatization Schemes:
• Program transformation is very portable
• TLS scheme may become supported on Blue Waters, depending on work with IBM
Privatization     X86   IA64   Opteron  MacOS  IBM Power  SUN    IBM BG/P  CrayXT  Windows
Scheme
------------------------------------------------------------------------------------------
Prog. Transf.     Yes   Yes    Yes      Yes    Yes        Yes    Yes       Yes     Yes
Swap Globals      Yes   Yes    Yes      No     No         Maybe  No        No      No
TLS               Yes   Maybe  Yes      No     Maybe      Maybe  No        Yes     Maybe
NAS Benchmark
[Performance charts]
FLASH Results
FLASH is a parallel, multi-dimensional code used to study astrophysical fluids. Many astrophysical environments are highly turbulent, and have structure on scales varying from large scale, like galaxy clusters, to small scale, like active galactic nuclei, in the same system.
Object Migration
Object Migration
• How do we move work between processors?
• Application-specific methods
  - E.g., move rows of a sparse matrix, elements of an FEM computation
  - Often very difficult for the application
• Application-independent methods
  - E.g., move an entire virtual processor
  - Application’s problem decomposition doesn’t change
How to Migrate a Virtual Processor?
• Move all application state to the new processor:
  - Stack Data: subroutine variables and calls; managed by the compiler
  - Heap Data: allocated with malloc/free; managed by the user
  - Global Variables
  - Open files, environment variables, etc. (not handled yet!)
Stack Data
• The stack is used by the compiler to track function calls and provide temporary storage:
  - Local variables
  - Subroutine parameters
  - C “alloca” storage
• Most of the variables in a typical application are stack data
Migrate Stack Data
• Without compiler support, we cannot change the stack’s address
  - Because we can’t change the stack’s interior pointers (return frame pointer, function arguments, etc.)
• Solution: “isomalloc” addresses
  - Reserve address space on every processor for every thread stack
  - Use mmap to scatter stacks in virtual memory efficiently (see the sketch below)
  - Idea comes from PM2
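An illustrative sketch of the idea only, not AMPI’s actual implementation: reserve the same virtual-address slot for a given thread’s stack on every processor, so a migrated stack keeps its interior pointers valid. The helper name and flag choices are assumptions:

    #include <sys/mman.h>

    /* Map len bytes at exactly fixed_addr; the runtime must pick
       address ranges that are unused on every node. */
    void *reserve_stack_slot(void *fixed_addr, size_t len) {
        return mmap(fixed_addr, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    }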
Migrate Stack Data

[Diagram: Processor A’s memory and Processor B’s memory, each spanning 0x00000000–0xFFFFFFFF with code, globals, heap, and thread stacks; thread 3’s stack is about to migrate from A to B]
[Diagram: after migration, thread 3’s stack occupies the same reserved address range on Processor B]
Migrate Stack Data
• Isomalloc is a completely automatic solution
  - No changes needed in application or compilers
  - Just like a software shared-memory system, but with proactive paging
• But it has a few limitations
  - Depends on having large quantities of virtual address space (best on 64-bit)
    - 32-bit machines can only have a few gigs of isomalloc stacks across the whole machine
  - Depends on unportable mmap
    - Which addresses are safe? (We must guess!) What about Windows? Blue Gene?
Heap Data
• Heap data is any dynamically allocated data:
  - C “malloc” and “free”
  - C++ “new” and “delete”
  - F90 “ALLOCATE” and “DEALLOCATE”
• Arrays and linked data structures are almost always heap data
Migrate Heap Data
• Automatic solution: isomalloc all heap data, just like stacks!
  - “-memory isomalloc” link option
  - Overrides malloc/free
  - No new application code needed
  - Same limitations as isomalloc
• Manual solution: the application moves its heap data
  - Need to be able to size the message buffer, pack data into the message, and unpack on the other side
  - The “pup” abstraction does all three
Migrate Heap Data: PUP
• Same idea as MPI derived types, but the datatype description is code, not data
• Basic contract: here is my data
  - Sizing: counts up data size
  - Packing: copies data into message
  - Unpacking: copies data back out
  - The same call works for network, memory, disk I/O, ...
• Register a “pup routine” with the runtime
• F90/C interface: subroutine calls
  - E.g., pup_int(p,&x);
• C++ interface: operator| overloading
  - E.g., p|x;
Migrate Heap Data: PUP Builtins
• Supported PUP datatypes:
  - Basic types (int, float, etc.)
  - Arrays of basic types
  - Unformatted bytes
• Extra support in C++:
  - Can overload user-defined types: define your own operator|
  - Support for pointer-to-parent class: PUP::able interface
  - Supports STL vector, list, map, and string: “pup_stl.h”
  - Subclass your own PUP::er object
Migrate Heap Data: PUP C++ Example

    #include "pup.h"
    #include "pup_stl.h"

    class myMesh {
      std::vector<float> nodes;
      std::vector<int> elts;
    public:
      ...
      void pup(PUP::er &p) {
        p|nodes;
        p|elts;
      }
    };
Migrate Heap Data: PUP C Example

    struct myMesh {
      int nn, ne;
      float *nodes;
      int *elts;
    };

    void pupMesh(pup_er p, struct myMesh *mesh) {
      pup_int(p, &mesh->nn);
      pup_int(p, &mesh->ne);
      if (pup_isUnpacking(p)) { /* allocate data on arrival */
        mesh->nodes = (float *)malloc(mesh->nn * sizeof(float));
        mesh->elts = (int *)malloc(mesh->ne * sizeof(int));
      }
      pup_floats(p, mesh->nodes, mesh->nn);
      pup_ints(p, mesh->elts, mesh->ne);
      if (pup_isDeleting(p)) { /* free data on departure */
        deleteMesh(mesh);
      }
    }
Migrate Heap Data: PUP F90 Example

    TYPE myMesh
      INTEGER :: nn, ne
      REAL*4, ALLOCATABLE :: nodes(:)
      INTEGER, ALLOCATABLE :: elts(:)
    END TYPE

    SUBROUTINE pupMesh(p, mesh)
      USE ...
      INTEGER :: p
      TYPE(myMesh) :: mesh
      CALL fpup_int(p, mesh%nn)
      CALL fpup_int(p, mesh%ne)
      IF (fpup_isUnpacking(p)) THEN   ! allocate data on arrival
        ALLOCATE(mesh%nodes(mesh%nn))
        ALLOCATE(mesh%elts(mesh%ne))
      END IF
      CALL fpup_floats(p, mesh%nodes, mesh%nn)
      CALL fpup_ints(p, mesh%elts, mesh%ne)
      IF (fpup_isDeleting(p)) CALL deleteMesh(mesh)   ! free data on departure
    END SUBROUTINE
Global Data
• Global data is anything stored at a fixed place:
  - C/C++ “extern” or “static” data
  - F77 “COMMON” blocks
  - F90 “MODULE” data
• Problem if multiple objects/threads try to store different values in the same place (thread safety)
  - Compilers should make all of these per-thread; but they don’t!
• Not a problem if everybody stores the same value (e.g., constants)
Migrate Global Data
• Automatic solution: keep a separate set of globals for each thread and swap
  - “-swapglobals” compile-time option
  - Works on ELF platforms: Linux and Sun
  - Just a pointer swap, no data copying needed
  - Idea comes from the Weaves framework
  - One copy at a time: breaks on SMPs
• Manual solution: remove globals
  - Makes code threadsafe
  - May make code easier to understand and modify
  - Turns global variables into heap data (for isomalloc or pup)
How to Migrate a Virtual Processor?
• Move all application state to the new processor:
  - Stack data: automatic, with isomalloc stacks
  - Heap data: use “-memory isomalloc” – or – write pup routines
  - Global variables: use “-swapglobals” – or – remove globals entirely
Running AMPI Programs
• Build Charm++/AMPI if not yet available:
      ./build AMPI <version> <options>   (see README for details)
• Build the application with AMPI’s scripts:
      <charmdir>/bin/ampicc –o prog prog.c
• Run the application via charmrun:
      ampirun –np K prog   ≈   charmrun +pK prog
  - MPI’s machinefile ≈ Charm’s nodelist file
  - +p option: number of physical processors to use
  - +vp option: number of virtual processors to use
Running AMPI Programs (cont.)
• Multiple VP-to-P mappings are available:
      charmrun +p2 prog +vp8 +mapping <map>
  - RR_MAP: Round-Robin (cyclic)
  - BLOCK_MAP: Block (default mapping)
  - PROP_MAP: Proportional to processors’ speeds
• Example: VP=8, P=2, map=RR_MAP → P[0]: VPs 0,2,4,6; P[1]: VPs 1,3,5,7
• Other mappings can be easily added
  - Simple AMPI-lib changes needed (examples available)
  - The best mapping depends on the application
Running AMPI Programs (cont.)
• Optional: build the application with “isomalloc”:
      ampicc –o prog prog.c -memory isomalloc
  - Special memory allocator, helps in migration
• Run the application with modified stack sizes:
      ampirun –np K prog +vpM +tcharm_stacksize 1000
  - Size specified in bytes, valid for each thread
  - Default size: 1 MB; can be increased or decreased via command line
Running AMPI Programs (cont.)
Load Balancer Use:
• Link the program with LB modules:
      ampicc –o prog prog.c –module EveryLB
• Run the program with one of the available balancers:
      ampirun -np 4 prog +vp16 +balancer <SelectedLB>
  - e.g. GreedyLB, GreedyCommLB, RefineLB, etc.
• It is possible to define when to collect information for the load balancer during execution:
  - LBTurnInstrumentOn() and LBTurnInstrumentOff() calls (see the sketch below)
  - Used with the +LBOff, +LBCommOff command-line options
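A minimal sketch of restricting instrumentation to the main loop, using the calls named above; it assumes the program is launched with +LBOff +LBCommOff so instrumentation starts disabled, and timestep() and the interval are illustrative:

    #include <mpi.h>

    void timestep(void);  /* hypothetical per-iteration work */

    void run(int nsteps) {
        int i;
        LBTurnInstrumentOn();        /* start collecting load data        */
        for (i = 0; i < nsteps; i++) {
            timestep();
            if (i % 20 == 0)
                MPI_Migrate();       /* allow migration based on the data */
        }
        LBTurnInstrumentOff();       /* stop collecting load data         */
    }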
Running AMPI Programs (cont.)
Performance Analysis: Projections Tool (GUI)
      ampicc –o prog prog.c -tracemode projections
• Post-mortem visualization of performance data

[Screenshots: Projections performance views]
AMPI Status
• Compliance with the MPI-1.1 Standard
  - Missing: error handling, profiling interface
• Partial MPI-2 support:
  - Some new functions implemented when needed
  - ROMIO integrated for parallel I/O
  - Major missing features: dynamic process management, language bindings
  - Most missing features are documented
• Tested periodically via the MPICH-1 test suite
AMPI Applications
• Rocstar (rocket simulation)
• Fractography3D
• FLASH
• BRAM
AMPI References
• Charm++ site for manuals: http://charm.cs.uiuc.edu/manuals/
• Papers on AMPI: http://charm.cs.uiuc.edu/research/ampi/index.shtml#Papers
• AMPI source code: part of the Charm++ distribution, http://charm.cs.uiuc.edu/download/
• AMPI’s current funding support (indirect):
  - NSF/NCSA Blue Waters (Charm++, BigSim)
  - DoE – Colony2 project (Load Balancing, Fault Tolerance)
Conclusion
• AMPI makes exciting features from Charm++ available for many MPI applications!
• VPs in AMPI are used in BigSim to emulate processors of future machines – see next talk…
• We support AMPI through our regular mailing list: [email protected]
• Feedback on AMPI is always welcome

Thank You!
Questions?