TRANSCRIPT
Parallel Computing 101
Quentin F. Stout Christiane Jablonowski
University of Michigan
Copyright © 2008
Stout and Jablonowski – p. 1/324
Organization
Part I
Introduction, Terminology
Example (crash simulation)
Speedup and Efficiency, Amdahl's Law
Architectures
Distributed Memory Communication, MPI
Parallelizing Serial Programs I
Load Balancing I
Shared Memory, OpenMP
Stout and Jablonowski – p. 2/324
Organization cont.
Part II
Hybrid Computing
Vector Computing, Climate Modeling
Parallelizing Serial Programs II
Load Balancing II
Data Intensive Computing
Performance Improvement, Tools
Using and Buying Parallel Systems
Review, Wrapup
Stout and Jablonowski – p. 3/324
INTRODUCTION
In this part we introduce parallel computing and some useful terminology. We examine many of the variations in system architecture, and how they affect the programming options.
We will look at a representative example of a large scientific/engineering code, and examine how it was parallelized. We also consider some additional examples.
Stout and Jablonowski – p. 4/324
Why use Parallel Computers?
Parallel computers can be the only way to achieve specific computational goals at a given time.
PetaFLOPS and Petabytes for Grand Challenge problems
kilo-transactions per second for search engines, ATM networks, digital multimedia
Parallel computers can be the cheapest or easiest way to achieve a specific computational goal at a given time: e.g., cluster computers made from commodity parts.
Parallel computers can be made highly fault-tolerant:
nonstop computing at nuclear reactors
web search
Stout and Jablonowski – p. 5/324
Why Parallel Computing — continued
The universe is inherently parallel, so parallel models fit it best.
Physical processes occur in parallel: weather, galaxy formation, nuclear reactions, epidemics, ...
Social/work processes occur in parallel: ant colonies, wolf packs, assembly lines, stock exchange, tutorials, ...
Stout and Jablonowski – p. 6/324
Basic Terminology and Concepts
Caveats
The definitions are fuzzy, many terms are not standardized, and definitions often change over time.
Many algorithms, software, and hardware systems do not match the categories, often blending approaches.
No attempt is made to cover all models and aspects of parallel computing. For example, quantum computing is not included.
Stout and Jablonowski – p. 7/324
Parallel Computing Thesaurus
Parallel Computing: Solving a task by the simultaneous use of multiple processors, all components of a unified architecture.
Embarrassingly Parallel: Solving many similar, but independent, tasks. E.g., parameter sweeps.
Symmetric Multiprocessing (SMP): Multiple processors sharing a single address space and access to all resources.
Multi-core Processors: Multiple processors (cores) on a single chip. Aka many-core. Heterogeneous multi-core chips with GPUs are being developed.
Cluster Computing: Hierarchical combination of commodity units (processors or SMPs) to build a parallel system.
Stout and Jablonowski – p. 8/324
Thesaurus continued
Supercomputing: Use of the fastest, biggest machines to solve large problems. Historically vector computers, but now parallel or parallel/vector.
High Performance Computing: Solving problems via supercomputers + fast networks + visualization.
Pipelining: Breaking a task into steps performed by different units, with inputs streaming through, much like an assembly line.
Vector Computer: An operation such as multiply is broken into several steps and applied to a stream of operands (pipelining with “vectors”).
Stout and Jablonowski – p. 9/324
Pipelining, Detroit Style
Stout and Jablonowski – p. 10/324
Who Uses Supercomputers?
Historically, the military (nuclear simulations, cryptography). Weather forecasting was the main civilian application.
These continue to be major users, but there are now many more civilian users.
The following charts are from the Top 500 list, showing the status as of June. The newest list has just been announced and is on the Top500 website:
http://www.top500.org
Stout and Jablonowski – p. 11/324
Top500: Performance
Stout and Jablonowski – p. 12/324
Top500: Application Systems
Stout and Jablonowski – p. 13/324
Top500: Architecture Systems
Stout and Jablonowski – p. 14/324
Top500: Vendor Systems
Stout and Jablonowski – p. 15/324
CRASH SIMULATION
A greatly simplified model, based on parallelizing crash simulation for Ford Motor Company. Such simulations save a significant amount of money and time compared to testing real cars.
This example illustrates various phenomena which are common to a great many simulations and other large-scale applications.
Stout and Jablonowski – p. 16/324
Finite Element Representation
The car is modeled by a triangulated surface (the elements).
The simulation consists of modeling the movement of the elements during each time step, incorporating the forces on them to determine their new positions.
In each time step, the movement of each element depends on its interaction with the other elements that it is physically adjacent to.
Stout and Jablonowski – p. 17/324
The Car of the Future
Stout and Jablonowski – p. 18/324
Basic Serial Crash Simulation
1 For all elements
2 Read State(element), Properties(element),
Neighbor_list(element)
3 For time=1 to end_of_simulation
4 For element = 1 to num_elements
5 Compute State(element) for next time step,
based on previous state of element and its
neighbors, and on properties of element
Periodically, State is stored on disk for later visualization.
Stout and Jablonowski – p. 19/324
Simple approach to parallelization
Parallel computer based on PC-like processors linked with a fast network, where processors communicate via messages: distributed memory or message-passing.
Cannot parallelize time, so parallelize space.
Distribute elements to processors; each processor updates the positions of the elements it contains: owner computes.
All machines run the same program: SPMD, single program multiple data.
SPMD is the dominant form of parallel computing.
Stout and Jablonowski – p. 20/324
A Distributed Car
Stout and Jablonowski – p. 21/324
Basic Parallel Version
Concurrently for all processors P
1 For all elements assigned to P
2 Read State(element), Properties(element),
Neighbor-list(element)
3 For time=1 to end-of-simulation
4 For element = 1 to num-elements-in-P
5 Compute State(element) for next time step,
based on previous state of element and its
neighbors, and on properties of element
Stout and Jablonowski – p. 22/324
Software Engineering Aspects
Most parallel code is the same as, or similar to, the serial code, reducing parallel development and life-cycle costs, and helping keep parallel and serial versions compatible.
Life-cycle costs are often overlooked until it is too late!
Note that the high-level structure is the same as the serial version: a sequence of steps. The sequence is a serial construct, but the steps are performed in parallel.
Stout and Jablonowski – p. 23/324
Some Basic Questions: Allocation
How are elements assigned to processors?
Typically element assignment is determined by serial preprocessing, using domain decomposition approaches (load-balancing) described later.
Stout and Jablonowski – p. 24/324
Separation?
How does a processor keep track of adjacency info for neighbors in other processors?
Use ghost cells (halo) to copy remote neighbors, and add a translation table to keep track of their location and of which local elements are copied elsewhere.
Stout and Jablonowski – p. 25/324
Ghost Cells
Stout and Jablonowski – p. 26/324
Update?
How does a processor use State(neighbor) when it doesnot contain the neighbor element?
Could request state information from the processor containing the neighbor. However, it is more efficient if that processor sends it.
Stout and Jablonowski – p. 27/324
Coding and Correctness?
How does one manage the software engineering of the parallelization process?
Utilize an incremental parallelization approach.
Constantly check test cases to make sure the answers are correct.
Stout and Jablonowski – p. 28/324
Efficiency?
How do we evaluate the success of the parallelization, and if not successful, how do we improve it?
Evaluate via speedup or efficiency metrics; improve via profiling and iterative refinement.
Stout and Jablonowski – p. 29/324
Evaluating Parallel Programs
An important component of effective parallel computing is determining whether the program is performing well. If it is not running efficiently, or cannot be scaled to the target number of processors, then one needs to determine the causes of the problem and develop better approaches.
Stout and Jablonowski – p. 30/324
Definitions
For a given problem A, let
SerTime(n) = Time of the best serial algorithm to solve A for input of size n.
ParTime(n,p) = Time of the parallel algorithm + architecture to solve A for input of size n, using p processors.
Note that SerTime(n) ≤ ParTime(n,1).
Speedup: SerTime(n) / ParTime(n,p)
Work (cost): p · ParTime(n,p)
Efficiency: SerTime(n) / [p · ParTime(n,p)]
Stout and Jablonowski – p. 31/324
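To make these definitions concrete, here is a small C illustration (the measured times are hypothetical, not from the tutorial):

#include <stdio.h>

/* Compute speedup, work, and efficiency from measured run times:
   ser_time = time of the best serial algorithm,
   par_time = time of the parallel code on p processors. */
static void report(double ser_time, double par_time, int p)
{
    double speedup    = ser_time / par_time;
    double work       = p * par_time;
    double efficiency = ser_time / (p * par_time);
    printf("p=%d  speedup=%.2f  work=%.1f  efficiency=%.2f\n",
           p, speedup, work, efficiency);
}

int main(void)
{
    report(100.0, 16.0, 8);   /* hypothetical: 100 s serial, 16 s on 8 processors */
    return 0;                 /* prints speedup=6.25, efficiency=0.78 */
}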
In general, expect:
0 < Speedup ≤ p
Serial Work ≤ Parallel Work < ∞
0 < Efficiency ≤ 1
Technically, speedup is linear if there is a constant c > 0 so that the speedup is at least c · p. However, many use this term to mean c = 1.
This always involves some restriction on the relationship of p and n, e.g., p ≤ n, or p = √n.
Stout and Jablonowski – p. 32/324
Observed Speedup
[Figure: observed speedup vs. number of processors, with curves labeled Perfect, Occasional, and Common.]
Stout and Jablonowski – p. 33/324
Superlinear Speedup
Very rare. Some reasons for speedup > p (efficiency > 1):
The parallel computer has p times as much RAM, so a higher fraction of program memory is in RAM instead of on disk. An important reason for using parallel computers.
In developing the parallel program, a better algorithm was discovered; the older serial algorithm was not the best possible. A useful side-effect of parallelization.
The parallel computer is solving a slightly different, easier problem, or providing a slightly different answer. Questionable practice.
Stout and Jablonowski – p. 34/324
Amdahl’s Law
Amdahl [1967] noted: given a program, let f be the fraction of time spent on operations that must be performed serially. Then for p processors,
Speedup(p) ≤ 1 / (f + (1 − f)/p).
(The right-hand side assumes perfect parallelization of the (1 − f) part of the program.)
Thus no matter how many processors are used:
Speedup ≤ 1/f
Unfortunately, typically f was 10 – 20%.
Stout and Jablonowski – p. 35/324
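As an illustration (a sketch, not code from the tutorial), Amdahl's bound is easy to evaluate directly:

#include <stdio.h>

/* Upper bound on speedup with serial fraction f on p processors (Amdahl's Law). */
static double amdahl_bound(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    double f = 0.1;                                   /* 10% serial work */
    for (int p = 1; p <= 1024; p *= 2)
        printf("p=%4d  speedup <= %.2f\n", p, amdahl_bound(f, p));
    return 0;                                         /* bound approaches 1/f = 10 */
}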
Useful rule of thumb:
If the maximal possible speedup is S (i.e., f = 1/S), then S processors run at about 50% efficiency, since Speedup(S) ≤ 1/(1/S + (1 − 1/S)/S) = S²/(2S − 1) ≈ S/2.
Stout and Jablonowski – p. 36/324
Maximal Possible Speedup
[Figure: maximal possible speedup vs. number of processors (1–1024) for f = 0.1, 0.01, 0.001; each curve levels off near its limit of 1/f.]
Stout and Jablonowski – p. 37/324
Maximal Possible Efficiency
[Figure: maximal possible efficiency vs. number of processors (1–1024) for f = 0.1, 0.01, 0.001; efficiency falls as the number of processors grows.]
Stout and Jablonowski – p. 38/324
Amdahl Was an Optimist
Parallelization usually adds work, typically communication, which reduces speedup.
For example, a crash simulation typically runs for a fixed simulated time interval. Due to the physics of the situation, if one uses n finite elements, the number of time steps grows like √n, so the serial processor time grows like
C1 · n^1.5
for some C1 > 0.
Stout and Jablonowski – p. 39/324
Additional Parallel Communication
Suppose we use p processors. Every time step, processors receive and send information about border elements. There is also periodic global communication of total energy, contact, etc.
For simple approaches, communication time grows like
√n · (C2 · p + C3 · √(n/p)),   with C2, C3 > 0
Stout and Jablonowski – p. 40/324
Effect of Communication
Suppose C2 = C1 = 10 and C3 = 1. Then for n = 1000 we get the following speedup.
[Figure: speedup vs. number of processors (1–1024) for this model; the speedup rises into the teens and then declines as communication dominates.]
Stout and Jablonowski – p. 41/324
Amdahl was a Pessimist
Amdahl convinced many that general-purpose parallel computing was not viable. Fortunately, we can skirt the law.
Algorithm: There may be new algorithms with much smaller values of f — necessity is the mother of invention.
Memory hierarchy: Possibly more time is spent in RAM than on disk — superlinear speedup.
Scaling: Usually the time spent in the serial portion of the code is a decreasing fraction of the total time as the problem size increases — scaling.
Stout and Jablonowski – p. 42/324
Common Program Structure
[Diagram of a typical program: a serial section that grows slowly with n, another serial section that grows slowly with n, a parallelizable loop that grows with n, a parallelizable loop within a loop that grows very rapidly with n, and a serial section taking fixed time.]
Sometimes serial portions grow with problem size, but much slower than the total time.
I.e., Amdahl's “f” decreases as n increases.
Stout and Jablonowski – p. 43/324
Scaling
For such programs, one can often exploit large parallel machines by scaling the problems to larger instances.
To illustrate, use a model like the crash simulation:
SerTime(n) = 10 · n^1.5
and the time for p parallel processors grows like
ParTime(n,p) = 10 · n^1.5/p + 10 · p · √n + n/√p
Stout and Jablonowski – p. 44/324
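A small sketch (using the model above; not code from the tutorial) that tabulates the predicted speedup and efficiency for n = 1000:

#include <math.h>
#include <stdio.h>

/* Model from the slides: SerTime(n) = 10*n^1.5,
   ParTime(n,p) = 10*n^1.5/p + 10*p*sqrt(n) + n/sqrt(p). */
static double ser_time(double n)           { return 10.0 * pow(n, 1.5); }
static double par_time(double n, double p) { return ser_time(n) / p + 10.0 * p * sqrt(n) + n / sqrt(p); }

int main(void)
{
    double n = 1000.0;
    for (int p = 1; p <= 1024; p *= 2) {
        double s = ser_time(n) / par_time(n, (double) p);
        printf("p=%4d  speedup=%7.2f  efficiency=%.3f\n", p, s, s / p);
    }
    return 0;   /* compile with -lm */
}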
Fixed Size per Processor
Fixing the amount of data per processor usually gives the highest efficiency possible, hence it is commonly cited. Called weak scaling.
Suppose each processor can hold 1000 elements.
[Figure "Constant Size per Processor": efficiency vs. number of processors (1–1024); efficiency declines slowly, from near 1.0 down to roughly 0.5 at 1024 processors.]
Stout and Jablonowski – p. 45/324
Fixed Time
Fix the time, and find the largest problem solvable in that time. Commonly used in evaluating database servers, transactions per second. [Gustafson 1988] considered this for general computing.
Fix the time to be SerTime(1000).
[Figure "Constant Time": largest solvable problem size n vs. number of processors (1–1024), with n shown on a scale up to about 16000.]
Stout and Jablonowski – p. 46/324
Fixed Efficiency
Fix the efficiency, and find the smallest problem needed to achieve that efficiency (isoefficiency analysis).
For example, for 90% efficiency:
[Figure "Constant Efficiency of 0.9": smallest problem size n needed vs. number of processors (1–1024), with n on a logarithmic scale from 100 to 10,000,000.]
Stout and Jablonowski – p. 47/324
Scalability
Linear speedup is very rare, due to communication overhead, load imbalance, algorithm/architecture mismatch, etc.
Several attempts have been made to give definitions for scalable architectures, algorithms, or algorithm-architecture combinations. However, for most users, the important question is:
Have I achieved acceptable performance on my software/hardware system for a suitable range of data and machine sizes?
Stout and Jablonowski – p. 48/324
ARCHITECTURAL TAXONOMIES
These classifications provide ways to think about problems and their solution.
The classifications are in terms of hardware, but there are natural software analogues.
Note: many systems blend approaches, and do not exactly correspond to the classifications.
Stout and Jablonowski – p. 49/324
Flynn’s Instruction/Data Taxonomy
[Flynn, 1966] At any point in time a machine can have {S|M}I {S|M}D:
SI Single Instruction: All processors execute the same instruction. Usually involves a central controller.
MI Multiple Instruction: Different processors may be executing different instructions.
SD Single Data: All processors are operating on the same data.
MD Multiple Data: Different processors may be operating on different data.
Stout and Jablonowski – p. 50/324
SISD: standard serial computer and program.
MISD is rare — some extreme fault-tolerance schemes, using different computers and programs to operate on the same input data, are of this type.
Almost all parallel computers are MIMD.
SIMD: there used to be companies that made such systems (Thinking Machines’ Connection Machine was the most famous).
Vector computing is a form of SIMD.
Stout and Jablonowski – p. 51/324
A SIMD System
[Diagram: a controller holding the program broadcasts instructions to a set of processors, each holding its own data.]
Stout and Jablonowski – p. 52/324
SIMD Software
Data parallel software — do the same thing to all elements of a structure (e.g., many matrix algorithms). Easy to write and understand. Unfortunately, difficult to apply to complex problems (as were the SIMD machines).
SPMD, Single Program Multiple Data: can be viewed as an extension of the SIMD approach to programming for MIMD systems.
Stout and Jablonowski – p. 53/324
Memory Systems: Distributed Memory
All memory is associated with processors.
To retrieve information from another processor’s memory, a message must be sent over the network to the home processor. Usually one organizes the program so that the owner sends it to the requestor before being asked.
Advantages:
Memory is scalable with the number of processors.
Each processor has rapid access to its own memory without interference or cache coherency problems.
Cost effective and easier to build: can use commodity parts.
Stout and Jablonowski – p. 54/324
Disadvantages
The programmer is responsible for many of the details of the communication; it is easy to make mistakes.
It may be difficult to distribute the data structures; one often needs to revise them to add additional pointers.
Stout and Jablonowski – p. 55/324
Memory Systems: Shared Memory
Global memory space, accessible by all processors
Processors may have local memory to hold copies ofsome global memory.
Consistency of these copies is usually maintained byhardware.
Advantages:
Global address space is user-friendly; a program may be able to use global data structures efficiently and with little modification.
Data sharing between tasks is fast.
Stout and Jablonowski – p. 56/324
Disadvantages
The system may suffer from a lack of scalability between memory and CPUs. Adding CPUs increases traffic on the shared memory-to-CPU path. This is especially true for cache coherent systems.
The programmer is responsible for correct synchronization.
Needs some special-purpose components.
Stout and Jablonowski – p. 57/324
Shared vs. Distributed
[Diagram. SHARED MEMORY: processors (each with a cache) connected through a network to a common memory. DISTRIBUTED MEMORY: nodes, each consisting of a processor + cache + memory, connected by a network.]
Stout and Jablonowski – p. 58/324
Shared Memory Access Time
Two classes of SM systems based on memory access time:
Uniform Memory Access (UMA):
Most commonly represented by Symmetric Multiprocessor (SMP) machines with identical processors.
Equal access times to memory.
Some systems are CC-UMA (cache coherent UMA): if one processor updates a variable in shared memory, all the other processors know about the update.
Stout and Jablonowski – p. 59/324
SM Access Time continued
Non-Uniform Memory Access (NUMA):
Often made by physically linking two or more SMPs
One SMP can directly access the memory of another SMP (not message-passing).
Memory access times are not uniform; memory access across a link is slower.
Cache coherent systems: CC-NUMA
Stout and Jablonowski – p. 60/324
Shared Memory on Distributed Memory
As we’ll see later, it is usually easier to parallelize a program on a shared memory system.
However, most systems are distributed memory because of the cost advantages.
To gain both advantages, people have investigated virtual shared memory, or global address space (GAS), using software to simulate shared memory access.
Current projects include Unified Parallel C (UPC) and Co-Array Fortran.
Stout and Jablonowski – p. 61/324
Virtual Shared Memory Performance
Communication time in distributed memory machines is quite high. Thus virtual shared memory access is highly nonuniform, being vastly faster if the data is stored with the processor requesting it.
Because of these access delays, the performance of these systems is not good, even if reasonable care is taken, but may be justified by greatly reduced programmer time.
Software and hardware models need not match, though there are often performance problems when they don't.
Stout and Jablonowski – p. 62/324
Communication Network
There are many ways that the processors can be interconnected, but for the user the differences are usually minor. Two main classes that do have some impact:
Bus: Processors (and memory) connected to a common bus or busses, much like a local Ethernet.
Memory access is fairly uniform, but not very scalable due to contention.
Switching Network: Processors (and memory) connected to routing switches as in the telephone system.
Usually NUMA, blocking, though a cross-bar is non-blocking (but a cross-bar is not scalable).
Stout and Jablonowski – p. 63/324
Networks
[Diagram: two networks, a bus and a multistage interconnect, connecting switches and processors.]
Stout and Jablonowski – p. 64/324
Example: Symmetric Multiprocessors
Shared memory system; processors share the work.
When a processor reads or writes RAM, the data is transported over a bus, and a local copy is kept in the processor's cache.
Rules are needed to ensure that different caches don’t contain different values for the same memory locations (cache coherency). This is easier on bus-based systems than on more general interconnection networks.
Because all processors use the same memory bus, there is limited scalability due to bus contention.
Multicore processors, which are SMPs, are becoming the standard processors in all systems.
Stout and Jablonowski – p. 65/324
Low-Cost Parallel Systems
Systems built from commodity parts are becoming widespread due to low cost and acceptable performance.
Clusters (NOW, Beowulfs, etc.): commodity processor boards with multicore processors and commodity interconnects (e.g., Gigabit Ethernet). Often rack mounted.
SMPs: quite common as departmental servers.
Clusters of SMP nodes: rapidly gaining in importance, small SMPs available rack-mounted. Sometimes called clumps.
Stout and Jablonowski – p. 66/324
However, communication on low-cost clusters is often slow, typically due to software which relies on the basic networking stack. Some companies (Myrinet, Force10, etc.) market high-speed networks and special software to reduce this.
Constellations use much larger, much more expensive, shared memory units as nodes in a distributed memory system. Usually a high-performance interconnect is used between the nodes.
Note: Many clusters are primarily used for embarrassingly parallel computation and do not need high-performance networking.
Stout and Jablonowski – p. 67/324
The Memory Hierarchy
The mismatch of processor speed and memory speed causes a bottleneck. There is an inverse relationship between memory speed and $/byte, and there are physical constraints on the size of memory. Thus memory is arranged in a hierarchy:
registers
cache (perhaps itself hierarchical)
RAM (“primary memory”)
disk (“secondary memory”)
tapes or CDs (“tertiary memory”)
Stout and Jablonowski – p. 68/324
Speed-Size Tradeoff
[Figure: the speed-size tradeoff of the memory hierarchy:
cache: ~ MByte, ~ nanosecond access
RAM: ~ GByte, ~ 100 nanosecond access
disk: ~ 100 GByte, ~ 10 millisecond access
tape: ~ 100 TByte, ~ minute access]
When moving between levels beyond the registers, an entire block is moved at once (cache lines, pages). Effective high-performance computing (serial or parallel) includes arranging data and program so that the entire block is used while resident in the faster memory.
Stout and Jablonowski – p. 69/324
Multiprocessor Caching
Parallel computing compounds the memory hierarchy: remote memory is far slower to access than local memory.
Caching is widely used, fetching blocks of data instead of individual items. Data is fetched when referenced, sometimes prefetched before it is needed.
If data locality is high, then the effective memory access time is decreased.
Reduces network traffic.
However, it creates a cache coherence problem.
False sharing caused by cache lines can significantly degrade performance.
Stout and Jablonowski – p. 70/324
MESSAGE PASSING
On distributed memory systems, also called message passing systems, communication is often an important aspect of performance and correctness.
Stout and Jablonowski – p. 71/324
Communication Speed
On most distributed memory systems, messages are relatively slow, with startup (latency) times taking thousands of cycles (and far more for many clusters).
Typically, once the message has started, the additional time per byte (bandwidth) is relatively small.
Stout and Jablonowski – p. 72/324
Measured Performance
For example, a 4.7 GHz IBM Power 6 (p575) processor, best case MPI messages (discussed later):
processor speed: 4700 cycles per microsecond (µsec), 4 flops/cycle, 18800 flops per µsec.
MPI message latency, caused by software: ≈ 1.3 µsec = 24,400 flops
message bandwidth, usually limited by hardware: ≈ 2500 bytes per µsec = 7.5 flops/byte
Your performance may vary!
Stout and Jablonowski – p. 73/324
Reducing Latency
Reducing the effect of high latency is often important for performance. Some useful approaches:
Reduce the number of messages by mapping communicating entities onto the same processor.
Combine messages having the same sender and destination.
If processor P has data needed by processor Q, have P send it to Q, rather than Q first requesting it. P should send as soon as the data is ready; Q should read as late as possible to increase the probability that the data has arrived.
Send Early, Receive Late, Don’t Ask but Tell.
Stout and Jablonowski – p. 74/324
Messages and Computations
Even when data is sent far in advance of its use, message passing can cause performance degradation. One can try to overlap communication and calculation.
Unfortunately:
Many systems are incapable of doing this.
Latency is dominantly due to software; initiating a message ties up the processor.
Even with a co-processor, the memory bus may be tied up, interfering with the main processor's use of it.
Expensive communication systems try to overcome these problems.
Stout and Jablonowski – p. 75/324
Deadlock
If messages are blocking, i.e., if a processor can’t proceed until the message is finished, then one can reach deadlock, where no processor can proceed.
Example: Processor A sends a message to B while B sends to A. With blocking sends, neither finishes until the other finishes receiving, but neither starts receiving until its send has finished.
This can be avoided by A doing send then receive, while B does receive then send. However, this is often difficult to coordinate when there are many processors.
Stout and Jablonowski – p. 76/324
It is often easiest to prevent deadlock by non-blocking communication, where a processor can send and proceed before the receive is finished.
However, this requires receiver buffer space, which may fill (reducing to the blocking case), and extra copying of messages, reducing performance.
Stout and Jablonowski – p. 77/324
Message Passing Interface — MPI
An important communication standard. We will show some snippets of MPI to illustrate some of the issues, but MPI is a major topic that we cannot address in detail. Fortunately, many programs need only a few MPI features. There are many implementations of MPI:
MPICH homepage http://www-unix.mcs.anl.gov/mpi
Open MPI homepage http://www.open-mpi.org/
Stout and Jablonowski – p. 78/324
Some Reasons for Using MPI
Standardized, with a process to keep it evolving.
Available on almost all parallel systems (free MPICH and Open MPI are used on many clusters), with interfaces for C and Fortran.
Supplies many communication variations and optimized functions for a wide range of needs.
Supports large program development and integration of multiple modules.
Many powerful packages and tools are based on MPI.
Stout and Jablonowski – p. 79/324
While MPI is large (> 100 functions), one usually needs very few functions (6-10), giving a gentle learning curve.
Various training materials, tools and aids exist for MPI.
Good introductory MPI tutorial:
http://www.llnl.gov/computing/tutorials/mpi/
Basic and advanced MPI tutorials, e.g. on I/O and one-sided communication:
http://www-unix.mcs.anl.gov/mpi/tutorial/
Writing MPI-based parallel codes helps preserve your investment as systems change.
Stout and Jablonowski – p. 80/324
MPI Basics
The overwhelmingly most frequently used MPI commands are variants of
MPI_SEND() to send data, and
MPI_RECV() to receive it.
These function very much like write & read statements.
Point-to-point communication
MPI_SEND() and MPI_RECV() are blocking operations.
Blocking communication can be unsafe and may lead to deadlocks.
Stout and Jablonowski – p. 81/324
Blocking MPI Communication
MPI_SEND() does not complete until the communication buffer is empty.
MPI_RECV() does not complete until the communication buffer is full.
The send-recv handshake works for small messages, but might fail for large messages.
The allowable size of the message depends on the MPI implementation (buffer sizes), and could also be hardware-dependent.
Even if it works, the data usually gets copied into a memory buffer.
Copies are slow (avoid them): poor performance.
Stout and Jablonowski – p. 82/324
Non-Blocking MPI Communication
Better solution: use non-blocking operations
MPI_ISEND()
MPI_IRECV()
MPI_WAIT()
The user can also check for the data at a later stage in the program without waiting:
MPI_TEST()
Non-blocking operations boost performance.
Other non-blocking send and receive operations are available.
Possible overlap of communication with computation.
However, few systems can provide the overlap; it is often already limited by the memory bandwidth.
Stout and Jablonowski – p. 83/324
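A minimal, compilable sketch of a non-blocking exchange (illustrative; the ring pattern and message length are assumptions, not from the slides):

#include <mpi.h>
#include <stdio.h>

#define N 1000                                   /* assumed message length */

int main(int argc, char *argv[])
{
    int rank, size;
    double send_buf[N], recv_buf[N];
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < N; i++) send_buf[i] = rank;

    int right = (rank + 1) % size;               /* ring neighbors */
    int left  = (rank - 1 + size) % size;

    /* Post the receive and the send without blocking, then wait for both;
       unlike two blocking sends facing each other, this cannot deadlock. */
    MPI_Irecv(recv_buf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... computation that does not need recv_buf could overlap here ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    printf("rank %d received a message from rank %d\n", rank, left);

    MPI_Finalize();
    return 0;
}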
MPI Initialization
Near the beginning of the program, include
#include "mpi.h"MPI_Init(&argc, &argv)MPI_Comm_rank(MPI_COMM_WORLD, &my_rank)MPI_Comm_size(MPI_COMM_WORLD,
&num_processors)
These help each processor determine its role in the overallscheme.
There is MPI_Finalize() at the end.
These 4 MPI functions, together with the MPI send and receive operations, are already sufficient for simple applications.
Stout and Jablonowski – p. 84/324
MPI Example
Each processor sends value to proc. 0, which adds them.
[Diagram: processors 1–8 each send their value to processor 0.]
Stout and Jablonowski – p. 85/324
Basic Program
initialize
if (my_rank == 0) {
    sum = 0.0;
    for (source = 1; source < num_procs; source++) {
        MPI_RECV(&value, 1, MPI_FLOAT, source, tag,
                 MPI_COMM_WORLD, &status);
        sum += value;
    }
} else {
    MPI_SEND(&value, 1, MPI_FLOAT, 0, tag,
             MPI_COMM_WORLD);
}
finalize
Stout and Jablonowski – p. 86/324
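For reference, a complete, compilable version of this example might look as follows (a sketch; using the rank as the value each processor contributes is an assumption for illustration):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int my_rank, num_procs, source, tag = 0;
    float value, sum;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    value = (float) my_rank;                 /* each processor's contribution */

    if (my_rank == 0) {
        sum = 0.0;
        for (source = 1; source < num_procs; source++) {
            MPI_Recv(&value, 1, MPI_FLOAT, source, tag,
                     MPI_COMM_WORLD, &status);
            sum += value;
        }
        printf("sum = %f\n", sum);
    } else {
        MPI_Send(&value, 1, MPI_FLOAT, 0, tag, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}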
Improving Performance
In the initial version, processor 0 received the messages in processor order. However, if processor 1 delayed sending its message, then processor 0 would also be delayed.
For a more efficient version, modify MPI_RECV to
MPI_Recv(&value, 1, MPI_FLOAT, MPI_ANY_SOURCE, tag,
         MPI_COMM_WORLD, &status);
Now processor 0 can start processing messages as soon as any arrives.
Stout and Jablonowski – p. 87/324
Reduction Operations
Operations such as summing are common, combining data from every processor into a single value. These reduction operations are so important that MPI provides direct support for them, and parallelizing compilers recognize them and generate efficient code.
Could replace all the communication with
MPI_REDUCE(&value, &sum, 1, MPI_FLOAT,
           MPI_SUM, 0, MPI_COMM_WORLD)
Examples of reduction operations:
MPI_SUM, MPI_MAX, MPI_MIN, MPI_PROD
MPI_LAND (logical and), MPI_LOR (logical or)
Stout and Jablonowski – p. 88/324
Collective Communication
The opposite of reduction is broadcast: one processor sends to all others.
Reduction, broadcast, and others are collective communication operations, the next most frequently invoked MPI routines after send and receive.
MPI collective communication routines improve clarity, run faster, and reduce the chance of programmer error.
Stout and Jablonowski – p. 89/324
Collective Communication
[Diagram. Broadcast: P0's value A is copied to all of P0–P3. Scatter: P0's values A, B, C, D are distributed, one per processor. Gather: the values A, B, C, D held by P0–P3 are collected onto P0.]
Stout and Jablonowski – p. 90/324
Collective Communication
[Diagram. All gather: each processor's value (A, B, C, D) ends up on every processor. All to all: processor P0 starts with A0–A3, P1 with B0–B3, etc.; afterwards P0 holds A0, B0, C0, D0, P1 holds A1, B1, C1, D1, and so on.]
Stout and Jablonowski – p. 91/324
MPI Synchronization
Synchronization is provided
implicitly by
blocking communication
collective communication
explicitly by
MPI_Wait, MPI_Waitany operations for non-blocking communication: may be used to synchronize a few or all processors
the MPI_Barrier statement: blocks until all MPI processes have reached the barrier
Avoid synchronizations as much as possible to boost performance.
Stout and Jablonowski – p. 92/324
MPI Datatypes
Predefined basic datatypes, corresponding to the underlying programming language; examples are
Fortran: MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION
C: MPI_INT, MPI_FLOAT, MPI_DOUBLE
Derived datatypes:
Vector: data separated by a constant stride
Contiguous: vector with stride 1
Struct: general mixed types (e.g. for a C struct)
Indexed: array of indices
Stout and Jablonowski – p. 93/324
MPI Datatype: Vector
Consider a block of memory (e.g. a matrix with integer numbers):
[Figure: a 4 × 6 integer matrix (Fortran column-major order), with one row of 6 elements highlighted in gray; consecutive elements of the row are 4 apart in memory.]
To specify the gray row (in Fortran order), use
MPI_Type_vector( count, blocklen, stride, old_datatype, new_datatype, ierr)
MPI_Type_commit (new_datatype, ierr)
Stout and Jablonowski – p. 94/324
MPI Datatype: Vector
In the example, we get
MPI_Type_vector( 6, 1, 4, MPI_INTEGER, my_vector, ierr)
MPI_Type_commit (my_vector, ierr)
The new datatype my_vector is a vector that contains 6 blocks, each of 1 integer, with a stride of 4 integers between blocks.
Here we introduce the Fortran notation of the MPI routines (with the additional error flag "ierr").
The Fortran, C and C++ notations are very similar.
Stout and Jablonowski – p. 95/324
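For comparison, a C sketch of the same idea (illustrative; sending a matrix row to another rank is an assumed use, not taken from the slides):

#include <mpi.h>

#define NROWS 4
#define NCOLS 6

/* Send row r of an NROWS x NCOLS matrix stored in column-major order
   (so consecutive row elements are NROWS apart in memory), matching the
   Fortran example: 6 blocks of 1 int each, with a stride of 4 ints. */
void send_row(int a[NROWS * NCOLS], int r, int dest, MPI_Comm comm)
{
    MPI_Datatype row_type;

    MPI_Type_vector(NCOLS, 1, NROWS, MPI_INT, &row_type);
    MPI_Type_commit(&row_type);

    MPI_Send(&a[r], 1, row_type, dest, 0, comm);

    MPI_Type_free(&row_type);
}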
Some Additional MPI Features
Procedures for creating virtual topologies, e.g., indexing processors as a 2-dimensional grid.
User-created communicators (e.g., replacing MPI_COMM_WORLD), useful for selective collective communication (e.g., summing along rows of a matrix) and for incorporating software developed separately.
Support for heterogeneous systems; MPI converts the basic datatypes.
Additional user-specified derived datatypes.
Stout and Jablonowski – p. 96/324
MPI-2
The MPI-2.1 standard was just approved by the MPI Forum on September 4, 2008; it updates the MPI-2.0 standard from 1997. Important added features in MPI-2.x include
Parallel I/O: Critical for scalability of I/O-intensive problems.
One-sided communication: Essentially “put” and “get” operations that can greatly improve efficiency on some codes. Conceptually these are the same as directly accessing remote memory.
However, these are risky and can easily introduce race conditions.
Stout and Jablonowski – p. 97/324
One-Sided Communication
[Diagram: a processor's Put and Get operations directly access the memory of another processor.]
Stout and Jablonowski – p. 98/324
MPI Summary
The MPI standard includes
point-to-point message-passing
collective communications
group and communicator concepts
process topologies (e.g. graphs)
environmental management (e.g. timers, error handling)
process creation and management
one-sided communications
external interfaces
parallel I/O routines
profiling interface
Stout and Jablonowski – p. 99/324
PARALLELIZATION I
Real code is long and complex. How do we engineer the parallelization process?
Usually there is a (perhaps vague) performance goal, not, per se, a parallelization goal.
Stout and Jablonowski – p. 100/324
Overview of Approach
Incremental approach: tackle a bit of the problem at a time so that one can recover from mistakes and poor attempts.
Verify: Develop test cases and constantly check results.
Profile: to determine where the time is being spent. May be coupled with modeling of the code to determine where effort will yield the most reward.
Check-point/restart: Aids testing and debugging, since some problems only occur late in the program execution.
Stout and Jablonowski – p. 101/324
Serial Performance
Often profiling reveals serial performance problems — eliminating these may be critical to attaining the performance goals.
Doubling serial performance is far more useful than doubling the number of processors.
If possible, exploit parallel (or serial) libraries, since they are usually highly tuned for the target machine.
Stout and Jablonowski – p. 102/324
Incremental Parallelization
On shared-memory machines, one can often incrementally parallelize and increase efficiency. Portions not parallelized will slow the program but will at least be correct. This is a major advantage of shared memory over distributed memory.
Some benefits of this approach:
Smaller changes make it easier to locate mistakes.
It is easier to determine where efficiency is poor.
One should have test cases available and constantly verify correctness.
Stout and Jablonowski – p. 103/324
One continues incrementally until the desired speedup is attained or it has been determined that the original goal is impractical. This is a straightforward effort/reward tradeoff, but it is rarely carefully considered.
If performance is critical, then the final shared memory code is often very similar to distributed memory code.
Stout and Jablonowski – p. 104/324
Parallelization Process
[Flowchart: Set Goals → Analyze, profile → Prioritize Changes → Incrementally change → Verify correctness → Performance acceptable? If No, repeat the cycle; if Yes, the code is ready for use.]
Stout and Jablonowski – p. 105/324
Process for Distributed Memory
While more complicated, an incremental approach can also be utilized for distributed memory machines.
It is harder to get started, but the basic approaches are similar. The first things one needs to do are:
Do coarse-grained profiling to determine the time consumed in the different sections of the program.
Develop maps of the major data structures and where they are used.
The profiling is used to prioritize the areas that need to be parallelized.
Stout and Jablonowski – p. 106/324
Parallelization Steps
Once the parallelization plan is ready, start parallelizing sections of code and data structures.
Initially, all processors have the complete standard serial data structures (global data structures).
As code and data structures are parallelized (local data structures), develop serial-parallel & parallel-serial conversion routines (scaffolding).
Verify correctness on test cases by showing
serial–parallel–serial = serial
for global data structures.
Profile to see if the efficiency of this piece is acceptable. If not, then develop a better alternative.
Stout and Jablonowski – p. 107/324
Incremental DM Parallelization
[Diagram: the serial code is converted incrementally; parallelized sections of code, each wrapped by serial-parallel and parallel-serial conversion routines, gradually replace sections of the remaining serial code.]
Stout and Jablonowski – p. 108/324
It is useful to retain the serial-parallel scaffolding (normally turned off), to help maintain the correspondence between the serial and parallel codes as they evolve.
This is probably a complex, important program, since it is worth the parallelization effort. Therefore software engineering concerns, such as life-cycle maintenance, are very important.
Stout and Jablonowski – p. 109/324
LOAD-BALANCING I
Here we address the question of how one goes about subdividing the computational domain among the processors. We introduce the basic techniques that are applicable to most programs, with some more advanced techniques appearing later.
Stout and Jablonowski – p. 110/324
Unbalanced Load
[Figure: bar chart of the workload per processor for processors 0–7, with the average marked; the loads are unequal.]
Which processor is the most important for parallel performance?
Stout and Jablonowski – p. 111/324
Domain and Functional Decomposition
Domain decomposition: Partition a (perhaps conceptual) space. Different processors do similar work on different pieces (quilting bee, teaching assistants for discussion sections, etc.)
Functional decomposition: Different processors work on different types of tasks (workers on an assembly line, sub-contractors on a project, etc.)
Functional decomposition rarely scales to many processors, so we’ll concentrate on domain decomposition.
Stout and Jablonowski – p. 112/324
Dependency Analysis
There is a dependency between A and B if the value of B depends upon A. B cannot be computed before A.
Dependencies control parallelization options.
Stout and Jablonowski – p. 113/324
Computational Dependencies
[Figure: computational dependencies shown on a grid with space on one axis and time on the other.]
Stout and Jablonowski – p. 114/324
Space and Time
Almost always:
Time or time-like variables and operations (signals, non-commutative operations, etc.) cannot be parallelized.
Space or space-like variables and operations (names, objects, etc.) can be parallelized.
Some operations can have both time-like and space-like properties. E.g., ATM transactions are usually to independent accounts (space-like), but ones to the same account must be done in order (time-like).
Stout and Jablonowski – p. 115/324
Load-Balancing Variety
Many different types of load-balancing problems:
static or dynamic,
parameterized or data dependent,
homogeneous or inhomogeneous,
low or high dimensional,
graph oriented, geometric, lexicographic, etc.
Because of this diversity, we need many different approaches and tools.
Stout and Jablonowski – p. 116/324
Complicating Factors
The objects being computed may not have a simple dependency pattern among themselves, making communication load-balancing difficult to achieve.
Objects may not have uniform computational requirements, and it may not initially be clear which ones need more time.
If objects are repeatedly updated (such as the elements in the crash simulation), the computational load of an object may vary over iterations.
Objects may be created dynamically and in an unpredictable manner, complicating both computational and communication load balance.
Stout and Jablonowski – p. 117/324
Static Decompositions
Here we will consider only static decompositions of the work, with dynamic decompositions discussed later. A variety of basic techniques are available, each suitable for a different range of problems.
Often just evenly dividing space among the processors yields an acceptable load balance, with acceptable performance if communication is minimized. This approach works even if the objects have varying computational requirements, as long as there are enough objects so that the worst processor is likely to be close to the average (law of large numbers).
Stout and Jablonowski – p. 118/324
Which Matrix Decomposition is Best?
Suppose the work at each position depends only on the value there and on nearby ones, with equivalent work at each position.
[Figure: two ways to split a matrix among 16 processors: a 4 × 4 arrangement of square blocks, which minimizes the boundary, and 16 strips, which minimizes the number of neighbors.]
Stout and Jablonowski – p. 119/324
Matrix Decomposition Analysis
Computation is proportional to area, so both are load balanced.
Squares minimize the bytes communicated (parallelization overhead), so they are generally better.
However: Recall that there is significant overhead in starting a message, especially on clusters, so for smaller matrices one may need to concentrate on the number, not the size, of messages, i.e., use strips.
Stout and Jablonowski – p. 120/324
Local vs. Global Matrices
If the serial code has matrix A[0 : n−1], and there are p DM processors with ranks 0 . . . p−1, then
each processor has matrix A[0 : nlocal−1], where nlocal = n/p
A[i] on the processor with rank r corresponds to A[i + r ∗ nlocal] in the original array
if A[i+1] and A[i−1] are used in the calculation of A[i] (i ≠ 0, n−1), then one would use A[−1 : nlocal] to add ghost cells
Stout and Jablonowski – p. 121/324
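A small sketch of this indexing in C (illustrative; the slides describe it for a generic array A):

#include <stdlib.h>

/* Each of p processors owns nlocal = n/p elements of a global array A[0..n-1].
   The local array gets one ghost cell at each end, so local index i runs from
   -1 to nlocal: a[0..nlocal-1] are owned, a[-1] and a[nlocal] hold copies of
   the neighbors' border elements. */
typedef struct {
    int     rank;      /* this processor's rank */
    int     nlocal;    /* n / p (assume p divides n) */
    double *storage;   /* nlocal + 2 entries, including the two ghosts */
    double *a;         /* a = storage + 1, so a[-1] and a[nlocal] are valid */
} LocalArray;

static LocalArray make_local(int rank, int n, int p)
{
    LocalArray la;
    la.rank    = rank;
    la.nlocal  = n / p;
    la.storage = calloc((size_t) la.nlocal + 2, sizeof(double));
    la.a       = la.storage + 1;
    return la;
}

/* Global index of local element i: i + rank * nlocal. */
static int global_index(const LocalArray *la, int i)
{
    return i + la->rank * la->nlocal;
}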
Linear Rank vs. 2-D Indices
To map processor ranks 0..15 to rows 0..3 and columns 0..3:
[Figure: a 4 × 4 grid of processors, with ranks 0–3 in the first row, 4–7 in the second, 8–11 in the third, and 12–15 in the fourth.]
For processor rank i, row_i = ⌊i/√p⌋ and col_i = i − row_i ∗ √p
Right: (row_i, col_i + 1), rank i + 1
Left: (row_i, col_i − 1), rank i − 1
Up: (row_i − 1, col_i), rank (row_i − 1) ∗ √p + col_i
Down: (row_i + 1, col_i), rank (row_i + 1) ∗ √p + col_i
MPI “virtual topologies” can do this for you.
Stout and Jablonowski – p. 122/324
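The same arithmetic in C (a sketch; it assumes p is a perfect square, as on the slide):

#include <math.h>

typedef struct { int row, col; } Coords;

/* Map a linear rank to 2-D indices on a sqrt(p) x sqrt(p) grid, and back. */
static Coords rank_to_coords(int rank, int p)
{
    int side = (int) lround(sqrt((double) p));   /* assumes p is a perfect square */
    Coords c = { rank / side, rank % side };
    return c;
}

static int coords_to_rank(Coords c, int p)
{
    int side = (int) lround(sqrt((double) p));
    return c.row * side + c.col;
}

/* Neighbors of rank i (no boundary checks shown):
   right = i + 1, left = i - 1,
   up    = coords_to_rank((Coords){ c.row - 1, c.col }, p),
   down  = coords_to_rank((Coords){ c.row + 1, c.col }, p).
   In MPI, MPI_Cart_create and MPI_Cart_shift provide the same mapping. */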
Graph Decompositions
Very general graph decomposition techniques can be used when communication patterns are less regular.
Objects (calculations) are represented as vertices (with weights if the calculation requirements are uneven).
Communication is represented as edges (with weights if the communication requirements are uneven).
Goals:
1. assign vertices to processors to evenly distribute thenumber/weight of vertices, and
2. minimize and balance the number/weight of edgesbetween processors.
Stout and Jablonowski – p. 123/324
What is Best Decomposition?
[Figure: a graph whose vertices and edges carry weights between 1 and 6; the question is how best to partition it among processors.]
Stout and Jablonowski – p. 124/324
Graph Decomposition Tools
Unfortunately, optimal graph decomposition is NP-hard.
Fortunately, various heuristics work well, and high-quality decomposition tools are available, such as Metis.
To use a serial tool such as Metis, convert the data into the format it requires, run Metis to partition the graph vertices, then convert back to the format your program requires.
Scripts (Perl, Python, etc.) are useful to convert formats.
A parallel version, ParMetis, is also available.
http://www.cs.umn.edu/~karypis/metis/metis.html
Stout and Jablonowski – p. 125/324
Using Serial Decomposition Tool
[Diagram: Problem → Convert → graph as sparse matrix → Metis → partitioned graph → Convert → program input format → Parallel Program.]
Stout and Jablonowski – p. 126/324
Where Do Weights Come From?
If weights are static and objects of the same type have about the same requirements, and if the types are known in advance, then:
Sometimes all the same.
Sometimes easy to deduce a priori.
May use simple measurements on small test cases.
May use statistical curve fitting on sample problems.
If types aren’t known in advance, this won’t be useful.
Stout and Jablonowski – p. 127/324
Static Geometric Decompositions
When the objects have an underlying geometrical basis, such as the finite elements representing surfaces of car parts, stars in a galaxy, wires in a VLSI layout, or polygons representing census blocks in a geographical information system, then the geometry can often be exploited
if communication predominantly involves nearby objects.
Geometric decompositions can be based on k-D trees, quad- or oct-trees, ham sandwich theorems, space-filling curves, etc., and can incorporate weights.
Stout and Jablonowski – p. 128/324
Recursive Bisectioning
Magenta points require twice as much work as cyan ones.
Stout and Jablonowski – p. 129/324
Recursive Bisectioning
Split work evenly along the x-axis (weighted median).
Stout and Jablonowski – p. 130/324
Recursive Bisectioning cont.
Split each side along y-axis, using median on that side.
Stout and Jablonowski – p. 131/324
Recursive Bisectioning cont.
Now split along x-axis (or z-axis if data 3-dimensional).
Stout and Jablonowski – p. 132/324
Recursive Bisectioning cont.
Cycle through axes until # pieces = # processors.
Stout and Jablonowski – p. 133/324
Recursive Bisectioning cont.
One may decide to use only 1 or 2 dimensions to split along, similar to the strip partitioning for matrices.
Closely related to the k-D tree serial data structure.
Stout and Jablonowski – p. 134/324
Space-Filling Curves
The best general-purpose geometric load-balancing comes from space-filling curves.
The order in which points are visited by the space-filling curve determines how the geometric objects are grouped together to be assigned to the processors.
Stout and Jablonowski – p. 135/324
The Hilbert Space-Filling Curve
[Figure: the Hilbert space-filling curve on an 8 × 8 grid, visiting the cells in order 0–63.]
For an implementation, see the references.
Stout and Jablonowski – p. 136/324
Using A Space-Filling Curve
Letters represent work; boldface letters represent twice as much work.
[Figure: objects labeled A–Z scattered over a two-dimensional domain.]
Stout and Jablonowski – p. 137/324
Step 1: Determine Space-Filling Coordinates
[Figure: each object A–Z is assigned the Hilbert space-filling curve coordinate of its grid cell.]
Stout and Jablonowski – p. 138/324
Step 2: Sort by Space-Filling Coordinates
[Figure: the objects sorted by their space-filling curve coordinates, giving the order X Q W Z Y U T S V R K N M L E D C B F J O I H P G A.]
Stout and Jablonowski – p. 139/324
Step 3: Divide Work Evenly Based on Sorted Order
[Figure: the sorted order is divided into consecutive pieces of approximately equal total work, one piece per processor.]
Stout and Jablonowski – p. 140/324
Z-Ordering
Aka Morton or shuffled bit ordering. For 2-D, the point (x2x1x0, y2y1y0) is mapped to y2x2y1x1y0x0.
[Figure: the Z-ordering (Morton order) on an 8 × 8 grid, visiting the cells in order 0–63 along nested Z shapes.]
For 3-D, (xk...x1x0, yk...y1y0, zk...z1z0) → zkykxk...z1y1x1z0y0x0
Stout and Jablonowski – p. 141/324
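A small C sketch of the 2-D bit interleaving (illustrative, for 16-bit coordinates):

#include <stdint.h>
#include <stdio.h>

/* Interleave the bits of (x, y) into a Morton (Z-order) key:
   bit i of x goes to bit 2i, bit i of y goes to bit 2i+1,
   matching the y2x2y1x1y0x0 pattern above. */
static uint32_t morton2d(uint16_t x, uint16_t y)
{
    uint32_t key = 0;
    for (int i = 0; i < 16; i++) {
        key |= (uint32_t)((x >> i) & 1u) << (2 * i);
        key |= (uint32_t)((y >> i) & 1u) << (2 * i + 1);
    }
    return key;
}

int main(void)
{
    /* x = 3 (011), y = 5 (101): y2x2 y1x1 y0x0 = 10 01 11 = 39 */
    printf("morton2d(3, 5) = %u\n", morton2d(3, 5));
    return 0;
}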
Hilbert vs. Z
Both extend to arbitrary dimensions.
Both give regions with boundary (communication) within a constant factor of optimal.
Hilbert ordering assigns only 1 contiguous region to a processor; Z-ordering may assign 2.
Z is slightly easier to compute than Hilbert.
Hilbert can be used for the surface of a cube or sphere; Z doesn’t seem to be as useful.
In practice, there is little difference in performance.
Stout and Jablonowski – p. 142/324
High-Dimensional Data
For high dimensions, Hilbert ordering requires extensive memory to store the tables used to compute the index.
However, this is often not relevant, since
geometric approaches are not nearly as useful on high-dimensional data.
Stout and Jablonowski – p. 143/324
Shared Memory Parallelization
Parallel programming on shared memory (SM) machines has always been important in high performance computing.
All processors can access all the memory in the parallel system (access times can differ).
In the past: utilization of such platforms has never been straightforward for the programmer.
Vendor-specific solutions via directive-based compiler extensions dominated until the mid 90’s.
Also: data parallel extensions to Fortran 90 and High Performance Fortran (HPF), but these lacked efficiency.
Stout and Jablonowski – p. 144/324
Parallelization Techniques: OpenMP
Since 1997: OpenMP is the new industry standard for shared memory programming.
In 2008: The OpenMP Version 3.0 specification was released (new feature: task parallelism).
OpenMP is an Application Program Interface (API): it directs multi-threaded shared memory parallelism ⇒ thread-based parallelism.
Explicit (not automatic) programming model: the programmer has full control over the parallelization; the compiler interprets the parallel constructs.
Based on a combination of compiler directives, library routines and environment variables.
OpenMP uses the fork-join model of parallel execution.
Stout and Jablonowski – p. 145/324
OpenMP
OpenMP can be interpreted by most commercial Fortran and C/C++ compilers, supports all shared-memory architectures including Unix and Windows platforms, and hence
should be your programming system of choice for shared memory platforms.
OpenMP home page and recommended online tutorial:
http://www.openmp.org
http://www.llnl.gov/computing/tutorials/openMP/
Stout and Jablonowski – p. 146/324
Goals of OpenMP
Standardization: standard among all shared memory architectures and hardware platforms.
Lean: simple and limited set of compiler directives for shared memory machines. Often significant parallelism is obtained by using just 3-4 directives.
Ease of use: supports incremental parallelization of a serial program, unlike MPI which typically requires an all-or-nothing approach.
Portability: supports Fortran (77, 90, 95), C (C90, C99) and C++.
Stout and Jablonowski – p. 147/324
OpenMP: 3 Building Blocks
Compiler directives (embedded in user code) for
parallel regions (PARALLEL)
parallel loops (PARALLEL DO)
parallel sections (PARALLEL SECTIONS)
parallel tasks (TASK)
sections to be done by only one processor (SINGLE)
synchronization (BARRIER, CRITICAL, ATOMIC, locks, etc.)
data structures (PRIVATE, SHARED, REDUCTION)
Run-time library routines (called in the user code) like OMP_SET_NUM_THREADS, OMP_GET_NUM_THREADS, etc.
UNIX environment variables (set before program execution) like OMP_NUM_THREADS, etc.
Stout and Jablonowski – p. 148/324
OpenMP: The Fork-Join Model
Parallel execution is achieved by generating threads which are executed in parallel (multi-threaded parallelism):
[Diagram: a master thread runs serially, FORKs a team of threads at a parallel region, the threads JOIN at the end of the region, and the pattern repeats for the next parallel region.]
Stout and Jablonowski – p. 149/324
OpenMP: The Fork-Join Model
The master thread executes sequentially until the first parallel region is encountered.
FORK: The master thread creates a team of threads which are executed in parallel.
JOIN: When the team members complete the work, they synchronize and terminate. The master thread continues sequentially.
The number of threads is independent of the number of processors.
Quiz: What happens if
# threads or tasks > # processors
# threads or tasks < # processors
Stout and Jablonowski – p. 150/324
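A minimal C illustration of the fork-join model (a sketch, not code from the slides):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    printf("master thread, serial part\n");

    /* FORK: a team of threads executes the parallel region. */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();          /* 0 .. num_threads-1 */
        printf("hello from thread %d of %d\n", id, omp_get_num_threads());
    }   /* JOIN: implied barrier; only the master thread continues. */

    printf("master thread again, serial part\n");
    return 0;
}

The number of threads can be chosen, e.g., with the OMP_NUM_THREADS environment variable, independently of the number of processors.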
OpenMP: Work-sharing Constructs
DO/for loops: a type of “data parallelism”
SECTION: breaks work into independent sections that are executed concurrently by a thread (“functional parallelism”); units of work are statically defined at compile time
TASK: breaks work into independent tasks that are executed asynchronously in the form of dynamically generated units of work (“irregular parallelism”)
SINGLE: serializes a section of the code. Useful for sections of the code that are not threadsafe (I/O).
OpenMP recognizes compiler directives that start with
!$OMP (in Fortran)
#pragma omp (in C/C++)
Stout and Jablonowski – p. 151/324
OpenMP: Work-sharing Constructs
[Diagram: for each work-sharing construct (DO/for loop, SECTIONS, SINGLE) the master thread forks a team of threads, the team shares the work, and the threads join again.]
There is no barrier upon entry to these constructs, but an implied barrier (synchronization) at the end of each ⇒ the functionality of the OpenMP directive !$OMP BARRIER.
Stout and Jablonowski – p. 152/324
Parallel Loops (1)
⇒ in Fortran notation
!$OMP PARALLEL DO
DO i = 1, n
   a(i) = b(i) + c(i)
END DO
!$OMP END PARALLEL DO
Stout and Jablonowski – p. 153/324
Parallel Loops (2)
Each thread executes a part of the loop.
By default, the work is evenly and contiguously divided among the threads ⇒ e.g. with 2 threads:
thread 1 works on i = 1 . . . n/2
thread 2 works on i = (n/2 + 1) . . . n
The work (number of iterations) is statically assigned to the threads upon entry to the loop.
The number of iterations cannot be changed during the execution.
Implicit synchronization at the end, unless the “NOWAIT” clause is specified.
Highly efficient, low overhead.
Highly efficient, low overhead.
Stout and Jablonowski – p. 154/324
Parallel Sections (1)
⇒ in Fortran notation
!$OMP PARALLEL SECTIONS
!$OMP SECTION
DO i = 1, n
   a(i) = b(i) + c(i)
END DO
!$OMP SECTION
DO i = 1, k
   d(i) = e(i) + e(i-1)
END DO
!$OMP END PARALLEL SECTIONS
Stout and Jablonowski – p. 155/324
Parallel Sections (2)
The two independent sections can be executedconcurrently by two threads.
Units of work are statically defined at compile time.
Each parallel section is assigned to a specific thread, which executes the work from start to finish.
A thread cannot suspend the work.
Implicit synchronization unless the “NOWAIT” clause is specified.
Nested parallel sections are possible, but
can be costly due to the high overhead of parallel region creation,
are difficult to load balance, with possibly unneeded synchronization;
therefore: impractical.
Stout and Jablonowski – p. 156/324
Parallel Tasks (1)
Main change in OpenMP 3.0 (May 2008)
Allows one to parallelize irregular problems like
unbounded loops (e.g. while loops)
recursive algorithms
Unstructured parallelism
Dynamically generated units of work
A task can be executed by any thread in the team, in parallel with others
Execution can be immediate or deferred until later
Execution might be suspended and continued later by the same or a different thread
Stout and Jablonowski – p. 157/324
Parallel Tasks (2)
Example: Pointer chasing in C notation
#pragma omp parallel
{
   #pragma omp single
   {
      p = listhead;
      while (p) {
         /* create a task for each element of the list */
         #pragma omp task
         process(p);   /* process the list element p */
         p = next(p);
      }
   }
}
Stout and Jablonowski – p. 158/324
Parallel Tasks (3)
The SINGLE construct ensures that only one thread traverses the list.
A single thread encounters the task directive and invokes the independent tasks.
The “Task” construct gives more freedom for scheduling; it can replace loops with if statements that are not well load-balanced.
Parallel tasks can be nested within parallel loops or sections.
Stout and Jablonowski – p. 159/324
Parallel Loops and Scope of Variables
Parallel DO loops (“for” loops in C/C++) are often the most important parallel construct.
The iterations of a loop are shared across the team (threads).
A parallel DO construct can have different clauses like REDUCTION.
sum = 0.0
!$OMP PARALLEL DO REDUCTION(+:sum)
DO i = 1, n
   sum = sum + a(i)
END DO
!$OMP END PARALLEL DO
Stout and Jablonowski – p. 160/324
Parallel Loops and Load Balancing
Example of a parallel loop with dynamic load-balancing:
!$OMP PARALLEL DO PRIVATE(i,j), SHARED(x,n),
!$OMP& SCHEDULE(DYNAMIC,chunk)
DO i = 1, n
   DO j = 1, i
      x(i) = x(i) + j
   END DO
END DO
!$OMP END PARALLEL DO
Stout and Jablonowski – p. 161/324
Parallel Loops and Load Balancing
Iterations are divided into pieces of size chunk.
When a thread finishes a piece, it dynamically obtains the next set of iterations.
DYNAMIC scheduling improves the load balancing; default: STATIC.
Tradeoff: Load Balancing and Overhead
The larger the chunk, the lower the overhead.
The smaller the size (granularity), the better the dynamically scheduled load balancing.
Stout and Jablonowski – p. 162/324
New in OpenMP 3.0: Loop Collapsing
Loops can be collapsed via the clause COLLAPSE
!$OMP PARALLEL DO COLLAPSE(2)
DO k = 1, p
   DO j = 1, m
      DO i = 1, n
         x(i,j,k) = i*j + k
      END DO
   END DO
END DO
!$OMP END PARALLEL DO
Stout and Jablonowski – p. 163/324
Loop Collapsing
The iteration space from the two loops is collapsed into a single one
Good if
loops k and j do not depend on each other (no recursions)
execution order can be interchanged
loop limits p and m are small, #processors is large
Rules:
perfectly nested loops (j loop immediately follows k loop)
rectangular iteration space (m independent of p)
Stout and Jablonowski – p. 164/324
Quiz: Is there something wrong?
Assume: 4 parallel shared memory threads, all arrays and variables are initialized.
! start the parallel region
!$OMP PARALLEL PRIVATE(pid), SHARED(a,b,n)
! get the thread number (0..3)
pid = OMP_GET_THREAD_NUM()
! parallel loop
!$OMP DO PRIVATE(i)
DO i = 1, n
   A(pid) = A(pid) + B(i)   ! compute
END DO
!$OMP END DO
! end the parallel region
!$OMP END PARALLEL
Stout and Jablonowski – p. 165/324
False Sharing Example
Suppose you have P shared memory processors, with pid = 0 ... P-1
Each processor runs the Fortran code:
DO i = 1, n
   A(pid) = A(pid) + B(i)
END DO
No read nor write (load and store) conflicts, since no two processors read or write the same element, but:
Performance is horrible!
Stout and Jablonowski – p. 166/324
False Sharing Example
Reason:
Several consecutive elements of A are stored in the same cache line.
In each iteration, each processor gets an exclusive copy of the entire cache line to write to; all other processors must wait.
B is read-only, so sharing it is not a problem.
⇒ Can be avoided by declaring A(c,0:P-1), where c elements equal 1 cache line, and using A(1,pid).
False sharing is usually obvious once pointed out, but very easy to write in and overlook. Avoid!
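A minimal C sketch of the same padding idea (the Fortran A(c,0:P-1) above); the 64-byte cache line size, thread count, and variable names are illustrative assumptions, not part of the original example:

#include <omp.h>
#include <stdio.h>

#define NTHREADS   8
#define CACHE_LINE 64                      /* assumed cache line size in bytes */
#define PAD (CACHE_LINE / sizeof(double))  /* doubles per cache line */

static double partial[NTHREADS][PAD];      /* one padded slot (cache line) per thread */
static double b[100000];

int main(void)
{
    int n = 100000;
    #pragma omp parallel num_threads(NTHREADS)
    {
        int pid = omp_get_thread_num();
        partial[pid][0] = 0.0;
        #pragma omp for
        for (int i = 0; i < n; i++)
            partial[pid][0] += b[i];       /* each thread writes only its own cache line */
    }
    double sum = 0.0;
    for (int t = 0; t < NTHREADS; t++)
        sum += partial[t][0];
    printf("sum = %g\n", sum);
    return 0;
}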
Stout and Jablonowski – p. 167/324
False Sharing Example
[Figure: with the 1D layout A(0:P-1) all elements share one cache line, causing cache conflicts; with the 2D layout A(c,0:P-1), where c elements fill one cache line, each processor writes to a different cache line (not shared).]
Stout and Jablonowski – p. 168/324
Race Conditions
In a shared memory system, one common cause of errors is when a processor reads a value from a memory location that has not yet been updated.
This is a race condition, where correctness depends on which processor performed its action first.
Often hard to debug because the debugger often runs the program in a serialized, deterministic ordering.
To ensure that "readers" do not get ahead of "writers", process synchronization is needed.
DM systems: messages are often used to synchronize, with readers blocking until the message arrives.
Shared memory systems: barriers, software semaphores, locks or other schemes are used.
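A small, hedged C/OpenMP sketch of one such scheme: the implied barrier at the end of the single construct keeps the "readers" behind the "writer". The function and variable names are hypothetical:

#include <omp.h>
#include <stdio.h>

static double shared_val = 0.0;
static double results[64];                    /* assumes at most 64 threads */

static double expensive_compute(void) { return 42.0; }   /* stand-in for the writer's work */

int main(void)
{
    #pragma omp parallel shared(shared_val)
    {
        #pragma omp single
        shared_val = expensive_compute();     /* one "writer" updates the value */

        /* implied barrier at the end of single: no thread reads too early */

        int tid = omp_get_thread_num();
        results[tid] = shared_val * tid;      /* now every "reader" sees the update */
    }
    printf("%g\n", results[0]);
    return 0;
}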
Stout and Jablonowski – p. 169/324
Race Condition Example
Two PARALLEL SECTIONS :
!$OMP PARALLEL SECTIONS
!$OMP SECTION
A = B + C
!$OMP SECTION
B = A + C
!$OMP END PARALLEL SECTIONS
Unpredictable results since the execution order matters.
Program will not fail: Wrong answers without a warning signal!
Stout and Jablonowski – p. 170/324
OpenMP: Traps
OpenMP is a great way of writing fast executing code, and your gateway to special painful errors.
OpenMP threads communicate by sharing variables.
Variable Scoping: Most difficult part of shared memory parallelization
Which variables are shared
Which variables are private
If using libraries: Use the threadsafe library versions.
Avoid sequential I/O (especially when using a single file) in a parallel region: Unpredictable order.
Stout and Jablonowski – p. 171/324
OpenMP: Traps
Common problems are:
False sharing: Two or more processors access different variables that are located in the same cache line. At least one of the accesses is a "write", which invalidates the entire cache line.
Race condition: The program's result changes when threads are scheduled differently.
Deadlock: Threads lock up waiting for a locked resource that will never become available.
Stout and Jablonowski – p. 172/324
Something to think about over the break
Question: How would you distribute the work in a climate model?
[Figure: latitude-longitude grid of the globe, spanning from the South Pole over the Equator to the North Pole.]
Stout and Jablonowski – p. 173/324
HYBRID COMPUTING
Many of today's most powerful computers employ both shared memory (SM) and distributed memory (DM) architectures.
These machines are so-called hybrid computers.
The corresponding hybrid programming model is a combination of shared and distributed memory programming (e.g. OpenMP and MPI).
Today: hybrid architectures are dominant at the high end of computing.
In the future: the hybrid memory architecture is likely to prevail despite popular DM machines like IBM's "Blue Gene".
Stout and Jablonowski – p. 174/324
Memory Systems: Distributed Memory
All memory is associated with processors.
To retrieve information from another processor's memory a message must be sent over the network.
Advantages:
Memory is scalable with the number of processors
Each processor has rapid access to its own memory without interference or cache coherency problems
Cost effective: can use commodity parts
Disadvantages:
Programmer is responsible for many of the details of the communication
May be difficult to map the data structure
Non-uniform memory access (NUMA)
Stout and Jablonowski – p. 175/324
Memory Systems: Shared Memory
Global memory space, accessible by all processors
Memory space may be all real or may be virtual
Consistency maintained by hardware, software or user
Advantages:
Global address space is user-friendly, the algorithm may use global data structures efficiently
Data sharing between tasks is fast
Disadvantages:
Possible lack of scalability between memory and CPUs. Adding more CPUs increases traffic on the shared memory - CPU path
User is responsible for correct synchronization
Stout and Jablonowski – p. 176/324
Hybrid Memory Architecture
The shared memory component is usually a cache coherent (CC) SMP node with either uniform (CC-UMA) or non-uniform memory access (CC-NUMA)
CC: If one processor updates a variable in shared memory, all the other processors on the SMP node know about the update.
The distributed memory component is a cluster of multiple SMP nodes.
SMP nodes can only access their own memory, not the memory on other SMPs.
Network communication is required to move data from one SMP node to another.
Stout and Jablonowski – p. 177/324
Hybrid Memory Architecture
[Figure: four SMP nodes, each containing several CPUs and a local memory, connected by a network.]
CPU: single-core or multi-core technology possible
Multi-core (dual- or quad-core) chips common, even in laptops
Typical: Several multi-core chips form an SMP node.
Stout and Jablonowski – p. 178/324
Multi-Cores and Many-Cores
General trend in processor development: multi-core to many-core with tens or even hundreds of cores
Advantages:
Cost advantage.
Proximity of multiple CPU cores on the same die, signal travels less, high CC clock rate.
Disadvantages:
More difficult to manage thermally than a lower-density single-chip design.
Needs software (e.g. OS, commercial) support.
Multi-cores share system bus and memory bandwidth: limits performance gain. E.g. if the single-core version is bandwidth-limited, the dual core is only 30%-70% more efficient.
Stout and Jablonowski – p. 179/324
Dual Level Parallelism
Often: Applications have two natural levels of parallelism. Take advantage of it and exploit the shared memory parallelism by using OpenMP on an SMP node. Why?
MPI performance degrades when
domains become too small
message latency dominates computation
parallelism is exhausted
OpenMP
typically has lower latency
can maintain speedup at finer granularity
Drawback:
Programmer must know MPI and OpenMP
Code might be harder to debug, analyze and maintain
Stout and Jablonowski – p. 180/324
Hybrid Programming Model
Combination of distributed and shared memory programming models, e.g.:
MPI and OpenMP
MPI and High Performance Fortran (HPF)
MPI and POSIX Threads
Most important: MPI and OpenMP
Many MPI processes
Each MPI process is assigned to a different SMP node
Explicit message passing between the nodes
Shared memory parallelization within an SMP node
Each MPI process is therefore a multithreaded OpenMP process
Can give better scalability than pure MPI or OpenMP
Stout and Jablonowski – p. 181/324
Hybrid Programming Strategy
Decompose the computational domain
Most often: Domain decomposition
Alternatively: Functional decomposition
Distribute the partitions among the SMP nodes (coarse grain parallelism).
Use MPI to communicate the ghost regions or interfaces of each partition.
Add OpenMP for loop-level parallelism within a partition on the SMP node (fine grain parallelism).
Let one OpenMP thread speak for all.
Stout and Jablonowski – p. 182/324
Hybrid Programming Strategy
Recommended:
Limit MPI communication to the serial OpenMP part (outside a parallel region)
Let the master thread (serial OpenMP part) communicate via MPI messages, as in the sketch below.
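A minimal hybrid sketch along these lines: MPI calls stay in the serial (master-thread) part, while OpenMP parallelizes the node-local loop. The array size and the omitted ghost-region exchange are placeholders for illustration only:

#include <mpi.h>
#include <omp.h>

#define N 1000000

static double a[N], b[N];

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* ... MPI exchange of ghost regions would go here, outside any parallel region ... */

    /* fine grain parallelism within the SMP node */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b[i];

    /* collect a global result with MPI, again on the master thread only */
    double local = a[0], global;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}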
Stout and Jablonowski – p. 183/324
VECTOR PARALLEL COMPUTING
Principles behind vector parallel computing
Vector pipeline
Pipelining and modern scalar processors
Characteristics of vector computers
Load Balancing and Grid Partitioning Strategies
Graphics Processing Units (GPUs)
Stout and Jablonowski – p. 184/324
Vector Computers: Trend
What are the trends in high performance computing ?
Stout and Jablonowski – p. 185/324
Vector Computers: Trend
Worldwide: vector computers became less and less common over the last 15 years
In 2008: NEC and Cray remain in this market
Powerful vector architecture:
41 TFlop/s NEC SX-6 system (peak performance): Earth Simulator (Japan, #49 TOP500 list in 6/2008, #20 in 6/2007, #1 from 2002-2004)
NEC SX-9, theoretical peak performance 839 TFlop/s
Cray XT5h (newest installation in Edinburgh), hybrid architecture with X2 vector processing node
Extreme sustained performance: the Earth Simulator system reaches approx. 90% of its peak performance (Linpack benchmark)
Stout and Jablonowski – p. 186/324
Vector Processing - Pipelining Principle
Principle: Split an operation into independent parts & execute them concurrently in specialized pipelines
Example: Add pipeline
DO I = 1, 1000
   C(I) = A(I) + B(I)
ENDDO
Independent steps:
compare and normalize exponents
add mantissae
normalize result
error handling (overflow/underflow)
Stout and Jablonowski – p. 187/324
Vector Pipelines: Example (cont.)
[Figure: pipeline timing diagram in which operands 1-5 stream through the stages "compare exponents", "add mantissae", "normalize results" and "check errors".]
Two phases:
Startup phase (fill the pipeline)
Streaming phase (1 result per clock cycle)
Stout and Jablonowski – p. 188/324
Vector Processing - Principles
SIMD principle: One instruction works on a data stream (vector).
Vector: A vector consists of
data that lie consecutively in memory (ideal case)
data with constant stride
data with random access (gather & scatter operations)
Pitfall: Non-consecutive memory accesses can lead to memory bank conflicts and performance losses.
Stout and Jablonowski – p. 189/324
Principles (cont.)
Pipelining: The functional units are divided into independent segments which work simultaneously.
Add pipeline
Multiply pipeline
Multifunctional pipeline, e.g. multiply and add
Logic pipeline
Load/Store pipeline
Instruction pipeline
Stout and Jablonowski – p. 190/324
Pipelines and Modern Scalar Processors
The pipelining principle: basis for all vector machines and GPUs.
But pipelines are also used in modern scalar processors ⇒ speed up execution
Examples:
IBM Power6 CPU: Floating point units (FPU) which can issue a combined multiply/add
a = b*c + c
Multi-functional hardware unit
In addition: data prefetch capabilities ("load pipeline")
Stout and Jablonowski – p. 191/324
Vector processing - Hardware differences
Scalar:
[Figure: scalar CPU connected to memory, exchanging scalar data and addresses.]
Stout and Jablonowski – p. 192/324
Vector processing - Hardware differences
Vector:
[Figure: vector CPU connected to interleaved memory banks 0 ... n-1 (Bank = mod(Addr., n)), exchanging vector and scalar data and addresses.]
Stout and Jablonowski – p. 193/324
Vector Processing - Features
The new hardware/software features are:
Vector unit: “co-processor” to scalar unit
Pipeline sets
Vector registers that provide data streams
Interleaving memory banks: quick memory access
(Often) no data cache for vector unit
Software & hardware interface: vector instructions
Vectorizing compiler
"Break Even Point" is hardware dependent (vector length that lets the vector unit outperform the scalar unit)
Stout and Jablonowski – p. 194/324
Vector Processing - Features
The performance of the vector unit depends on the vector length (number of operations):
[Figure: performance as a function of the number of operations n; the curve rises with n and saturates, with n1/2 marking the vector length at which half of the asymptotic performance is reached.]
In general: long vectors boost the performance
Startup time becomes negligible with increasing n
Stout and Jablonowski – p. 195/324
Load-Balancing & Grid Partitioning
Left: Fragmented 2D grid partitioning good for load-balancing, but short vectors
Right: good vectorization (long vectors), but possibly bad load-balancing properties
Stout and Jablonowski – p. 196/324
Load-Balancing & Grid Partitioning
The more processors run the simulation, the smaller the partitions and the smaller the vector length on each processor.
From a computational standpoint: the partitioning strategy on the left is well load-balanced (e.g. day/night sides in a weather model have different workloads and are well-distributed).
From a numerical performance standpoint: the distribution on the right is more efficient (longer vectors), but suffers from load imbalances.
⇒ In case of uneven workloads a balance must be found between long vectors and a fragmented load balancing strategy.
Stout and Jablonowski – p. 197/324
Parallel Vector Computing
Parallel vector computers are powerful for scientific applications:
sustained performance can reach more than 30% of the peak performance
compare: on MPP machines approx. 10-20% of the peak performance is reached (optimistic)
Single processor performance on vector machines is a multiple of any scalar processor.
Computations need a smaller number of parallel vector processors
Advantageous if the application does not scale well to a large number of parallel CPUs
Stout and Jablonowski – p. 198/324
Parallel Vector Computing (cont.)
Parallel vector machines become most effective for large applications that require identical (arithmetic) instructions on streams of data.
The vector performance strongly depends on
Vector length: The longer, the more effective!
Data access: Consecutive data access outperforms indirect addressing and data with constant stride.
Number of operations: The more arithmetic operations can be performed at once, the more effective the vector unit (enables chaining).
Stout and Jablonowski – p. 199/324
Graphics Processing Units (GPUs)
Newest trend in high-performance computing.
Traditionally: the GPU is a dedicated graphics rendering device for a personal computer, workstation or game console.
GPUs have a parallel many-core architecture, each core capable of running thousands of threads simultaneously; they exploit SIMD fine-grain parallelism.
Highly parallel structure makes them more effective than general-purpose CPUs for a range of complex (highly specialized) algorithms.
Stout and Jablonowski – p. 200/324
Graphics Processing Units
Trend: Highly diverse computing platforms can include multi-cores, SMP nodes, graphics accelerators or classical vector units as co-processors for both thread-based and process-based parallelism.
GPUs are cheap: commodity co-processors produced in the millions.
Very fast, the first 1 TFlop/s GPU was out in February 2008
#1 computer on the Top 500 list: "Roadrunner" utilizes Cell processors as accelerators: IBM's Cell processor was originally designed for the Sony Playstation 3.
Stout and Jablonowski – p. 201/324
Stout and Jablonowski – p. 202/324
GPUs — Future?
Extremely difficult to use the hardware effectively.
For example: NVIDIA's GeForce GPU series is programmed in CUDA (Compute Unified Device Architecture): a compiler and set of development tools (variation of C).
Big question: What is the lifetime of these systems? Is it worth investing into user software?
Need robust hardware: Error trapping, IEEE compliance, hardware performance counters, circuit support for synchronizations.
Need robust compilers and programming standards.
Will it attract new sources of talent to supercomputing?
Stout and Jablonowski – p. 203/324
PARALLELIZATION II
Here we examine some of the more complicated aspects of successfully parallelizing large programs.
Stout and Jablonowski – p. 204/324
Problems Verifying Correctness
Proving parallel and serial programs equivalent is typically only possible if the parallelization is automated (such as a parallelizing compiler).
Thus usually resort to testing on selected inputs.
Sensitivity & efficiency at discovering errors can be magnified by examining intermediate results, rather than just final results.
Stout and Jablonowski – p. 205/324
However, problems remain:
Coverage: Need to test all program options.
Time: Some conditions only appear after a significant amount of computation.
Detection: Often a simple "diff" won't work; hard to differentiate between errors and roundoff caused by a changed order of arithmetic operations. Some users uncomfortable with slight machine variations.
Stout and Jablonowski – p. 206/324
Some Solutions
Coverage: Typically requires coordination with the application expert, and careful analysis. There are tools to check coverage.
Time: Checkpoint/restart can help. Also very useful for long-term maintenance and for production runs so that work is not lost if the system fails during a long run.
Detection: Use of IEEE arithmetic helps cross-platform comparisons. Also, by being careful one can ensure that the parallel and serial programs perform all calculations (such as summations) in the same order, but usually this lowers efficiency.
Stout and Jablonowski – p. 207/324
Performance Problems
Detailed profiling of the crash code showed that there were many places where efficiency was unacceptable.
Often cache utilization was very poor.
Load balance difficult due to heterogeneous elements with time-varying requirements.
Contact adds dynamic computational and communication imbalance.
Some of the collective communication routines were too slow.
I/O was substantial, and was often inefficient.
Stout and Jablonowski – p. 208/324
Profiling
Profiling proceeded in stages, identifying where efficiency was too low.
For each targeted section, profiled uniprocessor performance, such as cache misses.
Also profiled load imbalance and communication overhead, proceeding from smaller systems to larger ones (when needed).
Incremental approaches kept the amount of data collected at manageable levels.
Unfortunately, when we were doing this there were no standard tools, we had to build several. Situation much better now, discussed later.
Stout and Jablonowski – p. 209/324
Utilizing the Memory Hierarchy
Effective use of cache and locality often critical for achieving high performance.
Often uniprocessor performance can be doubled by restructuring data structures and computations to exploit cache, as in the blocking sketch below.
Unfortunately, many data structures and algorithms use pointers and indirect addressing, diminishing the ability of the compiler to optimize cache usage.
Later we'll describe a data structure (adaptive blocks) that addressed this.
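A small C sketch of one common restructuring, loop blocking, assuming a hypothetical tile size chosen to fit in cache; it illustrates the idea only and is not the crash code's actual data structure:

#define N  2048
#define BS 64     /* assumed tile size; tune so a BS x BS tile fits in cache */

static double A[N][N], B[N][N];

/* Transpose B into A in BS x BS tiles so both arrays are touched with good locality. */
void blocked_transpose(void)
{
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int i = ii; i < ii + BS; i++)
                for (int j = jj; j < jj + BS; j++)
                    A[i][j] = B[j][i];
}

int main(void)
{
    blocked_transpose();
    return 0;
}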
Stout and Jablonowski – p. 210/324
Cache Misses
Many programs have excessive loads and stores, causing cache misses which slow the program. Can often be reduced by rearranging the code and/or data structure.
For example, in Fortran
do i=1,n                     do j=1,n
   do j=1,n          vs         do i=1,n
      A(i,j)=A(i,j)+1              A(i,j)=A(i,j)+1
   enddo                        enddo
enddo                        enddo
For large arrays, which is faster, and why?
Stout and Jablonowski – p. 211/324
Utilizing the Compiler
For a well-structured program it should be possible for the compiler to generate good code — optimizing cache utilization, reducing instruction counts, etc. However, extensive optimization is not the default. Thus
Turn on appropriate compiler optimization options.
Usually the "O" option is important, but often others are needed as well. These affect data placement as well as code generation.
May need a guru to get the best combination of options for your program+machine combination.
Stout and Jablonowski – p. 212/324
LOAD-BALANCING REVISITED
We'll continue the discussion of load-balancing, looking at some more complicated problems.
Stout and Jablonowski – p. 213/324
Loop Dependencies
Recall that if the value of variable B depends upon the value of variable A, then there is a dependency between A and B.
Loops often introduce real, or apparent, dependencies.
For example,
do i=1,n
   V[i] = V[i] - 2*V[i-1]
enddo
The loop cannot be vectorized nor parallelized, because each value depends upon the value from the previous iteration.
Stout and Jablonowski – p. 214/324
To parallelize
do i=1,n
   V[i] = V[i] - 2*V[i+1]
enddo
need to copy V and use the copy to compute new values.
W = V
do i=1,n
   V[i] = W[i] - 2*W[i+1]
enddo
Stout and Jablonowski – p. 215/324
To parallelize
do i=1,n
   V[J[i]] = i
enddo
need to know if J is 1-1.
Some automatic parallelizers can handle the previous loop, but none can do this one without programmer assistance.
Stout and Jablonowski – p. 216/324
Time Troubles
Parallelization problems of time-like variables include:
Partial Differential Equations: Time is explicit.
Divide and Conquer: "Size" is similar to time. Subproblems may not be known in advance, and need to be generated in order.
Branch and Bound: Branching control is often serialized.
Discrete Event Simulation: Time is usually explicit, may be incremented adaptively, and subproblems are often not known in advance.
Depth-First Search: Search decisions made sequentially. Theoretical computer science: some versions of DFS are P-complete.
Stout and Jablonowski – p. 217/324
Not Everything is As Bad As It Seems
Some things look serial but can be easily parallelized.
Reduction
x ← 0
do i ← 0, n-1
   x ← x + a[i]
enddo
Scan or Parallel Prefix
y[0] ← a[0]
do i ← 1, n-1
   y[i] ← y[i-1] + a[i]
enddo
Stout and Jablonowski – p. 218/324
Parallelized Reduction Operations
Reduction and scan operations are extremely common.
They are recognized by parallelizing compilers and implemented in MPI and OpenMP.
They can be parallelized by using associativity of the combining operator (+ in this case), i.e.,
a + (b + c) = (a + b) + c
In some situations one also uses commutativity, i.e., a + b = b + a
Stout and Jablonowski – p. 219/324
Calculation Tree
[Figure: calculation tree for a[0] ... a[7]: four + operations combine pairs, two more combine the partial sums, and a final + produces the total.]
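A short C sketch of the combine pattern in the tree: pairwise sums over log2(n) rounds, relying only on associativity of +. The function name and the power-of-two length are assumptions made for brevity; the additions within each round are independent and could run in parallel:

#include <stdio.h>

/* Sums a[0..n-1] in place following the calculation tree; n assumed a power of two. */
double tree_reduce(double *a, int n)
{
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];          /* independent pairwise additions */
    return a[0];
}

int main(void)
{
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("%g\n", tree_reduce(a, 8));      /* prints 36 */
    return 0;
}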
Stout and Jablonowski – p. 220/324
Static Load Imbalance — Correlation
Suppose have digital image, need to determine types of vegetation on the island. Easy load-balance:
[Figure: image divided into a 4x4 grid of equal blocks assigned to processors 0 ... 15.]
Stout and Jablonowski – p. 221/324
However ...
If a pixel is water can quickly dismiss it, otherwise need to carefully analyze the pixel and neighbors.
Drat! We know the weights, but don't know where the easy or hard pixels are until we've started processing the image.
Especially problematic because large regions will be of one type or the other. Thus some processors will take much longer than others.
Stout and Jablonowski – p. 222/324
Scattered Decomposition
Used when there is a structured domain space (e.g., an image) and the processing requirements are clustered, such as modeling a crash or processing an image with only a few items of interest.
Suppose there are P processors. Cover the problem domain with non-overlapping copies of a grid of size P and assign each processor a cell in each of the grids.
Stout and Jablonowski – p. 223/324
Scattered Work
[Figure: the domain is covered by four non-overlapping 4x4 grids; each grid contains cells 0 ... 15, so every processor owns one scattered cell in each grid.]
Stout and Jablonowski – p. 224/324
How Much Scattering?
More pieces ⇒
⇓ load imbalance, i.e., ⇓ calculation time
⇑ overhead and/or communication time
Deciding a good tradeoff may require some timing measurements.
However, if nearby objects have uncorrelated computational requirements then this method is no better than standard decomposition, and adds overhead.
Stout and Jablonowski – p. 225/324
Overdecomposition
Scattered decomposition and its close relatives striping and round robin allocation are examples of a general principle:
Overdecomposition: break the task into more pieces than processors, assign many pieces to each processor.
Overdecomposition underlies several load-balancing and parallel computing paradigms.
However, there can be difficulties when synchronization is involved.
Stout and Jablonowski – p. 226/324
The (Teaching) Value of Coins
Task times are random variables, where the time is generated by flipping a coin until a head appears.
Your task times:
Class task times:
Your total:
Class total:
Slowest person’s total:
Stout and Jablonowski – p. 227/324
Synchronization and Imbalance
Suppose have p processors and n ≥ p tasks. Suppose tasks take time i with probability 2^(-i), and there is no way to tell in advance how long a task will take.
If each processor does 1 task and then waits for all processors to complete before going on to the next, the efficiency is low. In fact, the expected time per round grows as the log of the number of processors.
To improve efficiency, each processor needs to complete several tasks before synchronizing.
Stout and Jablonowski – p. 228/324
Geometric Task Times
No.     Efficiency          Tasks/Proc. to achieve
Proc.   1 Task per Proc.    Efficiency 0.8    0.9
   4       0.57065               10            46
  16       0.37193               30           137
  64       0.27233               53           243
 256       0.21423               78           355
1024       0.17647              103           468
Stout and Jablonowski – p. 229/324
Another Example
Tasks: 1 time unit with prob 0.9, 10 units with prob. 0.1
No.     Efficiency          Tasks/Proc. to achieve
Proc.   1 Task per Proc.    Efficiency 0.8    0.9
   4       0.46397               36           179
  16       0.22803              112           536
  64       0.19020              199           949
 256       0.19000              291          1384
1024       0.19000              385          1824
Stout and Jablonowski – p. 230/324
Note that one can keep the efficiency high by assigning many tasks per processor before synchronizing, but the number required grows with the number of processors.
Later we’ll see a technique to improve this situation.
Stout and Jablonowski – p. 231/324
Dynamic Data-Driven
For many data dependent problems dynamic versions also occur, such as
For PDEs an adaptive grid can be used instead of a fixed grid, allowing one to focus computations on regions of interest.
A simulation may track objects through a region.
Computational requirements of objects may change over time.
In such situations, some processors may become overloaded.
Stout and Jablonowski – p. 232/324
Must balance load and need to take locality of communication into account. Some options:
Locally adjust partitioning, such as moving a small region on the boundary of an overloaded processor to the processor containing the neighboring region.
Use a parallel rebalancing algorithm that takes current location into account (not standard).
Rerun the static load-balancing algorithm and redistribute work (ignores locality, but easier)
Warning: Need more complex data structures which can move pieces and keep track of neighbors, etc. These are difficult to program and debug.
Stout and Jablonowski – p. 233/324
Dynamic Graph Decomposition
One could rerun Metis at periodic intervals, or periodically measure some metric to determine if processor loads are too uneven, and if so then call Metis.
However, it is more efficient to use the ParMetis package, which runs in parallel.
Stout and Jablonowski – p. 234/324
Example: Dynamic Geometry
Adaptive blocks, useful for adaptive mesh refinement (AMR), dynamic geometric modeling. Grids broken into blocks of fixed extents; when needed, blocks are refined into children with the same extents. [Stout 1997, MacNeice et al. 2000]
[Figure: a block refined into child blocks of the same extents (refine), and the reverse operation (coarsen).]
Stout and Jablonowski – p. 235/324
Adaptive Block Properties
Whenever refine/coarsen occurs, must adjust pointers on all neighbors, no matter what processor they are on.
Using blocks, instead of cells, reduces the number of changes.
Same work per block, good work/communication ratio, so often just balancing blocks per processor suffices. If communication is excessive, use a space-filling curve.
In either case, rebalancing requires only simple collective communication operations to decide where blocks go.
Stout and Jablonowski – p. 236/324
Load-balancing Strategies
Example: Tracer transport problems with adaptive mesh refinement (AMR) techniques
Simple load-balancing algorithm:
Equal workload regardless of the location of the data
Advanced load-balancing algorithms:
Load-balancing with METIS
Load-balancing with a Space Filling Curve (SFC)
⇒ In the examples:
Each color represents a processor.
The amount of work in each box is the same.
Stout and Jablonowski – p. 237/324
Simple Load-balancing Strategy
[Figure/Movie: latitude-longitude plot (0-360 degrees longitude, 90S-90N latitude) showing the block-to-processor assignment of the simple load-balancing strategy.]
Stout and Jablonowski – p. 238/324
Simple Load-balancing Strategy cont.
Data distribution at model day 3:
[Figure: latitude-longitude plot of the data distribution across processors at model day 3.]
Stout and Jablonowski – p. 239/324
Simple Load-balancing Strategy cont.
Data distribution at model day 12:
[Figure: latitude-longitude plot of the data distribution across processors at model day 12.]
Stout and Jablonowski – p. 240/324
Dynamic Load-balancing with METIS
Movie
Courtesy of Dr. Joern Behrens, Alfred-Wegener-Institute, Bremerhaven, Germany
Stout and Jablonowski – p. 241/324
Dynamic Load-balancing with SFC
Movie
Courtesy of Dr. Joern Behrens, Alfred-Wegener-Institute, Bremerhaven, Germany
Stout and Jablonowski – p. 242/324
Comparison of Strategies
Relative behavior similar to the static load-balancing behavior. Very important that rebalance operations have low overhead since they will be done often.
Easiest strategy — just balance work/processor
might be sufficient if the application is dominated by computation, but not if communication is important
Load-balancing with METIS or ParMETIS
good load-balancing, decent communication reduction, applicable to many problems
Load-balancing with Space Filling Curves
for geometric problems usually the best choice
Stout and Jablonowski – p. 243/324
Dynamic, Data Driven, Min. Comm.
Sometimes work is created on the fly with little advance knowledge of tasks.
E.g., branch-and-bound generates dynamic partial solution trees where subproblem communication consists of maintaining a current best solution and seeing if a subproblem is already solved.
In such situations can maintain a queue of tasks (objects, subproblems) and assign to processors as they finish previous tasks (e.g., overdecomposition).
Stout and Jablonowski – p. 244/324
Example: Work Preassigned
Each processor is assigned 4 tasks.
Processor   Task Label/Time         Total
    1       a/5  b/1  c/1  d/4       11
    2       e/1  f/4  g/2  h/1        8
    3       i/2  j/1  k/5  l/1        9
    4       m/1  n/3  o/1  p/1        6
    5       q/1  r/1  s/2  t/2        6
    6       u/3  v/4  w/2  x/3       12
                                Max  12
Time required: 12.
Stout and Jablonowski – p. 245/324
Manager/Worker (Master/Slave) (prof/grad student)
[Figure: a manager holding the task queue assigns tasks to several workers; when a worker signals "task done" it requests another.]
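A minimal manager/worker sketch in MPI, under simplifying assumptions (one integer encodes a task, a message tag of 1 means stop, and there are at least as many tasks as workers); a real code would send task descriptions and results of meaningful size:

#include <mpi.h>

#define NTASKS 100      /* assumed >= number of workers */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                                   /* manager */
        int next = 0, result;
        MPI_Status st;
        for (int w = 1; w < size; w++) {               /* seed each worker with one task */
            MPI_Send(&next, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
            next++;
        }
        for (int done = 0; done < NTASKS; done++) {    /* one result per task */
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            int tag = (next < NTASKS) ? 0 : 1;         /* 0 = more work, 1 = stop */
            MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, tag, MPI_COMM_WORLD);
            if (tag == 0) next++;
        }
    } else {                                           /* worker */
        int task, result;
        MPI_Status st;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == 1) break;                /* stop signal */
            result = task * task;                      /* stand-in for the real work */
            MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}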
Stout and Jablonowski – p. 246/324
Work Assigned via Queue
Assign tasks a, b, c, ... to processors as the processor becomes available:
Processor   Time / task assigned
            1  2  3  4  5  6  7  8  9  10
    1       a  a  a  a  a  r  v  v  v  v
    2       b  g  g  k  k  k  k  k
    3       c  h  j  l  n  n  n  w  w
    4       d  d  d  d  o  s  s  x  x  x
    5       e  i  i  m  p  t  t
    6       f  f  f  f  q  u  u  u
Time: 10. Adaptive allocation can improve performance.
Stout and Jablonowski – p. 247/324
Work Assigned via Ordered Queue
Sort in decreasing order, assign to processors as they become available:
a k d f v n u x g i s t w b c e h j l m o p q r
Processor   Time / task assigned
            1  2  3  4  5  6  7  8  9
    1       a  a  a  a  a  s  s  e  o
    2       k  k  k  k  k  t  t  h  p
    3       d  d  d  d  x  x  x  j  q
    4       f  f  f  f  g  g  w  w  r
    5       v  v  v  v  i  i  b  l
    6       n  n  n  u  u  u  c  m
Time: 9. The more you know, the better you can do. Unfortunately, rarely have this information.
Stout and Jablonowski – p. 248/324
Queueing Costs
Single-queue multiple-servers (manager/workers) is the most efficient queue structure (e.g., airline check-in lines).
However, queuing imposes communication overhead, yet another tradeoff, now the cost of moving a task versus the cost of solving it where it is generated.
Parallel computing has too many “however”s!
However, if it was too easy, you wouldn’t need this tutorial
Stout and Jablonowski – p. 249/324
Queueing Bottleneck
Sometimes the manager is a bottleneck. Can ameliorate:
"Chunk" tasks to reduce overhead. May use large chunks initially, then decrease them near the end to fine-tune load balance.
Use distributed queues, perhaps with
multiple manager/worker subteams, with some communication between managers
every worker is also a manager, keeping some tasks and sending extras to others. Many variations on deciding when/where to send work.
Stout and Jablonowski – p. 250/324
OpenMP Load-Balancing
The previous descriptions had a distributed memory flavor, though they also work well for shared memory.
However, shared memory has additional options. OpenMP loop work-sharing constructs require little programmer effort. With the SCHEDULE option can specify
STATIC: simple, suitable if loop iterations take the same amount of time and there are enough per processor. For scattered decomposition, specify chunk size.
DYNAMIC: a queue of work, each processor gets chunksize iterations when ready.
GUIDED: dynamic queue with chunks of exponentially decreasing size.
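A small C/OpenMP sketch contrasting these choices on a loop whose iterations take uneven time; the work function and chunk size of 16 are hypothetical, and the schedule clause can be swapped between static, dynamic and guided:

#include <omp.h>
#include <math.h>
#include <stdio.h>

#define N 10000

/* Later iterations do more work, so a pure STATIC split would be imbalanced. */
static double uneven_work(int i)
{
    double s = 0.0;
    for (int j = 0; j < i; j++)
        s += sin((double)j);
    return s;
}

int main(void)
{
    double total = 0.0;
    /* schedule(dynamic,16): each thread grabs 16 iterations at a time from a queue;
       alternatives: schedule(static) or schedule(guided,16) */
    #pragma omp parallel for schedule(dynamic,16) reduction(+:total)
    for (int i = 0; i < N; i++)
        total += uneven_work(i);
    printf("%f\n", total);
    return 0;
}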
Stout and Jablonowski – p. 251/324
Load-Balancing Summary
Load-balancing is critical for high performance.
Depending on the application, can range from trivial to nearly impossible. A wide range of approaches are needed, and new ones are constantly being developed.
Load-balancing needs to be approached as part of a systematic effort to improve performance.
Try simple approaches first.
Stout and Jablonowski – p. 252/324
DATA INTENSIVE COMPUTING
Databases are an important commercial application of parallel computers, providing a base which helps keep commercial parallel computing viable.
Massive data collections becoming important in scientific fields such as bioinformatics, astronomy, physics, ...
Many of the ideas are used elsewhere, though sometimes obscured by different terminology. We'll just briefly examine some aspects.
Stout and Jablonowski – p. 253/324
Application Areas
Web browsing
Real-time applications: air traffic, stock trading, streaming multimedia
Data Warehouse: organize massive amounts of commercial, scientific data
CERN Large Hadron Collider: ≈ 30 TB/day, ≈ 10 PB/year
Data Mining: extract useful information from vast collections of text, photographs, web pages, etc.
Stout and Jablonowski – p. 254/324
Some Terminology
Often data intensive systems use terminology that is somewhat different, though often the ideas are similar to ones already touched on. Some examples:
skew = load imbalance
scaleup = speedup
transactions per second (TPS) = throughput
TPS is often used to measure performance, instead of flops
Stout and Jablonowski – p. 255/324
Characteristics
Disk access and bandwidth dominate performance. Organizing the information to match the access patterns is often critical.
Systems for scientific applications are somewhat newer, complicated by factors such as being dispersed among sites, people trying to combine or mine information in new ways, billions of files (e.g., a constant stream of images), etc.
Sample science collections include the Large Hadron Collider, Digital Sky, Earth Observation System. Many provide specialized tools to access the information.
Stout and Jablonowski – p. 256/324
Parallel Disk Architectures
Shared Everything (SE): All disks are directly accessible from all processors and all memory is shared, i.e., a standard shared memory system.
Shared Nothing (SN): Each disk is connected to a single processor or SMP, each has its own private memory. Most common option in clusters.
Shared Disks (SD): Any processor can access any disk, but each processor has its own private memory, e.g., storage networks.
Stout and Jablonowski – p. 257/324
Shared Everything
[Figure: processors P1 ... Pn connected by an interconnection network to a global shared memory and shared disks.]
Stout and Jablonowski – p. 258/324
Shared Nothing
[Figure: processors P1 ... Pn, each with its own private memory and private disk, connected by an interconnection network.]
Stout and Jablonowski – p. 259/324
Shared Disk
[Figure: processors P1 ... Pn, each with private memory, connected by an interconnection network to shared disks.]
Stout and Jablonowski – p. 260/324
Data Partitioning Strategies
Range Partitioning (block allocation): Easy to locate records, related data can be clustered, but danger of skew.
[Figure: consecutive key ranges mapped to consecutive disks.]
Stout and Jablonowski – p. 261/324
Data Partitioning continued
Round Robin (cyclic, striping): Allows parallelism in accessing consecutive records, but ties up many disks if different programs are running on the system.
[Figure: consecutive keys cycled across the disks.]
Stout and Jablonowski – p. 262/324
Data Partitioning continued
Hashing: Avoids systematic bottlenecks, allows for an expanding collection of keys (such as names), but complicates range queries.
[Figure: keys scattered across the disks by a hash function.]
Stout and Jablonowski – p. 263/324
Data Partitioning Parallels
Block allocation and round robin allocation are used in dividing loops in OpenMP.
Round robin allocation is used in memory systems of vector machines.
Block allocation is used in memory systems of commodity processors.
Hashed allocation used in memory system of Cydrome.
Stout and Jablonowski – p. 264/324
Data Mining
Sifting for information in a torrent of data is economically and scientifically important.
AT&T, WalMart, American Express, ... have used it for many years. Bioinformatics is an important new application area.
Many commercial data mining tools, often parallelized.
Warning: "data mining" means many different things to different people and applications.
Stout and Jablonowski – p. 265/324
Map-Reduce: New Form of Data Mining
Variations used by Google, Yahoo, IBM, etc.
Open source Hadoop: http://hadoop.apache.org/core/
Companies trying to get schools to teach this style of programming
Basic database operations, extended to less organized, far larger, systems.
Simple example: given records of (source page, link) for every company find # pages from outside the company that point to one of the company's pages.
Stout and Jablonowski – p. 266/324
Map: determine if the link record is from a page outside a company into it. If so, generate a new record (destination company, 1)
embarrassingly parallel, vast number of records, I/O bound
Reduce: combine records by company and sum the counts
requires communication, but far fewer records
Implementations: significant emphasis on locality, efficiency, fault tolerance
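A toy C sketch of the map and reduce steps for this link-count example; the record layouts, company names and function names are hypothetical, and the real frameworks distribute the key/value pairs across many machines and files:

#include <stdio.h>
#include <string.h>

typedef struct { char src_company[32]; char dst_company[32]; } LinkRecord;
typedef struct { char company[32]; int count; } Pair;

/* Map: emit (destination company, 1) for every cross-company link. */
static int map_link(const LinkRecord *r, Pair *out)
{
    if (strcmp(r->src_company, r->dst_company) == 0)
        return 0;                                /* internal link: emit nothing */
    strcpy(out->company, r->dst_company);
    out->count = 1;
    return 1;
}

/* Reduce: sum the counts of all pairs with the same company key. */
static int reduce_company(const char *company, const Pair *pairs, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        if (strcmp(pairs[i].company, company) == 0)
            total += pairs[i].count;
    return total;
}

int main(void)
{
    LinkRecord recs[3] = { {"acme", "globex"}, {"acme", "acme"}, {"initech", "globex"} };
    Pair pairs[3];
    int npairs = 0;
    for (int i = 0; i < 3; i++)
        npairs += map_link(&recs[i], &pairs[npairs]);
    printf("globex: %d\n", reduce_company("globex", pairs, npairs));   /* prints 2 */
    return 0;
}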
Stout and Jablonowski – p. 267/324
Sample Map-Reduce Execution
Source: http://code.google.com/edu/parallel/mapreduce-tutorial.html
Stout and Jablonowski – p. 268/324
PERFORMANCE
Developing large-scale scientific or commercial applications that make optimum use of the computational resources is a challenge.
Resources can easily be underutilized or used inefficiently.
The factors that determine the program's performance are often hidden from the developer.
Performance analysis tools are essential to optimizing the serial or parallel application.
Performance is typically measured in floating point operations per second: Mflop/s, Gflop/s or Tflop/s.
Stout and Jablonowski – p. 269/324
CPU Performance Measures
Performance
is compared via benchmarks like LINPACK
more relevant: benchmarks with user application
most often on scalar machines: cache-optimized programs reach ≈ 10% of the peak performance
Example: Weather prediction code IFS (ECMWF)
Stout and Jablonowski – p. 270/324
Application-System Interplay
System factors:
Chip architecture (e.g. # floating point units per CPU)
Memory hierarchy (register - cache - main memory - disk)
I/O configuration
Compiler
Operating System
Connecting network between processors
Stout and Jablonowski – p. 271/324
Application-System Interplay
Application factors:
Programming language
Algorithms and implementation
Data structures
Memory management
Libraries (e.g. math libraries)
Size and nature of data set
Compiler optimization flags
Use of I/O
Message passing library / OpenMP
Communication pattern
Task granularity
Load balancing
Stout and Jablonowski – p. 272/324
Performance Gains: Hardware
Factor ≈ 10^4 over the last 15 years
Stout and Jablonowski – p. 273/324
Performance Gains: Software
Gains expected from better algorithms, example:
[Figure: speedup factor (10^0 to 10^5) gained from better linear algebra algorithms between 1970 and 2000: Gauss-Seidel, Successive Over-Relaxation, Conjugate Gradient, Multi-Grid, Sparse Gaussian Elimination.]
Derived from Computational Methods (Linear Algebra)
Gains also expected from better load-balancing strategies, parallel I/O, etc.
Stout and Jablonowski – p. 274/324
Parallel Performance Analysis
Reliable performance analyses are the key to improving the performance of a parallelized program.
They reveal not only typical bottleneck situations but also determine the hotspots.
Key question: How efficient is the parallel code?
Important to consider: Time spent
communicating to other processors
waiting for a message to be received
wasted waiting for other processors
When selecting a performance tool consider:
How accurate is the technique?
Is the tool simple to use?
How intrusive is the tool?
Stout and Jablonowski – p. 275/324
Parallel and Serial Performance Analysis
Goal: reduce the program's wallclock execution time
Practical, iterative approach:
measure the code with a hardware performance monitor and profiler
analyze hotspots
optimize and parallelize hotspots and eliminate bottlenecks
evaluate performance results and improve optimization / parallelization
Analysis techniques
Timing
Counting
Profiling
Tracing
Stout and Jablonowski – p. 276/324
Timing of Parallel Programs
MPI / OpenMP provide compiler-independent timing functions for the wallclock time
MPI_Wtime / OMP_GET_WTIME
Requires source code changes: instrument the program
Typical sequence (MPI program):
real t1, t2, seconds
t1 = MPI_Wtime()
... code to be timed
t2 = MPI_Wtime()
seconds = t2 - t1   ! wallclock time
Evaluation of parallel speedup: Always measure the wallclock time. Measuring the CPU time would neglect the system overhead for the parallelization!
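The same instrumentation idea in C, using MPI_Wtime (or omp_get_wtime in a pure OpenMP code); work_section is a placeholder for the code to be timed:

#include <mpi.h>
#include <stdio.h>

static void work_section(void) { /* ... code to be timed ... */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double t1 = MPI_Wtime();
    work_section();
    double t2 = MPI_Wtime();
    printf("wallclock seconds: %f\n", t2 - t1);

    MPI_Finalize();
    return 0;
}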
Stout and Jablonowski – p. 277/324
Hardware Performance Monitors (HPM)
Hardware counters gather performance-relevant events of the microprocessor without affecting the performance of the analyzed program. Two classes:
Processor monitor:
non-intrusive counts
consists of a group of special purpose registers
registers keep track of events during runtime: general and floating point instructions, cache misses, branch misprediction
measures the Mflop/s rate fairly accurately
System level monitor (bus and network monitor):
bus monitor: memory traffic, cache coherency
network monitor: records network traffic
Stout and Jablonowski – p. 278/324
PAPI: The Portable Performance API
mature public-domain Hardware Performance Monitor
version Papi 3.6.1 released in 8/2008
vendor independent hardware counter tool
supports most current processors including the "Cell" processor
user needs to instrument the code ⇒ PAPI functions
Fortran and C/C++ user interfaces
easy-to-use and powerful high level API
Home page: http://icl.cs.utk.edu/papi/index.html
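A minimal instrumentation sketch using PAPI's low-level API with the preset events PAPI_FP_OPS and PAPI_TOT_CYC; error handling is omitted, the measured function is a placeholder, and the calls should be checked against the installed PAPI version:

#include <papi.h>
#include <stdio.h>

static void do_work(void) { /* ... code to be measured ... */ }

int main(void)
{
    int eventset = PAPI_NULL;
    long long values[2];

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_FP_OPS);     /* floating point operations */
    PAPI_add_event(eventset, PAPI_TOT_CYC);    /* total cycles */

    PAPI_start(eventset);
    do_work();
    PAPI_stop(eventset, values);

    printf("FP ops: %lld  cycles: %lld\n", values[0], values[1]);
    return 0;
}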
Stout and Jablonowski – p. 279/324
Profiling of Parallel Programs
simplest tool: UNIX profiler gprof
interrupts program execution at constant time intervals
counts the interruptions
the more interruptions, the more time spent in this part of the code
sum of all processors is displayed
Profilers identify hotspots, but limited use for parallel code:
they measure CPU time, not wallclock time
they sum over all invocations of each routine
profilers cannot show load imbalance
Stout and Jablonowski – p. 280/324
Profiling: Graphical User Interfaces
Commercial: allinea opt (http://www.allinea.com), optimization and profiling tool for multiple hardware platforms.
IBM AIX systems (built-in): xprofiler
Graphical user interface based upon the gprof profiling utility.
Displays: Timing and call graph profile, summary charts, source code displays, library clusters.
Filtering and zooming features allow focusing the displays on portions of the call tree.
Public domain Tuning and Analysis Tool TAU: http://www.cs.uoregon.edu/research/tau
Stout and Jablonowski – p. 281/324
GUI Example Xprofiler
Portions of the program which accumulate the most "ticks" (interrupts) reflect the area where the program spends the most time
Stout and Jablonowski – p. 282/324
Profiling: Pitfalls
Due to the periodic sampling of the program counter the output might be slightly different when the same program is profiled multiple times.
Measure the code over a representative time interval using typical data sets. Sampling should last at least several minutes.
Optimizing compiler flags are allowed: expect a different profile when using the -O option, try with and without optimization.
Different hardware / different compilers might lead to different profiles.
But: the most time consuming functions should be detected in any case, maybe in a different order.
Stout and Jablonowski – p. 283/324
MPI and OpenMP Trace Tools
Collect trace data at run time, display post-mortem
Assess performance, bottlenecks and load-balancing problems in MPI & OpenMP codes
Intel's trace visualization tool Trace Analyzer & Collector (only on Intel platforms)
Vampir and Vampirtrace (platform independent)
Trace analyzer developed and supported by the Center for Information Services and High Performance Computing, Dresden, Germany (http://vampir.eu)
Free evaluation keys for both available online.
Stout and Jablonowski – p. 284/324
Trace Analyzer & Collector / Vampir
Trace Analyzer / Vampir graphical user interface helps
understand the application behavior
evaluate load balancing
show barriers, locks, synchronization
analyze the performance of subroutines/code blocks
learn about communication and performance
identify communication hotspots
Trace Collector / Vampirtrace
Libraries that trace MPI and application events, generate a trace file (files can become big!)
Convenient: Re-link your code and run it
Provides API for more detailed analyses
Stout and Jablonowski – p. 285/324
Graphical User Interface
Trace Analyzer / Vampir provides graphical displays that visualize important aspects of the runtime behavior:
detailed timeline view of events and communication
statistical analysis of program execution
statistical analysis of communication operations
dynamic calling tree and source-code display
I/O statistics
Trace Analyzer / Vampir
provides powerful zooming and filtering features
can display source code references if recorded
Vampir supported on almost all HPC platforms
Stout and Jablonowski – p. 286/324
Vampir Analysis – Global Timeline
here: uninstrumented version of the program
therefore: the routines of the user code cannot be distinguished and are displayed as "Application"
Stout and Jablonowski – p. 287/324
Vampir Analysis – Zoom-in Timeline
Zoom-in: ⇒ Communication and synchronization
Stout and Jablonowski – p. 288/324
Vampir Analysis – Activity Chart
Global activity chart ⇒ Load-imbalance
Stout and Jablonowski – p. 289/324
Vampir Analysis – Summary Chart
Summary for the whole application: timing data
Summary for the whole application: timing data
Stout and Jablonowski – p. 290/324
Vampir Analysis – MPI Summary
Stout and Jablonowski – p. 291/324
Public Domain Trace Tools
Jumpshot-4 (http://www-unix.mcs.anl.gov/perfvis/)
Graphical displays of timelines, histograms, MPI overhead and more
Instant zoom in/out, search/scan facility
TAU – Tuning and Analysis Utilities (version 2.17.1)
Developed at the University of Oregon, mature
Free, portable, open-source profiling/tracing facility (http://www.cs.uoregon.edu/research/tau)
Performance instrumentation, measurement and analysis toolkit for distributed and shared memory applications (includes MPI, OpenMP)
Graphical displays for all or individual processes
Manual or automatic source code instrumentation
Stout and Jablonowski – p. 292/324
Performance Analysis: Strategy
Hardware counters provide information on Mflop/s rates: do you need to optimize?
Use profilers to identify hotspots
Focus the analysis/optimization efforts on the hotspots
Analyze trace information: gives detailed overview of the parallel performance and load-balance, and reveals bottlenecks
two different modes: the uninstrumented or instrumented mode (requires source code changes ⇒ Pitfall: can lead to huge trace files)
Recommendation: instrument only hotspots for a detailed view of the run time behavior
Stout and Jablonowski – p. 293/324
Debugging of Parallel Programs
Increased parallel complexity makes the debugging process more difficult.
Traditional sequential debugging technique is a cyclic approach where the program is repeatedly stopped at breakpoints and then continued or re-executed again.
Conventional style of debugging sometimes difficult with parallel programs: they do not always show reproducible behavior, e.g. race conditions.
Always: turn on compiler debugging options like array-bound checks
Most powerful commercial debuggers:
TotalView (http://www.totalviewtech.com)
allinea ddt (http://www.allinea.com)
Stout and Jablonowski – p. 294/324
Characteristics of Totalview
Very powerful and mature debugger, current version 8.6
Source-level, graphical debugger for C, C++, Fortran, High Performance Fortran (HPF) and assembler code
Multiprocess (MPI) and multithread (OpenMP) codes
Supports multi-platform applications
Intuitive, easy-to-learn graphical interface
Industry leader in MPI and OpenMP debugging
Control functions to run, step, breakpoint, interrupt or restart a process
Ability to control all parallel processes coherently
Good tutorial on TotalView with parallel debugging tips: http://www.llnl.gov/computing/tutorials/totalview/
Stout and Jablonowski – p. 295/324
TotalView: The Process Window
5 panes
[Screenshot annotations: zoom into code or variables; visualize variables; filter, sort or slice data; set breakpoints; scan parallel processes; step by step execution]
Stout and Jablonowski – p. 296/324
TotalView: Message Queue Graph
Graphical representation of the message queue state ⇒ Red = Unexpected, Blue = Receive, Green = Send
Stout and Jablonowski – p. 297/324
Boost the Performance: Practical Tips
Turn on compiler optimization flags
Search for better algorithms and data structures
For scientific codes: use optimized math libraries
Tune the program:
data locality and cache re-use within loops
avoid divisions, indirect addressing, IF statements, especially in loops
loop unrolling and function inlining (often a compiler option), minimize/optimize I/O, ...
Load-balance the code
Avoid synchronization/barriers whenever possible
Optimize partitioning to minimize communication
Identify inhibitors to parallelism: data dependencies, I/O
Stout and Jablonowski – p. 298/324
Parallel Scientific Math Libraries
Parallel math libraries are available on most hardware platforms. Highly optimized and recommended.
ScaLAPACK (Scalable LAPACK):
Public-domain, high-performance linear algebra routines for MPI applications
Promotes modularity via interfaces to the libraries BLAS, BLACS and PBLAS
NAG Parallel Libraries (commercial, often installed):
Mostly high speed linear algebra routines
In addition: random number generation and quadrature routines
PETSc (Portable, Extensible Toolkit for Scientific computation):
Designed with MPI for partial differential equations
Stout and Jablonowski – p. 299/324
Toolkits for Scientific Computing
ACTS toolkit — Advanced CompuTational Software (http://acts.nersc.gov):
Public domain tools mostly developed at US labs
Collection of tools that is interoperable, with API
General solutions to complex programming needs
Includes
Numerical solvers: PETSc, ScaLAPACK, Aztec, ...
Structural frameworks: Software that manages data & communication like Overture and Global Arrays
Runtime & support tools: CUMULUS, TAU
Eclipse: Parallel Tools Platform (PTP)
open-source project: wide variety of parallel tools
http://www.eclipse.org/ptp/
Stout and Jablonowski – p. 300/324
USING PARALLEL SYSTEMS
In addition to programming, there are many issues concerning the use of parallel systems.
For example, they are often a centralized resource that must be shared, much like the mainframes of olden days.
Your institution may decide to purchase a system, or buy time elsewhere.
Stout and Jablonowski – p. 301/324
Batch Queuing
A return to '60s-style computer usage.
Large parallel systems use batch queuing, but may allow small interactive jobs for debugging.
If there are multiple queues, learn how they are structured and serviced; it's you vs. them.
If you submit several jobs at once, you may be your own bottleneck. You might improve throughput by requesting fewer processors, and more time, per job (remember Amdahl's Law; a worked example follows).
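A hedged worked example (the numbers are illustrative, not from the tutorial): with a 5% serial fraction, Amdahl's Law gives speedup S(p) = 1/(0.05 + 0.95/p), so S(16) ≈ 9.1 but S(64) ≈ 15.4. Quadrupling the processors per job speeds each job up by less than a factor of 1.7, whereas four 16-processor jobs running side by side on the same 64 processors complete roughly 2.4 times as much work per hour as one 64-processor job.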
Stout and Jablonowski – p. 302/324
Access to Systems
Academics: Can apply for free time at NSF supercomputing centers, or perhaps at your own university. For modest amounts of time the NSF process is easy and quick, but thousands of hours require a more detailed application. You need to show that you can effectively utilize the machines (e.g., speedup curves) and are doing good research.
Grants from other agencies usually include access to their large systems.
Businesses: Can purchase time from hardware vendors, sometimes from university centers.
Stout and Jablonowski – p. 303/324
Purchasing Systems
Buying systems is very complicated. Some questions:
Can it run your major applications? This may depend on ISVs.
Will vendor be around in five years?
Is there an upgrade path if you need to expand soon?
Can you get (and afford) tools, compilers, and libraries for developing new applications?
Is the system reliable? Is the maintenance policy acceptable?
Do you have sufficient power and air conditioning?
Stout and Jablonowski – p. 304/324
What to Buy?
How much of the budget goes to processors vs. memory vs. communication?
Do you want more, or faster, processors, i.e., price-performance or performance?
Need to understand your major applications, and delivered versus peak performance.
Stout and Jablonowski – p. 305/324
Where are You on the Curve?
As a user or buyer:
[Plot: speedup and price/performance as functions of the number of processors.]
Stout and Jablonowski – p. 306/324
Cluster Systems
Some groups build their own; resources are available to help.
However, many users just look at machine cost.
Typically, total costs are at least twice the initial costs.
Maintenance:
Many little things, hardware and software, go wrong or need upgrading; who will keep fixing this?
Who does backups?
Maintenance is time-consuming and harmful to your career.
Stout and Jablonowski – p. 307/324
WRAP UP
We'll review some of the material learned, discuss some general problems with parallel computing, and point out some trends in the area.
Stout and Jablonowski – p. 308/324
Trends in Parallel Computing
It's useful to have a sense of where the field is going.
Stout and Jablonowski – p. 309/324
Trend: Power Critical
For given technology, typically power ≈ speed²
Heat limits density
Power In = Heat Out, so AC demands also increase
System speed requires closely packed components: systems such as BlueGene make a tradeoff, accepting a slower clock speed and smaller RAM in exchange for greater density and more total processing power.
This tradeoff runs opposite to programmer needs (Amdahl's Law).
Stout and Jablonowski – p. 310/324
Power Trend
Stout and Jablonowski – p. 311/324
Trend: Chip Density Still Increasing
Hardware designers are running out of old tricks, so they just replicate processors on chips: multi-core, many-core.
While potential chip performance continues to increase, the number of I/O wires per chip doesn't keep pace with the number of cores, putting more stress on cache locality.
GPUs (and the IBM Cell) have a large number of simple processors and very high FLOPS, but need locality for efficient vector-like operations.
Stout and Jablonowski – p. 312/324
Chip Trends
Sources: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Stout and Jablonowski – p. 313/324
Good Optimization still Bleeding Edge
Economics pushes for using commodity parts, especially since they have high potential. Unfortunately:
No useful GPU programming standards
Multicores differ on the caching provided
No good way to optimize for both GPU and multicore: portable optimization not yet attained
Need better compilers to exploit parallelism (e.g., much smarter OpenMP compilers)
Need better ways of expressing parallelism (ask DARPA, Intel, Microsoft, etc.!)
Stout and Jablonowski – p. 314/324
More Trends
Roadrunner grabs the headlines, but clusters and SMPs are the most important economically; "commodity" parts now include chips, boards, blades, ...
Increasing use of commercial parallelized software
Some parallel computing companies will fail.
Stout and Jablonowski – p. 315/324
Should You Parallelize?
Parallel programming is difficult; is it worthwhile? Pancake [1996] suggests first determining:
How often is the program used between changes?
How much time does it take (or is expected to take)?
How satisfied are users with current results?
Need more resolution
Need results faster
Will be flooded with data, ...
Stout and Jablonowski – p. 316/324
Degrees of Difficulty
Some problems are much easier to parallelize than others. Classes of problems range from:
Embarrassingly parallel Separate jobs with no interaction, easy to run on any system.
Static Important load-balancing parameters, such as size, are known in advance. Often the same configuration is run many times.
Data-dependent Dynamic Often quite difficult to achieve an efficient implementation.
Stout and Jablonowski – p. 317/324
Review: Software Engineering
Standard programming interfaces (e.g., MPI, OpenMP) and tools reduce the learning curve and preserve investment.
Start with an overview of data structures & time requirements; do profiling as needed.
Prioritize sections to be parallelized, and adapt as youlearn.
Parallelize at the outermost loop possible (see the OpenMP sketch at the end of this slide)
Proceed incrementally, constantly verifying correctness
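As a hedged sketch of the "outermost loop" advice (not from the tutorial; the array names and sizes are invented), the OpenMP fragment below parallelizes the outer i loop, so each thread gets a large contiguous chunk of iterations instead of paying fork/join overhead on every inner sweep.

/* outer_loop_omp.c -- illustrative only.  Compile with an OpenMP flag,
 * e.g.  cc -fopenmp outer_loop_omp.c
 */
#include <stdio.h>

#define NI 1000
#define NJ 1000

static double a[NI][NJ], b[NI][NJ];

int main(void)
{
    /* Parallelizing the OUTER i loop gives each thread whole rows;
     * parallelizing the inner j loop instead would start and stop the
     * thread team NI times per sweep. */
    #pragma omp parallel for
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            a[i][j] = 0.25 * (b[i][j] + 2.0) + 0.5 * a[i][j];

    printf("a[0][0] = %f\n", a[0][0]);
    return 0;
}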
Stout and Jablonowski – p. 318/324
Review: Efficiency
Reduce communication costs:
maximize data locality
eliminate false sharing in shared memory systems
combine messages to reduce overhead and synchronization
send data (distributed memory) or write data (shared memory) early, receive or read late (see the MPI sketch at the end of this slide).
Reduce load imbalance and synchronization.
Utilize compiler optimizations, optimized routines, etc.
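To illustrate "send early, receive late" in the distributed-memory case, here is a hedged MPI sketch (not from the tutorial; the buffer size and neighbor pattern are invented). Nonblocking sends and receives are posted as soon as the data is ready, local computation proceeds while the message is in flight, and the waits happen only where the received data is actually needed.

/* overlap_demo.c -- illustrative only.  Run with 2 ranks,
 * e.g.  mpirun -np 2 ./overlap_demo
 */
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    static double outbuf[N], inbuf[N];
    int rank, size;
    MPI_Request sreq, rreq;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int partner = (rank + 1) % size;          /* exchange with a neighbor */

    /* Post the receive and send as EARLY as possible ... */
    MPI_Irecv(inbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &rreq);
    MPI_Isend(outbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &sreq);

    /* ... overlap with computation that does not need the message ... */
    double local = 0.0;
    for (int i = 0; i < N; i++)
        local += outbuf[i];

    /* ... and wait as LATE as possible, just before inbuf is used. */
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);

    printf("rank %d: local sum %g, first received value %g\n",
           rank, local, inbuf[0]);

    MPI_Finalize();
    return 0;
}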
Stout and Jablonowski – p. 319/324
If It Isn’t Working Well . . .
The original program probably wasn't written with parallelism in mind
See if there is a more parallelizable approach
Sometimes parallelizable approaches aren't the most efficient ones available for serial computers, but that is OK if you are going to use many processors.
Remember Amdahl’s Law:
Efficient massive parallelism is difficult.
Stout and Jablonowski – p. 320/324
Finally • • •
Make sure your goals are realistic, and remember that your own time is valuable.
Stout and Jablonowski – p. 321/324
REFERENCES
Selected web resources for parallel computing are (occasionally) maintained at
http://www.eecs.umich.edu/~qstout/parlinks.html
Stout and Jablonowski – p. 322/324
References
[G. Amdahl 1967], “Validity of the single processor approach to achieving large scalecomputing capabilities”, AFIPS Conf. Proc. 30 (1967), pp. 483–485.
Co-Array Fortran: http://www.co-array.org.
[M.J. Flynn 1966], “Very high-speed computing systems”, Proc. IEEE 54 (1966),pp. 1901–1909.
[J.L. Gustafson 1988], “Reevaluating Amdahl’s Law”, Communications of the ACM 31(1988), pp. 532–533.
Hadoop: http://hadoop.apache.org/core/
Hilbert space-filling curve: see the routines available in Zoltan (listed below).
[MacNeice et al. 2000], "PARAMESH: A parallel adaptive mesh refinement community toolkit", Comp. Physics Commun. 128 (2000), pp. 330–354.
Metis and Parmetis: http://www.cs.umn.edu/~karypis/metis/
MPI: documentation at http://www.mpi-forum.org/
Free, portable versions at: http://www.mcs.anl.gov/research/projects/mpich2
http://www.open-mpi.org/
Stout and Jablonowski – p. 323/324
References continued
OpenMP: http://openmp.org/wp/.
[C.M. Pancake 1996], "Is parallelism for you?", IEEE Comp. Sci. & Engin., 3 (1996), pp. 18–37.
[Pancake, Simmons, and Yan 1995], "Performance evaluation tools for parallel and distributed systems", IEEE Computer, Vol. 28, No. 11 (1995), pp. 16–20.
Parallel computing, a slightly whimsical explanation: http://www.eecs.umich.edu/~qstout/parallel.html
Roadrunner: http://www.lanl.gov/roadrunner/index.shtml
[Stout et al. 1997], "Adaptive blocks: A high-performance data structure", Proc. SC'97. http://www.eecs.umich.edu/~qstout/abs/SC97.html
Top500. Website with extensive collection of references: http://www.Top500.org
UPC (Unified Parallel C): http://upc.gwu.edu.
Zoltan (collection of routines for load balancing et al.): http://www.cs.sandia.gov/Zoltan.
Stout and Jablonowski – p. 324/324