TRANSCRIPT
Parallel Computing 101
Quentin F. Stout Christiane Jablonowski
University of Michigan
Copyright © 2008
Stout and Jablonowski – p. 1/324
Organization
Part I
Introduction, Terminology
Example (crash simulation)
Speedup and Efficiency, Amdahl's Law
Architectures
Distributed Memory Communication, MPI
Parallelizing Serial Programs I
Load Balancing I
Shared Memory, OpenMP
Stout and Jablonowski – p. 2/324
Organization cont.
Part II
Hybrid Computing
Vector Computing, Climate Modeling
Parallelizing Serial Programs II
Load Balancing II
Data Intensive Computing
Performance Improvement, Tools
Using and Buying Parallel Systems
Review, Wrapup
Stout and Jablonowski – p. 3/324
INTRODUCTION
In this part we introduce parallel computing and some useful terminology. We examine many of the variations in system architecture, and how they affect the programming options.
We will look at a representative example of a large scientific/engineering code, and examine how it was parallelized. We also consider some additional examples.
Stout and Jablonowski – p. 4/324
Why use Parallel Computers?
Parallel computers can be the only way to achieve specific computational goals at a given time.
PetaFLOPS and Petabytes for Grand Challenge problems
kilo-transactions per second for search engines, ATM networks, digital multimedia
Parallel computers can be the cheapest or easiest way to achieve a specific computational goal at a given time: e.g., cluster computers made from commodity parts.
Parallel computers can be made highly fault-tolerant:
nonstop computing at nuclear reactors
web search
Stout and Jablonowski – p. 5/324
Why Parallel Computing — continued
The universe is inherently parallel, so parallel models fit it best.
Physical processes occur in parallel: weather, galaxy formation, nuclear reactions, epidemics, ...
Social/work processes occur in parallel: ant colonies, wolf packs, assembly lines, stock exchange, tutorials, ...
Stout and Jablonowski – p. 6/324
Basic Terminology and Concepts
Caveats
The definitions are fuzzy, many terms are not standardized, and definitions often change over time.
Many algorithms, software, and hardware systems do not match the categories, often blending approaches.
No attempt is made to cover all models and aspects of parallel computing. For example, quantum computing is not included.
Stout and Jablonowski – p. 7/324
Parallel Computing Thesaurus
Parallel Computing: Solving a task by the simultaneous use of multiple processors, all components of a unified architecture.
Embarrassingly Parallel: Solving many similar, but independent, tasks. E.g., parameter sweeps.
Symmetric Multiprocessing (SMP): Multiple processors sharing a single address space and access to all resources.
Multi-core Processors: Multiple processors (cores) on a single chip. Aka many-core. Heterogeneous multi-core chips with GPUs are being developed.
Cluster Computing: Hierarchical combination of commodity units (processors or SMPs) to build a parallel system.
Stout and Jablonowski – p. 8/324
Thesaurus continued
Supercomputing: Use of the fastest, biggest machines to solve large problems. Historically vector computers, but now parallel or parallel/vector.
High Performance Computing: Solving problems via supercomputers + fast networks + visualization.
Pipelining: Breaking a task into steps performed by different units, with inputs streaming through, much like an assembly line.
Vector Computer: An operation such as multiply is broken into several steps and applied to a stream of operands (pipelining with “vectors”).
Stout and Jablonowski – p. 9/324
Pipelining, Detroit Style
Stout and Jablonowski – p. 10/324
Who Uses Supercomputers?
Historically, the military (nuclear simulations, cryptography). Weather forecasting was the main civilian application.
These continue to be major users, but there are now many more civilian users.
The following charts are from the Top 500 list, showing the status as of June. The newest list has just been announced and is on the Top500 website:
http://www.top500.org
Stout and Jablonowski – p. 11/324
Top500: Performance
Stout and Jablonowski – p. 12/324
Top500: Application Systems
Stout and Jablonowski – p. 13/324
Top500: Architecture Systems
Stout and Jablonowski – p. 14/324
Top500: Vendor Systems
Stout and Jablonowski – p. 15/324
CRASH SIMULATION
A greatly simplified model, based on parallelizing crash simulation for Ford Motor Company. Such simulations save a significant amount of money and time compared to testing real cars.
This example illustrates various phenomena which are common to a great many simulations and other large-scale applications.
Stout and Jablonowski – p. 16/324
Finite Element Representation
The car is modeled by a triangulated surface (the elements).
The simulation consists of modeling the movement of the elements during each time step, incorporating the forces on them to determine their new positions.
In each time step, the movement of each element depends on its interaction with the other elements that it is physically adjacent to.
Stout and Jablonowski – p. 17/324
The Car of the Future
Stout and Jablonowski – p. 18/324
Basic Serial Crash Simulation
1 For all elements
2 Read State(element), Properties(element),
Neighbor_list(element)
3 For time=1 to end_of_simulation
4 For element = 1 to num_elements
5 Compute State(element) for next time step,
based on previous state of element and its
neighbors, and on properties of element
Periodically, State is stored on disk for later visualization.
Stout and Jablonowski – p. 19/324
Simple approach to parallelization
Parallel computer based on PC-like processors linked with a fast network, where processors communicate via messages: distributed memory or message-passing.
Cannot parallelize time, so parallelize space.
Distribute elements to processors; each processor updates the positions of the elements it contains: owner computes.
All machines run the same program: SPMD, single program multiple data.
SPMD is the dominant form of parallel computing.
Stout and Jablonowski – p. 20/324
A Distributed Car
Stout and Jablonowski – p. 21/324
Basic Parallel Version
Concurrently for all processors P
1 For all elements assigned to P
2 Read State(element), Properties(element),
Neighbor-list(element)
3 For time=1 to end-of-simulation
4 For element = 1 to num-elements-in-P
5 Compute State(element) for next time step,
based on previous state of element and its
neighbors, and on properties of element
Stout and Jablonowski – p. 22/324
Software Engineering Aspects
Most parallel code is the same as, or similar to, the serial code, reducing parallel development and life-cycle costs, and helping keep parallel and serial versions compatible.
Life-cycle costs are often overlooked until it is too late!
Note that the high-level structure is the same as the serial version: a sequence of steps. The sequence is a serial construct, but the steps are performed in parallel.
Stout and Jablonowski – p. 23/324
Some Basic Questions: Allocation
How are elements assigned to processors?
Typically element assignment is determined by serial preprocessing, using domain decomposition approaches (load-balancing) described later.
Stout and Jablonowski – p. 24/324
Separation?
How does a processor keep track of adjacency info for neighbors in other processors?
Use ghost cells (halo) to copy remote neighbors, and add a translation table to keep track of their location and of which local elements are copied elsewhere.
Stout and Jablonowski – p. 25/324
Ghost Cells
Stout and Jablonowski – p. 26/324
Update?
How does a processor use State(neighbor) when it doesnot contain the neighbor element?
Could request state information from the processor containing the neighbor. However, it is more efficient if that processor sends it.
Stout and Jablonowski – p. 27/324
Coding and Correctness?
How does one manage the software engineering of the parallelization process?
Utilize an incremental parallelization approach.
Constantly check test cases to make sure the answers are correct.
Stout and Jablonowski – p. 28/324
Efficiency?
How do we evaluate the success of the parallelization, and if not successful, how do we improve it?
Evaluate via speedup or efficiency metrics; improve via profiling and iterative refinement.
Stout and Jablonowski – p. 29/324
Evaluating Parallel Programs
An important component of effective parallel computing is determining whether the program is performing well. If it is not running efficiently, or cannot be scaled to the target number of processors, then one needs to determine the causes of the problem and develop better approaches.
Stout and Jablonowski – p. 30/324
Definitions
For a given problem A, let
SerTime(n) = Time of the best serial algorithm to solve A for input of size n.
ParTime(n,p) = Time of the parallel algorithm + architecture to solve A for input of size n, using p processors.
Note that SerTime(n) ≤ ParTime(n,1).
Speedup: SerTime(n) / ParTime(n,p)
Work (cost): p · ParTime(n,p)
Efficiency: SerTime(n) / [p · ParTime(n,p)]
Stout and Jablonowski – p. 31/324
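To make these definitions concrete, here is a small C illustration (the measured times are hypothetical, not from the tutorial):

#include <stdio.h>

/* Compute speedup, work, and efficiency from measured run times:
   ser_time = time of the best serial algorithm,
   par_time = time of the parallel code on p processors. */
static void report(double ser_time, double par_time, int p)
{
    double speedup    = ser_time / par_time;
    double work       = p * par_time;
    double efficiency = ser_time / (p * par_time);
    printf("p=%d  speedup=%.2f  work=%.1f  efficiency=%.2f\n",
           p, speedup, work, efficiency);
}

int main(void)
{
    report(100.0, 16.0, 8);   /* hypothetical: 100 s serial, 16 s on 8 processors */
    return 0;                 /* prints speedup=6.25, efficiency=0.78 */
}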
In general, expect:
0 < Speedup ≤ p
Serial Work ≤ Parallel Work < ∞
0 < Efficiency ≤ 1
Technically, speedup is linear if there is a constant c > 0 so that the speedup is at least c · p. However, many use this term to mean c = 1.
This always involves some restriction on the relationship of p and n, e.g., p ≤ n, or p = √n.
Stout and Jablonowski – p. 32/324
Observed Speedup
[Figure: observed speedup vs. number of processors, with curves labeled Perfect, Occasional, and Common.]
Stout and Jablonowski – p. 33/324
Superlinear Speedup
Very rare. Some reasons for speedup > p (efficiency > 1):
The parallel computer has p times as much RAM, so a higher fraction of program memory is in RAM instead of on disk. An important reason for using parallel computers.
In developing the parallel program, a better algorithm was discovered; the older serial algorithm was not the best possible. A useful side-effect of parallelization.
The parallel computer is solving a slightly different, easier problem, or providing a slightly different answer. Questionable practice.
Stout and Jablonowski – p. 34/324
Amdahl’s Law
Amdahl [1967] noted: given a program, let f be the fraction of time spent on operations that must be performed serially. Then for p processors,
Speedup(p) ≤ 1 / (f + (1 − f)/p).
(The right-hand side assumes perfect parallelization of the (1 − f) part of the program.)
Thus no matter how many processors are used:
Speedup ≤ 1/f
Unfortunately, typically f was 10 – 20%.
Stout and Jablonowski – p. 35/324
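As an illustration (a sketch, not code from the tutorial), Amdahl's bound is easy to evaluate directly:

#include <stdio.h>

/* Upper bound on speedup with serial fraction f on p processors (Amdahl's Law). */
static double amdahl_bound(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    double f = 0.1;                                   /* 10% serial work */
    for (int p = 1; p <= 1024; p *= 2)
        printf("p=%4d  speedup <= %.2f\n", p, amdahl_bound(f, p));
    return 0;                                         /* bound approaches 1/f = 10 */
}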
Useful rule of thumb:
If the maximal possible speedup is S (i.e., f = 1/S), then S processors run at about 50% efficiency, since Speedup(S) ≤ 1/(1/S + (1 − 1/S)/S) = S²/(2S − 1) ≈ S/2.
Stout and Jablonowski – p. 36/324
Maximal Possible Speedup
[Figure: maximal possible speedup vs. number of processors (1–1024) for f = 0.1, 0.01, 0.001; each curve levels off near its limit of 1/f.]
Stout and Jablonowski – p. 37/324
Maximal Possible Efficiency
[Figure: maximal possible efficiency vs. number of processors (1–1024) for f = 0.1, 0.01, 0.001; efficiency falls as the number of processors grows.]
Stout and Jablonowski – p. 38/324
Amdahl Was an Optimist
Parallelization usually adds work, typically communication, which reduces speedup.
For example, a crash simulation typically runs for a fixed simulated time interval. Due to the physics of the situation, if one uses n finite elements, the number of time steps grows like √n, so the serial processor time grows like
C1 · n^1.5
for some C1 > 0.
Stout and Jablonowski – p. 39/324
Additional Parallel Communication
Suppose we use p processors. Every time step, processors receive and send information about border elements. There is also periodic global communication of total energy, contact, etc.
For simple approaches, communication time grows like
√n · (C2 · p + C3 · √(n/p)),   with C2, C3 > 0
Stout and Jablonowski – p. 40/324
Effect of Communication
Suppose C2 = C1 = 10 and C3 = 1. Then for n = 1000 we get the following speedup.
[Figure: speedup vs. number of processors (1–1024) for this model; the speedup rises into the teens and then declines as communication dominates.]
Stout and Jablonowski – p. 41/324
Amdahl was a Pessimist
Amdahl convinced many that general-purpose parallel computing was not viable. Fortunately, we can skirt the law.
Algorithm: There may be new algorithms with much smaller values of f — necessity is the mother of invention.
Memory hierarchy: Possibly more time is spent in RAM than on disk — superlinear speedup.
Scaling: Usually the time spent in the serial portion of the code is a decreasing fraction of the total time as the problem size increases — scaling.
Stout and Jablonowski – p. 42/324
Common Program Structure
[Diagram of a typical program: a serial section that grows slowly with n, another serial section that grows slowly with n, a parallelizable loop that grows with n, a parallelizable loop within a loop that grows very rapidly with n, and a serial section taking fixed time.]
Sometimes serial portions grow with problem size, but much slower than the total time.
I.e., Amdahl's “f” decreases as n increases.
Stout and Jablonowski – p. 43/324
Scaling
For such programs, one can often exploit large parallel machines by scaling the problems to larger instances.
To illustrate, use a model like the crash simulation:
SerTime(n) = 10 · n^1.5
and the time for p parallel processors grows like
ParTime(n,p) = 10 · n^1.5/p + 10 · p · √n + n/√p
Stout and Jablonowski – p. 44/324
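A small sketch (using the model above; not code from the tutorial) that tabulates the predicted speedup and efficiency for n = 1000:

#include <math.h>
#include <stdio.h>

/* Model from the slides: SerTime(n) = 10*n^1.5,
   ParTime(n,p) = 10*n^1.5/p + 10*p*sqrt(n) + n/sqrt(p). */
static double ser_time(double n)           { return 10.0 * pow(n, 1.5); }
static double par_time(double n, double p) { return ser_time(n) / p + 10.0 * p * sqrt(n) + n / sqrt(p); }

int main(void)
{
    double n = 1000.0;
    for (int p = 1; p <= 1024; p *= 2) {
        double s = ser_time(n) / par_time(n, (double) p);
        printf("p=%4d  speedup=%7.2f  efficiency=%.3f\n", p, s, s / p);
    }
    return 0;   /* compile with -lm */
}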
Fixed Size per Processor
Fixing the amount of data per processor usually gives the highest efficiency possible, hence it is commonly cited. Called weak scaling.
Suppose each processor can hold 1000 elements.
[Figure "Constant Size per Processor": efficiency vs. number of processors (1–1024); efficiency declines slowly, from near 1.0 down to roughly 0.5 at 1024 processors.]
Stout and Jablonowski – p. 45/324
Fixed Time
Fix the time, and find the largest problem solvable in that time. Commonly used in evaluating database servers, transactions per second. [Gustafson 1988] considered this for general computing.
Fix the time to be SerTime(1000).
[Figure "Constant Time": largest solvable problem size n vs. number of processors (1–1024), with n shown on a scale up to about 16000.]
Stout and Jablonowski – p. 46/324
Fixed Efficiency
Fix the efficiency, and find the smallest problem needed to achieve that efficiency (isoefficiency analysis).
For example, for 90% efficiency:
[Figure "Constant Efficiency of 0.9": smallest problem size n needed vs. number of processors (1–1024), with n on a logarithmic scale from 100 to 10,000,000.]
Stout and Jablonowski – p. 47/324
Scalability
Linear speedup is very rare, due to communication overhead, load imbalance, algorithm/architecture mismatch, etc.
Several attempts have been made to give definitions for scalable architectures, algorithms, or algorithm-architecture combinations. However, for most users, the important question is:
Have I achieved acceptable performance on my software/hardware system for a suitable range of data and machine sizes?
Stout and Jablonowski – p. 48/324
ARCHITECTURAL TAXONOMIES
These classifications provide ways to think about problems and their solution.
The classifications are in terms of hardware, but there are natural software analogues.
Note: many systems blend approaches, and do not exactly correspond to the classifications.
Stout and Jablonowski – p. 49/324
Flynn’s Instruction/Data Taxonomy
[Flynn, 1966] At any point in time a machine can have {S|M}I {S|M}D:
SI Single Instruction: All processors execute the same instruction. Usually involves a central controller.
MI Multiple Instruction: Different processors may be executing different instructions.
SD Single Data: All processors are operating on the same data.
MD Multiple Data: Different processors may be operating on different data.
Stout and Jablonowski – p. 50/324
SISD: standard serial computer and program.
MISD is rare — some extreme fault-tolerance schemes, using different computers and programs to operate on the same input data, are of this type.
Almost all parallel computers are MIMD.
SIMD: there used to be companies that made such systems (Thinking Machines’ Connection Machine was the most famous).
Vector computing is a form of SIMD.
Stout and Jablonowski – p. 51/324
A SIMD System
[Diagram: a controller holding the program broadcasts instructions to a set of processors, each holding its own data.]
Stout and Jablonowski – p. 52/324
SIMD Software
Data parallel software — do the same thing to all elements of a structure (e.g., many matrix algorithms). Easy to write and understand. Unfortunately, difficult to apply to complex problems (as were the SIMD machines).
SPMD, Single Program Multiple Data: can be viewed as an extension of the SIMD approach to programming for MIMD systems.
Stout and Jablonowski – p. 53/324
Memory Systems: Distributed Memory
All memory is associated with processors.
To retrieve information from another processor’s memory, a message must be sent over the network to the home processor. Usually one organizes the program so that the owner sends it to the requestor before being asked.
Advantages:
Memory is scalable with the number of processors.
Each processor has rapid access to its own memory without interference or cache coherency problems.
Cost effective and easier to build: can use commodity parts.
Stout and Jablonowski – p. 54/324
Disadvantages
The programmer is responsible for many of the details of the communication; it is easy to make mistakes.
It may be difficult to distribute the data structures; one often needs to revise them to add additional pointers.
Stout and Jablonowski – p. 55/324
Memory Systems: Shared Memory
Global memory space, accessible by all processors
Processors may have local memory to hold copies ofsome global memory.
Consistency of these copies is usually maintained byhardware.
Advantages:
Global address space is user-friendly; a program may be able to use global data structures efficiently and with little modification.
Data sharing between tasks is fast.
Stout and Jablonowski – p. 56/324
Disadvantages
The system may suffer from a lack of scalability between memory and CPUs. Adding CPUs increases traffic on the shared memory-to-CPU path. This is especially true for cache coherent systems.
The programmer is responsible for correct synchronization.
Needs some special-purpose components.
Stout and Jablonowski – p. 57/324
Shared vs. Distributed
[Diagram. SHARED MEMORY: processors (each with a cache) connected through a network to a common memory. DISTRIBUTED MEMORY: nodes, each consisting of a processor + cache + memory, connected by a network.]
Stout and Jablonowski – p. 58/324
Shared Memory Access Time
Two classes of SM systems based on memory access time:
Uniform Memory Access (UMA):
Most commonly represented by Symmetric Multiprocessor (SMP) machines with identical processors.
Equal access times to memory.
Some systems are CC-UMA (cache coherent UMA): if one processor updates a variable in shared memory, all the other processors know about the update.
Stout and Jablonowski – p. 59/324
SM Access Time continued
Non-Uniform Memory Access (NUMA):
Often made by physically linking two or more SMPs
One SMP can directly access the memory of another SMP (not message-passing).
Memory access times are not uniform; memory access across a link is slower.
Cache coherent systems: CC-NUMA
Stout and Jablonowski – p. 60/324
Shared Memory on Distributed Memory
As we’ll see later, it is usually easier to parallelize a program on a shared memory system.
However, most systems are distributed memory because of the cost advantages.
To gain both advantages, people have investigated virtual shared memory, or global address space (GAS), using software to simulate shared memory access.
Current projects include Unified Parallel C (UPC) and Co-Array Fortran.
Stout and Jablonowski – p. 61/324
Virtual Shared Memory Performance
Communication time in distributed memory machines is quite high. Thus virtual shared memory access is highly nonuniform, being vastly faster if the data is stored with the processor requesting it.
Because of these access delays, the performance of these systems is not good, even if reasonable care is taken, but may be justified by greatly reduced programmer time.
Software and hardware models need not match, though there are often performance problems when they don't.
Stout and Jablonowski – p. 62/324
Communication Network
There are many ways that the processors can be interconnected, but for the user the differences are usually minor. Two main classes that do have some impact:
Bus: Processors (and memory) connected to a common bus or busses, much like a local Ethernet.
Memory access is fairly uniform, but not very scalable due to contention.
Switching Network: Processors (and memory) connected to routing switches as in the telephone system.
Usually NUMA, blocking, though a cross-bar is non-blocking (but a cross-bar is not scalable).
Stout and Jablonowski – p. 63/324
Networks
[Diagram: two networks, a bus and a multistage interconnect, connecting switches and processors.]
Stout and Jablonowski – p. 64/324
Example: Symmetric Multiprocessors
Shared memory system; processors share the work.
When a processor reads or writes RAM, the data is transported over a bus, and a local copy is kept in the processor's cache.
Rules are needed to ensure that different caches don’t contain different values for the same memory locations (cache coherency). This is easier on bus-based systems than on more general interconnection networks.
Because all processors use the same memory bus, there is limited scalability due to bus contention.
Multicore processors, which are SMPs, are becoming the standard processors in all systems.
Stout and Jablonowski – p. 65/324
Low-Cost Parallel Systems
Systems built from commodity parts are becoming widespread due to low cost and acceptable performance.
Clusters (NOW, Beowulfs, etc.): commodity processor boards with multicore processors and commodity interconnects (e.g., Gigabit Ethernet). Often rack mounted.
SMPs: quite common as departmental servers.
Clusters of SMP nodes: rapidly gaining in importance, small SMPs available rack-mounted. Sometimes called clumps.
Stout and Jablonowski – p. 66/324
However, communication on low-cost clusters is often slow, typically due to software which relies on the basic networking stack. Some companies (Myrinet, Force10, etc.) market high-speed networks and special software to reduce this.
Constellations use much larger, much more expensive, shared memory units as nodes in a distributed memory system. Usually a high-performance interconnect is used between the nodes.
Note: Many clusters are primarily used for embarrassingly parallel computation and do not need high-performance networking.
Stout and Jablonowski – p. 67/324
The Memory Hierarchy
The mismatch of processor speed and memory speed causes a bottleneck. There is an inverse relationship between memory speed and $/byte, and there are physical constraints on the size of memory. Thus memory is arranged in a hierarchy:
registers
cache (perhaps itself hierarchical)
RAM (“primary memory”)
disk (“secondary memory”)
tapes or CDs (“tertiary memory”)
Stout and Jablonowski – p. 68/324
Speed-Size Tradeoff
[Figure: the speed-size tradeoff of the memory hierarchy:
cache: ~ MByte, ~ nanosecond access
RAM: ~ GByte, ~ 100 nanosecond access
disk: ~ 100 GByte, ~ 10 millisecond access
tape: ~ 100 TByte, ~ minute access]
When moving between levels beyond the registers, an entire block is moved at once (cache lines, pages). Effective high-performance computing (serial or parallel) includes arranging data and program so that the entire block is used while resident in the faster memory.
Stout and Jablonowski – p. 69/324
Multiprocessor Caching
Parallel computing compounds the memory hierarchy: remote memory is far slower to access than local memory.
Caching is widely used, fetching blocks of data instead of individual items. Data is fetched when referenced, sometimes prefetched before it is needed.
If data locality is high, then the effective memory access time is decreased.
Reduces network traffic.
However, it creates a cache coherence problem.
False sharing caused by cache lines can significantly degrade performance.
Stout and Jablonowski – p. 70/324
MESSAGE PASSING
On distributed memory systems, also called message passing systems, communication is often an important aspect of performance and correctness.
Stout and Jablonowski – p. 71/324
Communication Speed
On most distributed memory systems, messages are relatively slow, with startup (latency) times taking thousands of cycles (and far more for many clusters).
Typically, once the message has started, the additional time per byte (bandwidth) is relatively small.
Stout and Jablonowski – p. 72/324
Measured Performance
For example, a 4.7 GHz IBM Power 6 (p575) processor, best case MPI messages (discussed later):
processor speed: 4700 cycles per microsecond (µsec), 4 flops/cycle, 18800 flops per µsec.
MPI message latency, caused by software: ≈ 1.3 µsec = 24,400 flops
message bandwidth, usually limited by hardware: ≈ 2500 bytes per µsec = 7.5 flops/byte
Your performance may vary!
Stout and Jablonowski – p. 73/324
Reducing Latency
Reducing the effect of high latency is often important for performance. Some useful approaches:
Reduce the number of messages by mapping communicating entities onto the same processor.
Combine messages having the same sender and destination.
If processor P has data needed by processor Q, have P send it to Q, rather than Q first requesting it. P should send as soon as the data is ready; Q should read as late as possible to increase the probability that the data has arrived.
Send Early, Receive Late, Don’t Ask but Tell.
Stout and Jablonowski – p. 74/324
Messages and Computations
Even when data is sent far in advance of its use, message passing can cause performance degradation. One can try to overlap communication and calculation.
Unfortunately:
Many systems are incapable of doing this.
Latency is dominantly due to software; initiating a message ties up the processor.
Even with a co-processor, the memory bus may be tied up, interfering with the main processor's use of it.
Expensive communication systems try to overcome these problems.
Stout and Jablonowski – p. 75/324
Deadlock
If messages are blocking, i.e., if a processor can’t proceed until the message is finished, then one can reach deadlock, where no processor can proceed.
Example: Processor A sends a message to B while B sends to A. With blocking sends, neither finishes until the other finishes receiving, but neither starts receiving until its send has finished.
This can be avoided by A doing send then receive, while B does receive then send. However, this is often difficult to coordinate when there are many processors.
Stout and Jablonowski – p. 76/324
It is often easiest to prevent deadlock by non-blocking communication, where a processor can send and proceed before the receive is finished.
However, this requires receiver buffer space, which may fill (reducing to the blocking case), and extra copying of messages, reducing performance.
Stout and Jablonowski – p. 77/324
Message Passing Interface — MPI
An important communication standard. We will show some snippets of MPI to illustrate some of the issues, but MPI is a major topic that we cannot address in detail. Fortunately, many programs need only a few MPI features. There are many implementations of MPI:
MPICH homepage http://www-unix.mcs.anl.gov/mpi
Open MPI homepage http://www.open-mpi.org/
Stout and Jablonowski – p. 78/324
Some Reasons for Using MPI
Standardized, with a process to keep it evolving.
Available on almost all parallel systems (free MPICH and Open MPI are used on many clusters), with interfaces for C and Fortran.
Supplies many communication variations and optimized functions for a wide range of needs.
Supports large program development and integration of multiple modules.
Many powerful packages and tools are based on MPI.
Stout and Jablonowski – p. 79/324
While MPI is large (> 100 functions), one usually needs very few functions (6-10), giving a gentle learning curve.
Various training materials, tools and aids exist for MPI.
Good introductory MPI tutorial:
http://www.llnl.gov/computing/tutorials/mpi/
Basic and advanced MPI tutorials, e.g. on I/O and one-sided communication:
http://www-unix.mcs.anl.gov/mpi/tutorial/
Writing MPI-based parallel codes helps preserve your investment as systems change.
Stout and Jablonowski – p. 80/324
MPI Basics
The overwhelmingly most frequently used MPI commands are variants of
MPI_SEND() to send data, and
MPI_RECV() to receive it.
These function very much like write & read statements.
Point-to-point communication
MPI_SEND() and MPI_RECV() are blocking operations.
Blocking communication can be unsafe and may lead to deadlocks.
Stout and Jablonowski – p. 81/324
Blocking MPI Communication
MPI_SEND() does not complete until the communication buffer is empty.
MPI_RECV() does not complete until the communication buffer is full.
The send-recv handshake works for small messages, but might fail for large messages.
The allowable size of the message depends on the MPI implementation (buffer sizes), and could also be hardware-dependent.
Even if it works, the data usually gets copied into a memory buffer.
Copies are slow (avoid them): poor performance.
Stout and Jablonowski – p. 82/324
Non-Blocking MPI Communication
Better solution: use non-blocking operations
MPI_ISEND()
MPI_IRECV()
MPI_WAIT()
The user can also check for the data at a later stage in the program without waiting:
MPI_TEST()
Non-blocking operations boost performance.
Other non-blocking send and receive operations are available.
Possible overlap of communication with computation.
However, few systems can provide the overlap; it is often already limited by the memory bandwidth.
Stout and Jablonowski – p. 83/324
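A minimal, compilable sketch of a non-blocking exchange (illustrative; the ring pattern and message length are assumptions, not from the slides):

#include <mpi.h>
#include <stdio.h>

#define N 1000                                   /* assumed message length */

int main(int argc, char *argv[])
{
    int rank, size;
    double send_buf[N], recv_buf[N];
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < N; i++) send_buf[i] = rank;

    int right = (rank + 1) % size;               /* ring neighbors */
    int left  = (rank - 1 + size) % size;

    /* Post the receive and the send without blocking, then wait for both;
       unlike two blocking sends facing each other, this cannot deadlock. */
    MPI_Irecv(recv_buf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... computation that does not need recv_buf could overlap here ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    printf("rank %d received a message from rank %d\n", rank, left);

    MPI_Finalize();
    return 0;
}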
MPI Initialization
Near the beginning of the program, include
#include "mpi.h"MPI_Init(&argc, &argv)MPI_Comm_rank(MPI_COMM_WORLD, &my_rank)MPI_Comm_size(MPI_COMM_WORLD,
&num_processors)
These help each processor determine its role in the overallscheme.
There is MPI_Finalize() at the end.
These 4 MPI functions, together with the MPI send and receive operations, are already sufficient for simple applications.
Stout and Jablonowski – p. 84/324
MPI Example
Each processor sends value to proc. 0, which adds them.
[Diagram: processors 1–8 each send their value to processor 0.]
Stout and Jablonowski – p. 85/324
Basic Program
initialize
if (my_rank == 0) {
    sum = 0.0;
    for (source = 1; source < num_procs; source++) {
        MPI_RECV(&value, 1, MPI_FLOAT, source, tag,
                 MPI_COMM_WORLD, &status);
        sum += value;
    }
} else {
    MPI_SEND(&value, 1, MPI_FLOAT, 0, tag,
             MPI_COMM_WORLD);
}
finalize
Stout and Jablonowski – p. 86/324
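For reference, a complete, compilable version of this example might look as follows (a sketch; using the rank as the value each processor contributes is an assumption for illustration):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int my_rank, num_procs, source, tag = 0;
    float value, sum;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    value = (float) my_rank;                 /* each processor's contribution */

    if (my_rank == 0) {
        sum = 0.0;
        for (source = 1; source < num_procs; source++) {
            MPI_Recv(&value, 1, MPI_FLOAT, source, tag,
                     MPI_COMM_WORLD, &status);
            sum += value;
        }
        printf("sum = %f\n", sum);
    } else {
        MPI_Send(&value, 1, MPI_FLOAT, 0, tag, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}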
Improving Performance
In the initial version, processor 0 received the messages in processor order. However, if processor 1 delayed sending its message, then processor 0 would also be delayed.
For a more efficient version, modify MPI_RECV to
MPI_Recv(&value, 1, MPI_FLOAT, MPI_ANY_SOURCE, tag,
         MPI_COMM_WORLD, &status);
Now processor 0 can start processing messages as soon as any arrives.
Stout and Jablonowski – p. 87/324
Reduction Operations
Operations such as summing are common, combining data from every processor into a single value. These reduction operations are so important that MPI provides direct support for them, and parallelizing compilers recognize them and generate efficient code.
Could replace all the communication with
MPI_REDUCE(&value, &sum, 1, MPI_FLOAT,
           MPI_SUM, 0, MPI_COMM_WORLD)
Examples of reduction operations:
MPI_SUM, MPI_MAX, MPI_MIN, MPI_PROD
MPI_LAND (logical and), MPI_LOR (logical or)
Stout and Jablonowski – p. 88/324
Collective Communication
The opposite of reduction is broadcast: one processor sends to all others.
Reduction, broadcast, and others are collective communication operations, the next most frequently invoked MPI routines after send and receive.
MPI collective communication routines improve clarity, run faster, and reduce the chance of programmer error.
Stout and Jablonowski – p. 89/324
Collective Communication
[Diagram. Broadcast: P0's value A is copied to all of P0–P3. Scatter: P0's values A, B, C, D are distributed, one per processor. Gather: the values A, B, C, D held by P0–P3 are collected onto P0.]
Stout and Jablonowski – p. 90/324
Collective Communication
[Diagram. All gather: each processor's value (A, B, C, D) ends up on every processor. All to all: processor P0 starts with A0–A3, P1 with B0–B3, etc.; afterwards P0 holds A0, B0, C0, D0, P1 holds A1, B1, C1, D1, and so on.]
Stout and Jablonowski – p. 91/324
MPI Synchronization
Synchronization is provided
implicitly by
blocking communication
collective communication
explicitly by
MPI_Wait, MPI_Waitany operations for non-blocking communication: may be used to synchronize a few or all processors
the MPI_Barrier statement: blocks until all MPI processes have reached the barrier
Avoid synchronizations as much as possible to boost performance.
Stout and Jablonowski – p. 92/324
MPI Datatypes
Predefined basic datatypes, corresponding to the underlying programming language; examples are
Fortran: MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION
C: MPI_INT, MPI_FLOAT, MPI_DOUBLE
Derived datatypes:
Vector: data separated by a constant stride
Contiguous: vector with stride 1
Struct: general mixed types (e.g. for a C struct)
Indexed: array of indices
Stout and Jablonowski – p. 93/324
MPI Datatype: Vector
Consider a block of memory (e.g. a matrix with integer numbers):
[Figure: a 4 × 6 integer matrix (Fortran column-major order), with one row of 6 elements highlighted in gray; consecutive elements of the row are 4 apart in memory.]
To specify the gray row (in Fortran order), use
MPI_Type_vector( count, blocklen, stride, old_datatype, new_datatype, ierr)
MPI_Type_commit (new_datatype, ierr)
Stout and Jablonowski – p. 94/324
MPI Datatype: Vector
In the example, we get
MPI_Type_vector( 6, 1, 4, MPI_INTEGER, my_vector, ierr)
MPI_Type_commit (my_vector, ierr)
The new datatype my_vector is a vector that contains 6 blocks, each of 1 integer, with a stride of 4 integers between blocks.
Here we introduce the Fortran notation of the MPI routines (with the additional error flag "ierr").
The Fortran, C and C++ notations are very similar.
Stout and Jablonowski – p. 95/324
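For comparison, a C sketch of the same idea (illustrative; sending a matrix row to another rank is an assumed use, not taken from the slides):

#include <mpi.h>

#define NROWS 4
#define NCOLS 6

/* Send row r of an NROWS x NCOLS matrix stored in column-major order
   (so consecutive row elements are NROWS apart in memory), matching the
   Fortran example: 6 blocks of 1 int each, with a stride of 4 ints. */
void send_row(int a[NROWS * NCOLS], int r, int dest, MPI_Comm comm)
{
    MPI_Datatype row_type;

    MPI_Type_vector(NCOLS, 1, NROWS, MPI_INT, &row_type);
    MPI_Type_commit(&row_type);

    MPI_Send(&a[r], 1, row_type, dest, 0, comm);

    MPI_Type_free(&row_type);
}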
Some Additional MPI Features
Procedures for creating virtual topologies, e.g., indexing processors as a 2-dimensional grid.
User-created communicators (e.g., replacing MPI_COMM_WORLD), useful for selective collective communication (e.g., summing along rows of a matrix) and for incorporating software developed separately.
Support for heterogeneous systems; MPI converts the basic datatypes.
Additional user-specified derived datatypes.
Stout and Jablonowski – p. 96/324
MPI-2
The MPI-2.1 standard was just approved by the MPI Forum on September 4, 2008; it updates the MPI-2.0 standard from 1997. Important added features in MPI-2.x include
Parallel I/O: Critical for scalability of I/O-intensive problems.
One-sided communication: Essentially “put” and “get” operations that can greatly improve efficiency on some codes. Conceptually these are the same as directly accessing remote memory.
However, these are risky and can easily introduce race conditions.
Stout and Jablonowski – p. 97/324
One-Sided Communication
[Diagram: a processor's Put and Get operations directly access the memory of another processor.]
Stout and Jablonowski – p. 98/324
MPI Summary
The MPI standard includes
point-to-point message-passing
collective communications
group and communicator concepts
process topologies (e.g. graphs)
environmental management (e.g. timers, error handling)
process creation and management
one-sided communications
external interfaces
parallel I/O routines
profiling interface
Stout and Jablonowski – p. 99/324
PARALLELIZATION I
Real code is long and complex. How do we engineer the parallelization process?
Usually there is a (perhaps vague) performance goal, not, per se, a parallelization goal.
Stout and Jablonowski – p. 100/324
Overview of Approach
Incremental approach: tackle a bit of the problem at a time so that one can recover from mistakes and poor attempts.
Verify: Develop test cases and constantly check results.
Profile: to determine where the time is being spent. May be coupled with modeling of the code to determine where effort will yield the most reward.
Check-point/restart: Aids testing and debugging, since some problems only occur late in the program execution.
Stout and Jablonowski – p. 101/324
Serial Performance
Often profiling reveals serial performance problems — eliminating these may be critical to attaining the performance goals.
Doubling serial performance is far more useful than doubling the number of processors.
If possible, exploit parallel (or serial) libraries, since they are usually highly tuned for the target machine.
Stout and Jablonowski – p. 102/324
Incremental Parallelization
On shared-memory machines, one can often incrementally parallelize and increase efficiency. Portions not parallelized will slow the program but will at least be correct. This is a major advantage of shared memory over distributed memory.
Some benefits of this approach:
Smaller changes make it easier to locate mistakes.
It is easier to determine where efficiency is poor.
One should have test cases available and constantly verify correctness.
Stout and Jablonowski – p. 103/324
One continues incrementally until the desired speedup is attained or it has been determined that the original goal is impractical. This is a straightforward effort/reward tradeoff, but it is rarely carefully considered.
If performance is critical, then the final shared memory code is often very similar to distributed memory code.
Stout and Jablonowski – p. 104/324
Parallelization Process
[Flowchart: Set Goals → Analyze, profile → Prioritize Changes → Incrementally change → Verify correctness → Performance acceptable? If No, repeat the cycle; if Yes, the code is ready for use.]
Stout and Jablonowski – p. 105/324
Process for Distributed Memory
While more complicated, an incremental approach can also be utilized for distributed memory machines.
It is harder to get started, but the basic approaches are similar. The first things one needs to do are:
Do coarse-grained profiling to determine the time consumed in the different sections of the program.
Develop maps of the major data structures and where they are used.
The profiling is used to prioritize the areas that need to be parallelized.
Stout and Jablonowski – p. 106/324
Parallelization Steps
Once the parallelization plan is ready, start parallelizing sections of code and data structures.
Initially, all processors have the complete standard serial data structures (global data structures).
As code and data structures are parallelized (local data structures), develop serial-parallel & parallel-serial conversion routines (scaffolding).
Verify correctness on test cases by showing
serial–parallel–serial = serial
for global data structures.
Profile to see if the efficiency of this piece is acceptable. If not, then develop a better alternative.
Stout and Jablonowski – p. 107/324
Incremental DM Parallelization
[Diagram: the serial code is converted incrementally; parallelized sections of code, each wrapped by serial-parallel and parallel-serial conversion routines, gradually replace sections of the remaining serial code.]
Stout and Jablonowski – p. 108/324
It is useful to retain the serial-parallel scaffolding (normally turned off), to help maintain the correspondence between the serial and parallel codes as they evolve.
This is probably a complex, important program, since it is worth the parallelization effort. Therefore software engineering concerns, such as life-cycle maintenance, are very important.
Stout and Jablonowski – p. 109/324
LOAD-BALANCING I
Here we address the question of how one goes about subdividing the computational domain among the processors. We introduce the basic techniques that are applicable to most programs, with some more advanced techniques appearing later.
Stout and Jablonowski – p. 110/324
Unbalanced Load
[Figure: bar chart of the workload per processor for processors 0–7, with the average marked; the loads are unequal.]
Which processor is the most important for parallel performance?
Stout and Jablonowski – p. 111/324
Domain and Functional Decomposition
Domain decomposition: Partition a (perhaps conceptual) space. Different processors do similar work on different pieces (quilting bee, teaching assistants for discussion sections, etc.)
Functional decomposition: Different processors work on different types of tasks (workers on an assembly line, sub-contractors on a project, etc.)
Functional decomposition rarely scales to many processors, so we’ll concentrate on domain decomposition.
Stout and Jablonowski – p. 112/324
Dependency Analysis
There is a dependency between A and B if the value of B depends upon A. B cannot be computed before A.
Dependencies control parallelization options.
Stout and Jablonowski – p. 113/324
Computational Dependencies
[Figure: computational dependencies shown on a grid with space on one axis and time on the other.]
Stout and Jablonowski – p. 114/324
Space and Time
Almost always:
Time or time-like variables and operations (signals, non-commutative operations, etc.) cannot be parallelized.
Space or space-like variables and operations (names, objects, etc.) can be parallelized.
Some operations can have both time-like and space-like properties. E.g., ATM transactions are usually to independent accounts (space-like), but ones to the same account must be done in order (time-like).
Stout and Jablonowski – p. 115/324
Load-Balancing Variety
Many different types of load-balancing problems:
static or dynamic,
parameterized or data dependent,
homogeneous or inhomogeneous,
low or high dimensional,
graph oriented, geometric, lexicographic, etc.
Because of this diversity, we need many different approaches and tools.
Stout and Jablonowski – p. 116/324
Complicating Factors
The objects being computed may not have a simple dependency pattern among themselves, making communication load-balancing difficult to achieve.
Objects may not have uniform computational requirements, and it may not initially be clear which ones need more time.
If objects are repeatedly updated (such as the elements in the crash simulation), the computational load of an object may vary over iterations.
Objects may be created dynamically and in an unpredictable manner, complicating both computational and communication load balance.
Stout and Jablonowski – p. 117/324
Static Decompositions
Here we will consider only static decompositions of the work, with dynamic decompositions discussed later. A variety of basic techniques are available, each suitable for a different range of problems.
Often just evenly dividing space among the processors yields an acceptable load balance, with acceptable performance if communication is minimized. This approach works even if the objects have varying computational requirements, as long as there are enough objects so that the worst processor is likely to be close to the average (law of large numbers).
Stout and Jablonowski – p. 118/324
Which Matrix Decomposition is Best?
Suppose the work at each position depends only on the value there and on nearby ones, with equivalent work at each position.
[Figure: two ways to split a matrix among 16 processors: a 4 × 4 arrangement of square blocks, which minimizes the boundary, and 16 strips, which minimizes the number of neighbors.]
Stout and Jablonowski – p. 119/324
Matrix Decomposition Analysis
Computation is proportional to area, so both are load balanced.
Squares minimize the bytes communicated (parallelization overhead), so they are generally better.
However: Recall that there is significant overhead in starting a message, especially on clusters, so for smaller matrices one may need to concentrate on the number, not the size, of messages, i.e., use strips.
Stout and Jablonowski – p. 120/324
Local vs. Global Matrices
If the serial code has matrix A[0 : n−1], and there are p DM processors with ranks 0 . . . p−1, then
each processor has matrix A[0 : nlocal−1], where nlocal = n/p
A[i] on the processor with rank r corresponds to A[i + r ∗ nlocal] in the original array
if A[i+1] and A[i−1] are used in the calculation of A[i] (i ≠ 0, n−1), then one would use A[−1 : nlocal] to add ghost cells
Stout and Jablonowski – p. 121/324
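A small sketch of this indexing in C (illustrative; the slides describe it for a generic array A):

#include <stdlib.h>

/* Each of p processors owns nlocal = n/p elements of a global array A[0..n-1].
   The local array gets one ghost cell at each end, so local index i runs from
   -1 to nlocal: a[0..nlocal-1] are owned, a[-1] and a[nlocal] hold copies of
   the neighbors' border elements. */
typedef struct {
    int     rank;      /* this processor's rank */
    int     nlocal;    /* n / p (assume p divides n) */
    double *storage;   /* nlocal + 2 entries, including the two ghosts */
    double *a;         /* a = storage + 1, so a[-1] and a[nlocal] are valid */
} LocalArray;

static LocalArray make_local(int rank, int n, int p)
{
    LocalArray la;
    la.rank    = rank;
    la.nlocal  = n / p;
    la.storage = calloc((size_t) la.nlocal + 2, sizeof(double));
    la.a       = la.storage + 1;
    return la;
}

/* Global index of local element i: i + rank * nlocal. */
static int global_index(const LocalArray *la, int i)
{
    return i + la->rank * la->nlocal;
}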
Linear Rank vs. 2-D Indices
To map processor ranks 0..15 to rows 0..3 and columns 0..3:
[Figure: a 4 × 4 grid of processors, with ranks 0–3 in the first row, 4–7 in the second, 8–11 in the third, and 12–15 in the fourth.]
For processor rank i, row_i = ⌊i/√p⌋ and col_i = i − row_i ∗ √p
Right: (row_i, col_i + 1), rank i + 1
Left: (row_i, col_i − 1), rank i − 1
Up: (row_i − 1, col_i), rank (row_i − 1) ∗ √p + col_i
Down: (row_i + 1, col_i), rank (row_i + 1) ∗ √p + col_i
MPI “virtual topologies” can do this for you.
Stout and Jablonowski – p. 122/324
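The same arithmetic in C (a sketch; it assumes p is a perfect square, as on the slide):

#include <math.h>

typedef struct { int row, col; } Coords;

/* Map a linear rank to 2-D indices on a sqrt(p) x sqrt(p) grid, and back. */
static Coords rank_to_coords(int rank, int p)
{
    int side = (int) lround(sqrt((double) p));   /* assumes p is a perfect square */
    Coords c = { rank / side, rank % side };
    return c;
}

static int coords_to_rank(Coords c, int p)
{
    int side = (int) lround(sqrt((double) p));
    return c.row * side + c.col;
}

/* Neighbors of rank i (no boundary checks shown):
   right = i + 1, left = i - 1,
   up    = coords_to_rank((Coords){ c.row - 1, c.col }, p),
   down  = coords_to_rank((Coords){ c.row + 1, c.col }, p).
   In MPI, MPI_Cart_create and MPI_Cart_shift provide the same mapping. */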
Graph Decompositions
Very general graph decomposition techniques can be used when communication patterns are less regular.
Objects (calculations) are represented as vertices (with weights if the calculation requirements are uneven).
Communication is represented as edges (with weights if the communication requirements are uneven).
Goals:
1. assign vertices to processors to evenly distribute thenumber/weight of vertices, and
2. minimize and balance the number/weight of edgesbetween processors.
Stout and Jablonowski – p. 123/324
What is Best Decomposition?
[Figure: a graph whose vertices and edges carry weights between 1 and 6; the question is how best to partition it among processors.]
Stout and Jablonowski – p. 124/324
Graph Decomposition Tools
Unfortunately, optimal graph decomposition is NP-hard.
Fortunately, various heuristics work well, and high-quality decomposition tools are available, such as Metis.
To use a serial tool such as Metis, convert the data into the format it requires, run Metis to partition the graph vertices, then convert back to the format your program requires.
Scripts (Perl, Python, etc.) are useful to convert formats.
A parallel version, ParMetis, is also available.
http://www.cs.umn.edu/~karypis/metis/metis.html
Stout and Jablonowski – p. 125/324
Using Serial Decomposition Tool
[Diagram: Problem → Convert → graph as sparse matrix → Metis → partitioned graph → Convert → program input format → Parallel Program.]
Stout and Jablonowski – p. 126/324
Where Do Weights Come From?
If weights are static and objects of the same type have about the same requirements, and if the types are known in advance, then:
Sometimes all the same.
Sometimes easy to deduce a priori.
May use simple measurements on small test cases.
May use statistical curve fitting on sample problems.
If types aren’t known in advance, this won’t be useful.
Stout and Jablonowski – p. 127/324
Static Geometric Decompositions
When the objects have an underlying geometrical basis, such as the finite elements representing surfaces of car parts, stars in a galaxy, wires in a VLSI layout, or polygons representing census blocks in a geographical information system, then the geometry can often be exploited
if communication predominantly involves nearby objects.
Geometric decompositions can be based on k-D trees, quad- or oct-trees, ham sandwich theorems, space-filling curves, etc., and can incorporate weights.
Stout and Jablonowski – p. 128/324
Recursive Bisectioning
Magenta points require twice as much work as cyan ones.
Stout and Jablonowski – p. 129/324
Recursive Bisectioning
Split work evenly along the x-axis (weighted median).
Stout and Jablonowski – p. 130/324
Recursive Bisectioning cont.
Split each side along y-axis, using median on that side.
Stout and Jablonowski – p. 131/324
Recursive Bisectioning cont.
Now split along x-axis (or z-axis if data 3-dimensional).
Stout and Jablonowski – p. 132/324
Recursive Bisectioning cont.
Cycle through axes until # pieces = # processors.
Stout and Jablonowski – p. 133/324
Recursive Bisectioning cont.
One may decide to use only 1 or 2 dimensions to split along, similar to the strip partitioning for matrices.
Closely related to the k-D tree serial data structure.
Stout and Jablonowski – p. 134/324
Space-Filling Curves
The best general-purpose geometric load-balancing comes from space-filling curves.
The order in which points are visited by the space-filling curve determines how the geometric objects are grouped together to be assigned to the processors.
Stout and Jablonowski – p. 135/324
The Hilbert Space-Filling Curve
[Figure: the Hilbert space-filling curve on an 8 × 8 grid, visiting the cells in order 0–63.]
For an implementation, see the references.
Stout and Jablonowski – p. 136/324
Using A Space-Filling Curve
Letters represent work; boldface letters represent twice as much work.
[Figure: objects labeled A–Z scattered over a two-dimensional domain.]
Stout and Jablonowski – p. 137/324
Step 1: Determine Space-Filling Coordinates
[Figure: each object A–Z is assigned the Hilbert space-filling curve coordinate of its grid cell.]
Stout and Jablonowski – p. 138/324
Step 2: Sort by Space-Filling Coordinates
[Figure: the objects sorted by their space-filling curve coordinates, giving the order X Q W Z Y U T S V R K N M L E D C B F J O I H P G A.]
Stout and Jablonowski – p. 139/324
Step 3: Divide Work Evenly Based on Sorted Order
[Figure: the sorted order is divided into consecutive pieces of approximately equal total work, one piece per processor.]
Stout and Jablonowski – p. 140/324
Z-Ordering
Aka Morton or shuffled bit ordering. For 2-D, the point (x2x1x0, y2y1y0) is mapped to y2x2y1x1y0x0.
[Figure: the Z-ordering (Morton order) on an 8 × 8 grid, visiting the cells in order 0–63 along nested Z shapes.]
For 3-D, (xk...x1x0, yk...y1y0, zk...z1z0) → zkykxk...z1y1x1z0y0x0
Stout and Jablonowski – p. 141/324
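A small C sketch of the 2-D bit interleaving (illustrative, for 16-bit coordinates):

#include <stdint.h>
#include <stdio.h>

/* Interleave the bits of (x, y) into a Morton (Z-order) key:
   bit i of x goes to bit 2i, bit i of y goes to bit 2i+1,
   matching the y2x2y1x1y0x0 pattern above. */
static uint32_t morton2d(uint16_t x, uint16_t y)
{
    uint32_t key = 0;
    for (int i = 0; i < 16; i++) {
        key |= (uint32_t)((x >> i) & 1u) << (2 * i);
        key |= (uint32_t)((y >> i) & 1u) << (2 * i + 1);
    }
    return key;
}

int main(void)
{
    /* x = 3 (011), y = 5 (101): y2x2 y1x1 y0x0 = 10 01 11 = 39 */
    printf("morton2d(3, 5) = %u\n", morton2d(3, 5));
    return 0;
}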
Hilbert vs. Z
Both extend to arbitrary dimensions.
Both give regions with boundary (communication) within a constant factor of optimal.
Hilbert ordering assigns only 1 contiguous region to a processor; Z-ordering may assign 2.
Z is slightly easier to compute than Hilbert.
Hilbert can be used for the surface of a cube or sphere; Z doesn’t seem to be as useful.
In practice, there is little difference in performance.
Stout and Jablonowski – p. 142/324
High-Dimensional Data
For high dimensions, Hilbert ordering requires extensive memory to store the tables used to compute the index.
However, this is often not relevant, since
geometric approaches are not nearly as useful on high-dimensional data.
Stout and Jablonowski – p. 143/324
Shared Memory Parallelization
Parallel programming on shared memory (SM) machines has always been important in high performance computing.
All processors can access all the memory in the parallel system (access times can differ).
In the past: utilization of such platforms has never been straightforward for the programmer.
Vendor-specific solutions via directive-based compiler extensions dominated until the mid 90’s.
Also: data parallel extensions to Fortran 90 and High Performance Fortran (HPF), but these lacked efficiency.
Stout and Jablonowski – p. 144/324
Parallelization Techniques: OpenMP
Since 1997: OpenMP is the new industry standard for shared memory programming.
In 2008: The OpenMP Version 3.0 specification was released (new feature: task parallelism).
OpenMP is an Application Program Interface (API): it directs multi-threaded shared memory parallelism ⇒ thread-based parallelism.
Explicit (not automatic) programming model: the programmer has full control over the parallelization; the compiler interprets the parallel constructs.
Based on a combination of compiler directives, library routines and environment variables.
OpenMP uses the fork-join model of parallel execution.
Stout and Jablonowski – p. 145/324
OpenMP
OpenMP can be interpreted by most commercial Fortran and C/C++ compilers, supports all shared-memory architectures including Unix and Windows platforms, and hence
should be your programming system of choice for shared memory platforms.
OpenMP home page and recommended online tutorial:
http://www.openmp.org
http://www.llnl.gov/computing/tutorials/openMP/
Stout and Jablonowski – p. 146/324
Goals of OpenMP
Standardization: standard among all shared memory architectures and hardware platforms.
Lean: simple and limited set of compiler directives for shared memory machines. Often significant parallelism is obtained by using just 3-4 directives.
Ease of use: supports incremental parallelization of a serial program, unlike MPI which typically requires an all-or-nothing approach.
Portability: supports Fortran (77, 90, 95), C (C90, C99) and C++.
Stout and Jablonowski – p. 147/324
OpenMP: 3 Building Blocks
Compiler directives (embedded in user code) for
parallel regions (PARALLEL)
parallel loops (PARALLEL DO)
parallel sections (PARALLEL SECTIONS)
parallel tasks (TASK)
sections to be done by only one processor (SINGLE)
synchronization (BARRIER, CRITICAL, ATOMIC, locks, etc.)
data structures (PRIVATE, SHARED, REDUCTION)
Run-time library routines (called in the user code) like OMP_SET_NUM_THREADS, OMP_GET_NUM_THREADS, etc.
UNIX environment variables (set before program execution) like OMP_NUM_THREADS, etc.
Stout and Jablonowski – p. 148/324
OpenMP: The Fork-Join Model
Parallel execution is achieved by generating threads which are executed in parallel (multi-threaded parallelism):
[Diagram: a master thread runs serially, FORKs a team of threads at a parallel region, the threads JOIN at the end of the region, and the pattern repeats for the next parallel region.]
Stout and Jablonowski – p. 149/324
OpenMP: The Fork-Join Model
The master thread executes sequentially until the first parallel region is encountered.
FORK: The master thread creates a team of threads which are executed in parallel.
JOIN: When the team members complete the work, they synchronize and terminate. The master thread continues sequentially.
The number of threads is independent of the number of processors.
Quiz: What happens if
# threads or tasks > # processors
# threads or tasks < # processors
Stout and Jablonowski – p. 150/324
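A minimal C illustration of the fork-join model (a sketch, not code from the slides):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    printf("master thread, serial part\n");

    /* FORK: a team of threads executes the parallel region. */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();          /* 0 .. num_threads-1 */
        printf("hello from thread %d of %d\n", id, omp_get_num_threads());
    }   /* JOIN: implied barrier; only the master thread continues. */

    printf("master thread again, serial part\n");
    return 0;
}

The number of threads can be chosen, e.g., with the OMP_NUM_THREADS environment variable, independently of the number of processors.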
OpenMP: Work-sharing Constructs
DO/for loops: a type of “data parallelism”
SECTION: breaks work into independent sections that are executed concurrently by a thread (“functional parallelism”); units of work are statically defined at compile time
TASK: breaks work into independent tasks that are executed asynchronously in the form of dynamically generated units of work (“irregular parallelism”)
SINGLE: serializes a section of the code. Useful for sections of the code that are not threadsafe (I/O).
OpenMP recognizes compiler directives that start with
!$OMP (in Fortran)
#pragma omp (in C/C++)
Stout and Jablonowski – p. 151/324
OpenMP: Work-sharing Constructs
[Diagram: for each work-sharing construct (DO/for loop, SECTIONS, SINGLE) the master thread forks a team of threads, the team shares the work, and the threads join again.]
There is no barrier upon entry to these constructs, but an implied barrier (synchronization) at the end of each ⇒ the functionality of the OpenMP directive !$OMP BARRIER.
Stout and Jablonowski – p. 152/324
Parallel Loops (1)
⇒ in Fortran notation
!$OMP PARALLEL DO
DO i = 1, n
   a(i) = b(i) + c(i)
END DO
!$OMP END PARALLEL DO
Stout and Jablonowski – p. 153/324
Parallel Loops (2)
Each thread executes a part of the loop.
By default, the work is evenly and contiguously divided among the threads ⇒ e.g. with 2 threads:
thread 1 works on i = 1 . . . n/2
thread 2 works on i = (n/2 + 1) . . . n
The work (number of iterations) is statically assigned to the threads upon entry to the loop.
The number of iterations cannot be changed during the execution.
Implicit synchronization at the end, unless the “NOWAIT” clause is specified.
Highly efficient, low overhead.
Highly efficient, low overhead.
Stout and Jablonowski – p. 154/324
Parallel Sections (1)
⇒ in Fortran notation
!$OMP PARALLEL SECTIONS
!$OMP SECTION
DO i = 1, n
   a(i) = b(i) + c(i)
END DO
!$OMP SECTION
DO i = 1, k
   d(i) = e(i) + e(i-1)
END DO
!$OMP END PARALLEL SECTIONS
Stout and Jablonowski – p. 155/324
Parallel Sections (2)
The two independent sections can be executedconcurrently by two threads.
Units of work are statically defined at compile time.
Each parallel section is assigned to a specific thread, which executes the work from start to finish.
A thread cannot suspend the work.
Implicit synchronization unless the “NOWAIT” clause is specified.
Nested parallel sections are possible, but
can be costly due to the high overhead of parallel region creation,
are difficult to load balance, with possibly unneeded synchronization;
therefore: impractical.
Stout and Jablonowski – p. 156/324
Parallel Tasks (1)
Main change in OpenMP 3.0 (May 2008)
Allows one to parallelize irregular problems like
unbounded loops (e.g. while loops)
recursive algorithms
Unstructured parallelism
Dynamically generated units of work
A task can be executed by any thread in the team, in parallel with others
Execution can be immediate or deferred until later
Execution might be suspended and continued later by the same or a different thread
Stout and Jablonowski – p. 157/324
Parallel Tasks (2)
Example: Pointer chasing in C notation
#pragma omp parallel
{
   #pragma omp single
   {
      p = listhead;
      while (p) {
         /* create a task for each element of the list */
         #pragma omp task
         process(p);   /* process the list element p */
         p = next(p);
      }
   }
}
Stout and Jablonowski – p. 158/324
Parallel Tasks (3)
The SINGLE construct ensures that only one thread traverses the list.
A single thread encounters the task directive and invokes the independent tasks.
The “Task” construct gives more freedom for scheduling; it can replace loops with if statements that are not well load-balanced.
Parallel tasks can be nested within parallel loops or sections.
Stout and Jablonowski – p. 159/324
Parallel Loops and Scope of Variables
Parallel DO loops (“for” loops in C/C++) are often the most important parallel construct.
The iterations of a loop are shared across the team (threads).
A parallel DO construct can have different clauses like REDUCTION.
sum = 0.0
!$OMP PARALLEL DO REDUCTION(+:sum)
DO i = 1, n
   sum = sum + a(i)
END DO
!$OMP END PARALLEL DO
Stout and Jablonowski – p. 160/324
Parallel Loops and Load Balancing
Example of a parallel loop with dynamic load-balancing:
!$OMP PARALLEL DO PRIVATE(i,j), SHARED(x,n),
!$OMP& SCHEDULE(DYNAMIC,chunk)
DO i = 1, n
   DO j = 1, i
      x(i) = x(i) + j
   END DO
END DO
!$OMP END PARALLEL DO
Stout and Jablonowski – p. 161/324
Parallel Loops and Load Balancing
Iterations are divided into pieces of size chunk.
When a thread finishes a piece, it dynamically obtains the next set of iterations.
DYNAMIC scheduling improves the load balancing; default: STATIC.
Tradeoff: Load Balancing and Overhead
The larger the chunk, the lower the overhead.
The smaller the size (granularity), the better the dynamically scheduled load balancing.
Stout and Jablonowski – p. 162/324
New in OpenMP 3.0: Loop Collapsing
Loops can be collapsed via the clause COLLAPSE
!$OMP PARALLEL DO COLLAPSE(2)
DO k = 1, p
   DO j = 1, m
      DO i = 1, n
         x(i,j,k) = i*j + k
      END DO
   END DO
END DO
!$OMP END PARALLEL DO
Stout and Jablonowski – p. 163/324
Loop Collapsing
The iteration space from the two loops is collapsed into a single one
Good if
loops k and j do not depend on each other (no recursions)
execution order can be interchanged
loop limits p and m are small, #processors is large
Rules:
perfectly nested loops (j loop immediately follows k loop)
rectangular iteration space (m independent of p)
Stout and Jablonowski – p. 164/324
Quiz: Is there something wrong?
Assume: 4 parallel shared memory threads, all arrays and variables are initialized.
! start the parallel region
!$OMP PARALLEL PRIVATE(pid), SHARED(a,b,n)
! get the thread number (0..3)
pid = OMP_GET_THREAD_NUM()
! parallel loop
!$OMP DO PRIVATE(i)
DO i = 1, n
   A(pid) = A(pid) + B(i)   ! compute
END DO
!$OMP END DO
! end the parallel region
!$OMP END PARALLEL
Stout and Jablonowski – p. 165/324
False Sharing Example
Suppose you have P shared memory processors, with pid = 0 ... P-1
Each processor runs the Fortran code:
DO i = 1, n
   A(pid) = A(pid) + B(i)
END DO
No read nor write (load and store) conflicts, since no two processors read or write the same element, but:
Performance is horrible!
Stout and Jablonowski – p. 166/324
False Sharing Example
Reason:
Several consecutive elements of A are stored in the same cache line.
In each iteration, each processor gets an exclusive copy of the entire cache line to write to; all other processors must wait.
B is read-only, so sharing it is not a problem.
⇒ Can be avoided by declaring A(c,0:P-1), where c elements equal 1 cache line, and using A(1,pid).
False sharing is usually obvious once pointed out, but very easy to write in and overlook. Avoid!
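A minimal C sketch of the same padding idea (the Fortran A(c,0:P-1) above); the 64-byte cache line size, thread count, and variable names are illustrative assumptions, not part of the original example:

#include <omp.h>
#include <stdio.h>

#define NTHREADS   8
#define CACHE_LINE 64                      /* assumed cache line size in bytes */
#define PAD (CACHE_LINE / sizeof(double))  /* doubles per cache line */

static double partial[NTHREADS][PAD];      /* one padded slot (cache line) per thread */
static double b[100000];

int main(void)
{
    int n = 100000;
    #pragma omp parallel num_threads(NTHREADS)
    {
        int pid = omp_get_thread_num();
        partial[pid][0] = 0.0;
        #pragma omp for
        for (int i = 0; i < n; i++)
            partial[pid][0] += b[i];       /* each thread writes only its own cache line */
    }
    double sum = 0.0;
    for (int t = 0; t < NTHREADS; t++)
        sum += partial[t][0];
    printf("sum = %g\n", sum);
    return 0;
}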
Stout and Jablonowski – p. 167/324
False Sharing Example
[Figure: with the 1D layout A(0:P-1) all elements share one cache line, causing cache conflicts; with the 2D layout A(c,0:P-1), where c elements fill one cache line, each processor writes to a different cache line (not shared).]
Stout and Jablonowski – p. 168/324
Race Conditions
In a shared memory system, one common cause of errors is when a processor reads a value from a memory location that has not yet been updated.
This is a race condition, where correctness depends on which processor performed its action first.
Often hard to debug because the debugger often runs the program in a serialized, deterministic ordering.
To ensure that "readers" do not get ahead of "writers", process synchronization is needed.
DM systems: messages are often used to synchronize, with readers blocking until the message arrives.
Shared memory systems: barriers, software semaphores, locks or other schemes are used.
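A small, hedged C/OpenMP sketch of one such scheme: the implied barrier at the end of the single construct keeps the "readers" behind the "writer". The function and variable names are hypothetical:

#include <omp.h>
#include <stdio.h>

static double shared_val = 0.0;
static double results[64];                    /* assumes at most 64 threads */

static double expensive_compute(void) { return 42.0; }   /* stand-in for the writer's work */

int main(void)
{
    #pragma omp parallel shared(shared_val)
    {
        #pragma omp single
        shared_val = expensive_compute();     /* one "writer" updates the value */

        /* implied barrier at the end of single: no thread reads too early */

        int tid = omp_get_thread_num();
        results[tid] = shared_val * tid;      /* now every "reader" sees the update */
    }
    printf("%g\n", results[0]);
    return 0;
}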
Stout and Jablonowski – p. 169/324
Race Condition Example
Two PARALLEL SECTIONS :
!$OMP PARALLEL SECTIONS
!$OMP SECTION
A = B + C
!$OMP SECTION
B = A + C
!$OMP END PARALLEL SECTIONS
Unpredictable results since the execution order matters.
Program will not fail: Wrong answers without a warning signal!
Stout and Jablonowski – p. 170/324
OpenMP: Traps
OpenMP is a great way of writing fast executing code, and your gateway to special painful errors.
OpenMP threads communicate by sharing variables.
Variable Scoping: Most difficult part of shared memory parallelization
Which variables are shared
Which variables are private
If using libraries: Use the threadsafe library versions.
Avoid sequential I/O (especially when using a single file) in a parallel region: Unpredictable order.
Stout and Jablonowski – p. 171/324
OpenMP: Traps
Common problems are:
False sharing: Two or more processors access different variables that are located in the same cache line. At least one of the accesses is a "write", which invalidates the entire cache line.
Race condition: The program's result changes when threads are scheduled differently.
Deadlock: Threads lock up waiting for a locked resource that will never become available.
Stout and Jablonowski – p. 172/324
Something to think about over the break
Question: How would you distribute the work in a climate model?
[Figure: latitude-longitude grid of the globe, spanning from the South Pole over the Equator to the North Pole.]
Stout and Jablonowski – p. 173/324
HYBRID COMPUTING
Many of today's most powerful computers employ both shared memory (SM) and distributed memory (DM) architectures.
These machines are so-called hybrid computers.
The corresponding hybrid programming model is a combination of shared and distributed memory programming (e.g. OpenMP and MPI).
Today: hybrid architectures are dominant at the high end of computing.
In the future: the hybrid memory architecture is likely to prevail despite popular DM machines like IBM's "Blue Gene".
Stout and Jablonowski – p. 174/324
Memory Systems: Distributed Memory
All memory is associated with processors.
To retrieve information from another processor's memory a message must be sent over the network.
Advantages:
Memory is scalable with the number of processors
Each processor has rapid access to its own memory without interference or cache coherency problems
Cost effective: can use commodity parts
Disadvantages:
Programmer is responsible for many of the details of the communication
May be difficult to map the data structure
Non-uniform memory access (NUMA)
Stout and Jablonowski – p. 175/324
Memory Systems: Shared Memory
Global memory space, accessible by all processors
Memory space may be all real or may be virtual
Consistency maintained by hardware, software or user
Advantages:
Global address space is user-friendly, the algorithm may use global data structures efficiently
Data sharing between tasks is fast
Disadvantages:
Possible lack of scalability between memory and CPUs. Adding more CPUs increases traffic on the shared memory - CPU path
User is responsible for correct synchronization
Stout and Jablonowski – p. 176/324
Hybrid Memory Architecture
The shared memory component is usually a cache coherent (CC) SMP node with either uniform (CC-UMA) or non-uniform memory access (CC-NUMA)
CC: If one processor updates a variable in shared memory, all the other processors on the SMP node know about the update.
The distributed memory component is a cluster of multiple SMP nodes.
SMP nodes can only access their own memory, not the memory on other SMPs.
Network communication is required to move data from one SMP node to another.
Stout and Jablonowski – p. 177/324
Hybrid Memory Architecture
[Figure: four SMP nodes, each containing several CPUs and a local memory, connected by a network.]
CPU: single-core or multi-core technology possible
Multi-core (dual- or quad-core) chips common, even in laptops
Typical: Several multi-core chips form an SMP node.
Stout and Jablonowski – p. 178/324
Multi-Cores and Many-Cores
General trend in processor development: multi-core to many-core with tens or even hundreds of cores
Advantages:
Cost advantage.
Proximity of multiple CPU cores on the same die, signal travels less, high CC clock rate.
Disadvantages:
More difficult to manage thermally than a lower-density single-chip design.
Needs software (e.g. OS, commercial) support.
Multi-cores share system bus and memory bandwidth: limits performance gain. E.g. if the single-core version is bandwidth-limited, the dual core is only 30%-70% more efficient.
Stout and Jablonowski – p. 179/324
Dual Level Parallelism
Often: Applications have two natural levels of parallelism. Take advantage of it and exploit the shared memory parallelism by using OpenMP on an SMP node. Why?
MPI performance degrades when
domains become too small
message latency dominates computation
parallelism is exhausted
OpenMP
typically has lower latency
can maintain speedup at finer granularity
Drawback:
Programmer must know MPI and OpenMP
Code might be harder to debug, analyze and maintain
Stout and Jablonowski – p. 180/324
Hybrid Programming Model
Combination of distributed and shared memory programming models, e.g.:
MPI and OpenMP
MPI and High Performance Fortran (HPF)
MPI and POSIX Threads
Most important: MPI and OpenMP
Many MPI processes
Each MPI process is assigned to a different SMP node
Explicit message passing between the nodes
Shared memory parallelization within an SMP node
Each MPI process is therefore a multithreaded OpenMP process
Can give better scalability than pure MPI or OpenMP
Stout and Jablonowski – p. 181/324
Hybrid Programming Strategy
Decompose the computational domain
Most often: Domain decomposition
Alternatively: Functional decomposition
Distribute the partitions among the SMP nodes (coarse grain parallelism).
Use MPI to communicate the ghost regions or interfaces of each partition.
Add OpenMP for loop-level parallelism within a partition on the SMP node (fine grain parallelism).
Let one OpenMP thread speak for all.
Stout and Jablonowski – p. 182/324
Hybrid Programming Strategy
Recommended:
Limit MPI communication to the serial OpenMP part (outside a parallel region)
Let the master thread (serial OpenMP part) communicate via MPI messages, as in the sketch below.
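A minimal hybrid sketch along these lines: MPI calls stay in the serial (master-thread) part, while OpenMP parallelizes the node-local loop. The array size and the omitted ghost-region exchange are placeholders for illustration only:

#include <mpi.h>
#include <omp.h>

#define N 1000000

static double a[N], b[N];

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* ... MPI exchange of ghost regions would go here, outside any parallel region ... */

    /* fine grain parallelism within the SMP node */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b[i];

    /* collect a global result with MPI, again on the master thread only */
    double local = a[0], global;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}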
Stout and Jablonowski – p. 183/324
VECTOR PARALLEL COMPUTING
Principles behind vector parallel computing
Vector pipeline
Pipelining and modern scalar processors
Characteristics of vector computers
Load Balancing and Grid Partitioning Strategies
Graphics Processing Units (GPUs)
Stout and Jablonowski – p. 184/324
Vector Computers: Trend
What are the trends in high performance computing ?
Stout and Jablonowski – p. 185/324
Vector Computers: Trend
Worldwide: vector computers became less and less common over the last 15 years
In 2008: NEC and Cray remain in this market
Powerful vector architecture:
41 TFlop/s NEC SX-6 system (peak performance): Earth Simulator (Japan, #49 TOP500 list in 6/2008, #20 in 6/2007, #1 from 2002-2004)
NEC SX-9, theoretical peak performance 839 TFlop/s
Cray XT5h (newest installation in Edinburgh), hybrid architecture with X2 vector processing node
Extreme sustained performance: the Earth Simulator system reaches approx. 90% of its peak performance (Linpack benchmark)
Stout and Jablonowski – p. 186/324
Vector Processing - Pipelining Principle
Principle: Split an operation into independent parts & execute them concurrently in specialized pipelines
Example: Add pipeline
DO I = 1, 1000
   C(I) = A(I) + B(I)
ENDDO
Independent steps:
compare and normalize exponents
add mantissae
normalize result
error handling (overflow/underflow)
Stout and Jablonowski – p. 187/324
Vector Pipelines: Example (cont.)
[Figure: pipeline timing diagram in which operands 1-5 stream through the stages "compare exponents", "add mantissae", "normalize results" and "check errors".]
Two phases:
Startup phase (fill the pipeline)
Streaming phase (1 result per clock cycle)
Stout and Jablonowski – p. 188/324
Vector Processing - Principles
SIMD principle: One instruction works on a data stream (vector).
Vector: A vector consists of
data that lie consecutively in memory (ideal case)
data with constant stride
data with random access (gather & scatter operations)
Pitfall: Non-consecutive memory accesses can lead to memory bank conflicts and performance losses.
Stout and Jablonowski – p. 189/324
Principles (cont.)
Pipelining: The functional units are divided into independent segments which work simultaneously.
Add pipeline
Multiply pipeline
Multifunctional pipeline, e.g. multiply and add
Logic pipeline
Load/Store pipeline
Instruction pipeline
Stout and Jablonowski – p. 190/324
Pipelines and Modern Scalar Processors
The pipelining principle: basis for all vector machines and GPUs.
But pipelines are also used in modern scalar processors ⇒ speed up execution
Examples:
IBM Power6 CPU: Floating point units (FPU) which can issue a combined multiply/add
a = b*c + c
Multi-functional hardware unit
In addition: data prefetch capabilities ("load pipeline")
Stout and Jablonowski – p. 191/324
Vector processing - Hardware differences
Scalar:
[Figure: scalar CPU connected to memory, exchanging scalar data and addresses.]
Stout and Jablonowski – p. 192/324
Vector processing - Hardware differences
Vector:
[Figure: vector CPU connected to interleaved memory banks 0 ... n-1 (Bank = mod(Addr., n)), exchanging vector and scalar data and addresses.]
Stout and Jablonowski – p. 193/324
Vector Processing - Features
The new hardware/software features are:
Vector unit: “co-processor” to scalar unit
Pipeline sets
Vector registers that provide data streams
Interleaving memory banks: quick memory access
(Often) no data cache for vector unit
Software & hardware interface: vector instructions
Vectorizing compiler
"Break Even Point" is hardware dependent (vector length that lets the vector unit outperform the scalar unit)
Stout and Jablonowski – p. 194/324
Vector Processing - Features
The performance of the vector unit depends on the vector length (number of operations):
[Figure: performance as a function of the number of operations n; the curve rises with n and saturates, with n1/2 marking the vector length at which half of the asymptotic performance is reached.]
In general: long vectors boost the performance
Startup time becomes negligible with increasing n
Stout and Jablonowski – p. 195/324
Load-Balancing & Grid Partitioning
Left: Fragmented 2D grid partitioning good for load-balancing, but short vectors
Right: good vectorization (long vectors), but possibly bad load-balancing properties
Stout and Jablonowski – p. 196/324
Load-Balancing & Grid Partitioning
The more processors run the simulation, the smaller the partitions and the smaller the vector length on each processor.
From a computational standpoint: the partitioning strategy on the left is well load-balanced (e.g. day/night sides in a weather model have different workloads and are well-distributed).
From a numerical performance standpoint: the distribution on the right is more efficient (longer vectors), but suffers from load imbalances.
⇒ In case of uneven workloads a balance must be found between long vectors and a fragmented load balancing strategy.
Stout and Jablonowski – p. 197/324
Parallel Vector Computing
Parallel vector computers are powerful for scientific applications:
sustained performance can reach more than 30% of the peak performance
compare: on MPP machines approx. 10-20% of the peak performance is reached (optimistic)
Single processor performance on vector machines is a multiple of any scalar processor.
Computations need a smaller number of parallel vector processors
Advantageous if the application does not scale well to a large number of parallel CPUs
Stout and Jablonowski – p. 198/324
Parallel Vector Computing (cont.)
Parallel vector machines become most effective for large applications that require identical (arithmetic) instructions on streams of data.
The vector performance strongly depends on
Vector length: The longer, the more effective!
Data access: Consecutive data access outperforms indirect addressing and data with constant stride.
Number of operations: The more arithmetic operations can be performed at once, the more effective the vector unit (enables chaining).
Stout and Jablonowski – p. 199/324
Graphics Processing Units (GPUs)
Newest trend in high-performance computing.
Traditionally: the GPU is a dedicated graphics rendering device for a personal computer, workstation or game console.
GPUs have a parallel many-core architecture, each core capable of running thousands of threads simultaneously; they exploit SIMD fine-grain parallelism.
Highly parallel structure makes them more effective than general-purpose CPUs for a range of complex (highly specialized) algorithms.
Stout and Jablonowski – p. 200/324
Graphics Processing Units
Trend: Highly diverse computing platforms can include multi-cores, SMP nodes, graphics accelerators or classical vector units as co-processors for both thread-based and process-based parallelism.
GPUs are cheap: commodity co-processors produced in the millions.
Very fast, the first 1 TFlop/s GPU was out in February 2008
#1 computer on the Top 500 list: "Roadrunner" utilizes Cell processors as accelerators: IBM's Cell processor was originally designed for the Sony Playstation 3.
Stout and Jablonowski – p. 201/324
Stout and Jablonowski – p. 202/324
GPUs — Future?
Extremely difficult to use the hardware effectively.
For example: NVIDIA's GeForce GPU series is programmed in CUDA (Compute Unified Device Architecture): a compiler and set of development tools (variation of C).
Big question: What is the lifetime of these systems? Is it worth investing into user software?
Need robust hardware: Error trapping, IEEE compliance, hardware performance counters, circuit support for synchronizations.
Need robust compilers and programming standards.
Will it attract new sources of talent to supercomputing?
Stout and Jablonowski – p. 203/324
PARALLELIZATION II
Here we examine some of the more complicated aspects of successfully parallelizing large programs.
Stout and Jablonowski – p. 204/324
Problems Verifying Correctness
Proving parallel and serial programs equivalent is typically only possible if the parallelization is automated (such as a parallelizing compiler).
Thus usually resort to testing on selected inputs.
Sensitivity & efficiency at discovering errors can be magnified by examining intermediate results, rather than just final results.
Stout and Jablonowski – p. 205/324
However, problems remain:
Coverage: Need to test all program options.
Time: Some conditions only appear after a significant amount of computation.
Detection: Often a simple "diff" won't work; hard to differentiate between errors and roundoff caused by a changed order of arithmetic operations. Some users uncomfortable with slight machine variations.
Stout and Jablonowski – p. 206/324
Some Solutions
Coverage: Typically requires coordination with the application expert, and careful analysis. There are tools to check coverage.
Time: Checkpoint/restart can help. Also very useful for long-term maintenance and for production runs so that work is not lost if the system fails during a long run.
Detection: Use of IEEE arithmetic helps cross-platform comparisons. Also, by being careful one can ensure that the parallel and serial programs perform all calculations (such as summations) in the same order, but usually this lowers efficiency.
Stout and Jablonowski – p. 207/324
Performance Problems
Detailed profiling of the crash code showed that there were many places where efficiency was unacceptable.
Often cache utilization was very poor.
Load balance difficult due to heterogeneous elements with time-varying requirements.
Contact adds dynamic computational and communication imbalance.
Some of the collective communication routines were too slow.
I/O was substantial, and was often inefficient.
Stout and Jablonowski – p. 208/324
Profiling
Profiling proceeded in stages, identifying where efficiency was too low.
For each targeted section, profiled uniprocessor performance, such as cache misses.
Also profiled load imbalance and communication overhead, proceeding from smaller systems to larger ones (when needed).
Incremental approaches kept the amount of data collected at manageable levels.
Unfortunately, when we were doing this there were no standard tools, we had to build several. Situation much better now, discussed later.
Stout and Jablonowski – p. 209/324
Utilizing the Memory Hierarchy
Effective use of cache and locality often critical for achieving high performance.
Often uniprocessor performance can be doubled by restructuring data structures and computations to exploit cache, as in the blocking sketch below.
Unfortunately, many data structures and algorithms use pointers and indirect addressing, diminishing the ability of the compiler to optimize cache usage.
Later we'll describe a data structure (adaptive blocks) that addressed this.
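A small C sketch of one common restructuring, loop blocking, assuming a hypothetical tile size chosen to fit in cache; it illustrates the idea only and is not the crash code's actual data structure:

#define N  2048
#define BS 64     /* assumed tile size; tune so a BS x BS tile fits in cache */

static double A[N][N], B[N][N];

/* Transpose B into A in BS x BS tiles so both arrays are touched with good locality. */
void blocked_transpose(void)
{
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int i = ii; i < ii + BS; i++)
                for (int j = jj; j < jj + BS; j++)
                    A[i][j] = B[j][i];
}

int main(void)
{
    blocked_transpose();
    return 0;
}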
Stout and Jablonowski – p. 210/324
Cache Misses
Many programs have excessive loads and stores, causing cache misses which slow the program. Can often be reduced by rearranging the code and/or data structure.
For example, in Fortran
do i=1,n                     do j=1,n
   do j=1,n          vs         do i=1,n
      A(i,j)=A(i,j)+1              A(i,j)=A(i,j)+1
   enddo                        enddo
enddo                        enddo
For large arrays, which is faster, and why?
Stout and Jablonowski – p. 211/324
Utilizing the Compiler
For a well-structured program it should be possible for the compiler to generate good code — optimizing cache utilization, reducing instruction counts, etc. However, extensive optimization is not the default. Thus
Turn on appropriate compiler optimization options.
Usually the "O" option is important, but often others are needed as well. These affect data placement as well as code generation.
May need a guru to get the best combination of options for your program+machine combination.
Stout and Jablonowski – p. 212/324
LOAD-BALANCING REVISITED
We'll continue the discussion of load-balancing, looking at some more complicated problems.
Stout and Jablonowski – p. 213/324
Loop Dependencies
Recall that if the value of variable B depends upon the value of variable A, then there is a dependency between A and B.
Loops often introduce real, or apparent, dependencies.
For example,
do i=1,n
   V[i] = V[i] - 2*V[i-1]
enddo
The loop cannot be vectorized nor parallelized, because each value depends upon the value from the previous iteration.
Stout and Jablonowski – p. 214/324
To parallelize
do i=1,n
   V[i] = V[i] - 2*V[i+1]
enddo
need to copy V and use the copy to compute new values.
W = V
do i=1,n
   V[i] = W[i] - 2*W[i+1]
enddo
Stout and Jablonowski – p. 215/324
To parallelize
do i=1,n
   V[J[i]] = i
enddo
need to know if J is 1-1.
Some automatic parallelizers can handle the previous loop, but none can do this one without programmer assistance.
Stout and Jablonowski – p. 216/324
Time Troubles
Parallelization problems of time-like variables include:
Partial Differential Equations: Time is explicit.
Divide and Conquer: "Size" is similar to time. Subproblems may not be known in advance, and need to be generated in order.
Branch and Bound: Branching control is often serialized.
Discrete Event Simulation: Time is usually explicit, may be incremented adaptively, and subproblems are often not known in advance.
Depth-First Search: Search decisions made sequentially. Theoretical computer science: some versions of DFS are P-complete.
Stout and Jablonowski – p. 217/324
Not Everything is As Bad As It Seems
Some things look serial but can be easily parallelized.
Reduction
x ← 0
do i ← 0, n-1
   x ← x + a[i]
enddo
Scan or Parallel Prefix
y[0] ← a[0]
do i ← 1, n-1
   y[i] ← y[i-1] + a[i]
enddo
Stout and Jablonowski – p. 218/324
Parallelized Reduction Operations
Reduction and scan operations are extremely common.
They are recognized by parallelizing compilers and implemented in MPI and OpenMP.
They can be parallelized by using associativity of the combining operator (+ in this case), i.e.,
a + (b + c) = (a + b) + c
In some situations one also uses commutativity, i.e., a + b = b + a
Stout and Jablonowski – p. 219/324
Calculation Tree
[Figure: calculation tree for a[0] ... a[7]: four + operations combine pairs, two more combine the partial sums, and a final + produces the total.]
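A short C sketch of the combine pattern in the tree: pairwise sums over log2(n) rounds, relying only on associativity of +. The function name and the power-of-two length are assumptions made for brevity; the additions within each round are independent and could run in parallel:

#include <stdio.h>

/* Sums a[0..n-1] in place following the calculation tree; n assumed a power of two. */
double tree_reduce(double *a, int n)
{
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];          /* independent pairwise additions */
    return a[0];
}

int main(void)
{
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("%g\n", tree_reduce(a, 8));      /* prints 36 */
    return 0;
}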
Stout and Jablonowski – p. 220/324
Static Load Imbalance — Correlation
Suppose have digital image, need to determine types of vegetation on the island. Easy load-balance:
[Figure: image divided into a 4x4 grid of equal blocks assigned to processors 0 ... 15.]
Stout and Jablonowski – p. 221/324
However ...
If a pixel is water can quickly dismiss it, otherwise need to carefully analyze the pixel and neighbors.
Drat! We know the weights, but don't know where the easy or hard pixels are until we've started processing the image.
Especially problematic because large regions will be of one type or the other. Thus some processors will take much longer than others.
Stout and Jablonowski – p. 222/324
Scattered Decomposition
Used when there is a structured domain space (e.g., an image) and the processing requirements are clustered, such as modeling a crash or processing an image with only a few items of interest.
Suppose there are P processors. Cover the problem domain with non-overlapping copies of a grid of size P and assign each processor a cell in each of the grids.
Stout and Jablonowski – p. 223/324
Scattered Work
[Figure: the domain is covered by four non-overlapping 4x4 grids; each grid contains cells 0 ... 15, so every processor owns one scattered cell in each grid.]
Stout and Jablonowski – p. 224/324
How Much Scattering?
More pieces ⇒
⇓ load imbalance, i.e., ⇓ calculation time
⇑ overhead and/or communication time
Deciding a good tradeoff may require some timing measurements.
However, if nearby objects have uncorrelated computational requirements then this method is no better than standard decomposition, and adds overhead.
Stout and Jablonowski – p. 225/324
Overdecomposition
Scattered decomposition and its close relatives striping and round robin allocation are examples of a general principle:
Overdecomposition: break the task into more pieces than processors, assign many pieces to each processor.
Overdecomposition underlies several load-balancing and parallel computing paradigms.
However, there can be difficulties when synchronization is involved.
Stout and Jablonowski – p. 226/324
The (Teaching) Value of Coins
Task times are random variables, where the time is generated by flipping a coin until a head appears.
Your task times:
Class task times:
Your total:
Class total:
Slowest person’s total:
Stout and Jablonowski – p. 227/324
Synchronization and Imbalance
Suppose have p processors and n ≥ p tasks. Suppose tasks take time i with probability 2^(-i), and there is no way to tell in advance how long a task will take.
If each processor does 1 task and then waits for all processors to complete before going on to the next, the efficiency is low. In fact, the expected time per round grows as the log of the number of processors.
To improve efficiency, each processor needs to complete several tasks before synchronizing.
Stout and Jablonowski – p. 228/324
Geometric Task Times
No.     Efficiency          Tasks/Proc. to achieve
Proc.   1 Task per Proc.    Efficiency 0.8    0.9
   4       0.57065               10            46
  16       0.37193               30           137
  64       0.27233               53           243
 256       0.21423               78           355
1024       0.17647              103           468
Stout and Jablonowski – p. 229/324
Another Example
Tasks: 1 time unit with prob 0.9, 10 units with prob. 0.1
No.     Efficiency          Tasks/Proc. to achieve
Proc.   1 Task per Proc.    Efficiency 0.8    0.9
   4       0.46397               36           179
  16       0.22803              112           536
  64       0.19020              199           949
 256       0.19000              291          1384
1024       0.19000              385          1824
Stout and Jablonowski – p. 230/324
Note that one can keep the efficiency high by assigning many tasks per processor before synchronizing, but the number required grows with the number of processors.
Later we’ll see a technique to improve this situation.
Stout and Jablonowski – p. 231/324
Dynamic Data-Driven
For many data dependent problems dynamic versions also occur, such as
For PDEs an adaptive grid can be used instead of a fixed grid, allowing one to focus computations on regions of interest.
A simulation may track objects through a region.
Computational requirements of objects may change over time.
In such situations, some processors may become overloaded.
Stout and Jablonowski – p. 232/324
Must balance load and need to take locality of communication into account. Some options:
Locally adjust partitioning, such as moving a small region on the boundary of an overloaded processor to the processor containing the neighboring region.
Use a parallel rebalancing algorithm that takes current location into account (not standard).
Rerun the static load-balancing algorithm and redistribute work (ignores locality, but easier)
Warning: Need more complex data structures which can move pieces and keep track of neighbors, etc. These are difficult to program and debug.
Stout and Jablonowski – p. 233/324
Dynamic Graph Decomposition
One could rerun Metis at periodic intervals, or periodically measure some metric to determine if processor loads are too uneven, and if so then call Metis.
However, it is more efficient to use the ParMetis package, which runs in parallel.
Stout and Jablonowski – p. 234/324
Example: Dynamic Geometry
Adaptive blocks, useful for adaptive mesh refinement (AMR), dynamic geometric modeling. Grids broken into blocks of fixed extents; when needed, blocks are refined into children with the same extents. [Stout 1997, MacNeice et al. 2000]
[Figure: a block refined into child blocks of the same extents (refine), and the reverse operation (coarsen).]
Stout and Jablonowski – p. 235/324
Adaptive Block Properties
Whenever refine/coarsen occurs, must adjust pointers on all neighbors, no matter what processor they are on.
Using blocks, instead of cells, reduces the number of changes.
Same work per block, good work/communication ratio, so often just balancing blocks per processor suffices. If communication is excessive, use a space-filling curve.
In either case, rebalancing requires only simple collective communication operations to decide where blocks go.
Stout and Jablonowski – p. 236/324
Load-balancing Strategies
Example: Tracer transport problems with adaptive mesh refinement (AMR) techniques
Simple load-balancing algorithm:
Equal workload regardless of the location of the data
Advanced load-balancing algorithms:
Load-balancing with METIS
Load-balancing with a Space Filling Curve (SFC)
⇒ In the examples:
Each color represents a processor.
The amount of work in each box is the same.
Stout and Jablonowski – p. 237/324
Simple Load-balancing Strategy
[Figure/Movie: latitude-longitude plot (0-360 degrees longitude, 90S-90N latitude) showing the block-to-processor assignment of the simple load-balancing strategy.]
Stout and Jablonowski – p. 238/324
Simple Load-balancing Strategy cont.
Data distribution at model day 3:
[Figure: latitude-longitude plot of the data distribution across processors at model day 3.]
Stout and Jablonowski – p. 239/324
Simple Load-balancing Strategy cont.
Data distribution at model day 12:
[Figure: latitude-longitude plot of the data distribution across processors at model day 12.]
Stout and Jablonowski – p. 240/324
Dynamic Load-balancing with METIS
Movie
Courtesy of Dr. Joern Behrens, Alfred-Wegener-Institute, Bremerhaven, Germany
Stout and Jablonowski – p. 241/324
Dynamic Load-balancing with SFC
Movie
Courtesy of Dr. Joern Behrens, Alfred-Wegener-Institute, Bremerhaven, Germany
Stout and Jablonowski – p. 242/324
Comparison of Strategies
Relative behavior similar to the static load-balancing behavior. Very important that rebalance operations have low overhead since they will be done often.
Easiest strategy — just balance work/processor
might be sufficient if the application is dominated by computation, but not if communication is important
Load-balancing with METIS or ParMETIS
good load-balancing, decent communication reduction, applicable to many problems
Load-balancing with Space Filling Curves
for geometric problems usually the best choice
Stout and Jablonowski – p. 243/324
Dynamic, Data Driven, Min. Comm.
Sometimes work is created on the fly with little advance knowledge of tasks.
E.g., branch-and-bound generates dynamic partial solution trees where subproblem communication consists of maintaining a current best solution and seeing if a subproblem is already solved.
In such situations can maintain a queue of tasks (objects, subproblems) and assign to processors as they finish previous tasks (e.g., overdecomposition).
Stout and Jablonowski – p. 244/324
Example: Work Preassigned
Each processor is assigned 4 tasks.
Processor   Task Label/Time         Total
    1       a/5  b/1  c/1  d/4       11
    2       e/1  f/4  g/2  h/1        8
    3       i/2  j/1  k/5  l/1        9
    4       m/1  n/3  o/1  p/1        6
    5       q/1  r/1  s/2  t/2        6
    6       u/3  v/4  w/2  x/3       12
                                Max  12
Time required: 12.
Stout and Jablonowski – p. 245/324
Manager/Worker (Master/Slave) (prof/grad student)
[Figure: a manager holding the task queue assigns tasks to several workers; when a worker signals "task done" it requests another.]
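A minimal manager/worker sketch in MPI, under simplifying assumptions (one integer encodes a task, a message tag of 1 means stop, and there are at least as many tasks as workers); a real code would send task descriptions and results of meaningful size:

#include <mpi.h>

#define NTASKS 100      /* assumed >= number of workers */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                                   /* manager */
        int next = 0, result;
        MPI_Status st;
        for (int w = 1; w < size; w++) {               /* seed each worker with one task */
            MPI_Send(&next, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
            next++;
        }
        for (int done = 0; done < NTASKS; done++) {    /* one result per task */
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            int tag = (next < NTASKS) ? 0 : 1;         /* 0 = more work, 1 = stop */
            MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, tag, MPI_COMM_WORLD);
            if (tag == 0) next++;
        }
    } else {                                           /* worker */
        int task, result;
        MPI_Status st;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == 1) break;                /* stop signal */
            result = task * task;                      /* stand-in for the real work */
            MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}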
Stout and Jablonowski – p. 246/324
Work Assigned via Queue
Assign tasks a, b, c, ... to processors as the processor becomes available:
Processor   Time / task assigned
            1  2  3  4  5  6  7  8  9  10
    1       a  a  a  a  a  r  v  v  v  v
    2       b  g  g  k  k  k  k  k
    3       c  h  j  l  n  n  n  w  w
    4       d  d  d  d  o  s  s  x  x  x
    5       e  i  i  m  p  t  t
    6       f  f  f  f  q  u  u  u
Time: 10. Adaptive allocation can improve performance.
Stout and Jablonowski – p. 247/324
Work Assigned via Ordered Queue
Sort in decreasing order, assign to processors as they become available:
a k d f v n u x g i s t w b c e h j l m o p q r
Processor   Time / task assigned
            1  2  3  4  5  6  7  8  9
    1       a  a  a  a  a  s  s  e  o
    2       k  k  k  k  k  t  t  h  p
    3       d  d  d  d  x  x  x  j  q
    4       f  f  f  f  g  g  w  w  r
    5       v  v  v  v  i  i  b  l
    6       n  n  n  u  u  u  c  m
Time: 9. The more you know, the better you can do. Unfortunately, rarely have this information.
Stout and Jablonowski – p. 248/324
Queueing Costs
Single-queue multiple-servers (manager/workers) is the most efficient queue structure (e.g., airline check-in lines).
However, queuing imposes communication overhead, yet another tradeoff, now the cost of moving a task versus the cost of solving it where it is generated.
Parallel computing has too many “however”s!
However, if it was too easy, you wouldn’t need this tutorial
Stout and Jablonowski – p. 249/324
Queueing Bottleneck
Sometimes the manager is a bottleneck. Can ameliorate:
"Chunk" tasks to reduce overhead. May use large chunks initially, then decrease them near the end to fine-tune load balance.
Use distributed queues, perhaps with
multiple manager/worker subteams, with some communication between managers
every worker is also a manager, keeping some tasks and sending extras to others. Many variations on deciding when/where to send work.
Stout and Jablonowski – p. 250/324
OpenMP Load-Balancing
The previous descriptions had a distributed memory flavor, though they also work well for shared memory.
However, shared memory has additional options. OpenMP loop work-sharing constructs require little programmer effort. With the SCHEDULE option can specify
STATIC: simple, suitable if loop iterations take the same amount of time and there are enough per processor. For scattered decomposition, specify chunk size.
DYNAMIC: a queue of work, each processor gets chunksize iterations when ready.
GUIDED: dynamic queue with chunks of exponentially decreasing size.
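A small C/OpenMP sketch contrasting these choices on a loop whose iterations take uneven time; the work function and chunk size of 16 are hypothetical, and the schedule clause can be swapped between static, dynamic and guided:

#include <omp.h>
#include <math.h>
#include <stdio.h>

#define N 10000

/* Later iterations do more work, so a pure STATIC split would be imbalanced. */
static double uneven_work(int i)
{
    double s = 0.0;
    for (int j = 0; j < i; j++)
        s += sin((double)j);
    return s;
}

int main(void)
{
    double total = 0.0;
    /* schedule(dynamic,16): each thread grabs 16 iterations at a time from a queue;
       alternatives: schedule(static) or schedule(guided,16) */
    #pragma omp parallel for schedule(dynamic,16) reduction(+:total)
    for (int i = 0; i < N; i++)
        total += uneven_work(i);
    printf("%f\n", total);
    return 0;
}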
Stout and Jablonowski – p. 251/324
Load-Balancing Summary
Load-balancing is critical for high performance.
Depending on the application, can range from trivial to nearly impossible. A wide range of approaches are needed, and new ones are constantly being developed.
Load-balancing needs to be approached as part of a systematic effort to improve performance.
Try simple approaches first.
Stout and Jablonowski – p. 252/324
DATA INTENSIVE COMPUTING
Databases are an important commercial application of parallel computers, providing a base which helps keep commercial parallel computing viable.
Massive data collections becoming important in scientific fields such as bioinformatics, astronomy, physics, ...
Many of the ideas are used elsewhere, though sometimes obscured by different terminology. We'll just briefly examine some aspects.
Stout and Jablonowski – p. 253/324
Application Areas
Web browsing
Real-time applications: air traffic, stock trading, streaming multimedia
Data Warehouse: organize massive amounts of commercial, scientific data
CERN Large Hadron Collider: ≈ 30 TB/day, ≈ 10 PB/year
Data Mining: extract useful information from vast collections of text, photographs, web pages, etc.
Stout and Jablonowski – p. 254/324
Some Terminology
Often data intensive systems use terminology that is somewhat different, though often the ideas are similar to ones already touched on. Some examples:
skew = load imbalance
scaleup = speedup
transactions per second (TPS) = throughput
TPS is often used to measure performance, instead of flops
Stout and Jablonowski – p. 255/324
Characteristics
Disk access and bandwidth dominate performance. Organizing the information to match the access patterns is often critical.
Systems for scientific applications are somewhat newer, complicated by factors such as being dispersed among sites, people trying to combine or mine information in new ways, billions of files (e.g., a constant stream of images), etc.
Sample science collections include the Large Hadron Collider, Digital Sky, Earth Observation System. Many provide specialized tools to access the information.
Stout and Jablonowski – p. 256/324
Parallel Disk Architectures
Shared Everything (SE): All disks are directly accessible from all processors and all memory is shared, i.e., a standard shared memory system.
Shared Nothing (SN): Each disk is connected to a single processor or SMP, each has its own private memory. Most common option in clusters.
Shared Disks (SD): Any processor can access any disk, but each processor has its own private memory, e.g., storage networks.
Stout and Jablonowski – p. 257/324
Shared Everything
[Figure: processors P1 ... Pn connected by an interconnection network to a global shared memory and shared disks.]
Stout and Jablonowski – p. 258/324
Shared Nothing
[Figure: processors P1 ... Pn, each with its own private memory and private disk, connected by an interconnection network.]
Stout and Jablonowski – p. 259/324
Shared Disk
[Figure: processors P1 ... Pn, each with private memory, connected by an interconnection network to shared disks.]
Stout and Jablonowski – p. 260/324
Data Partitioning Strategies
Range Partitioning (block allocation): Easy to locate records, related data can be clustered, but danger of skew.
[Figure: consecutive key ranges mapped to consecutive disks.]
Stout and Jablonowski – p. 261/324
Data Partitioning continued
Round Robin (cyclic, striping): Allows parallelism in accessing consecutive records, but ties up many disks if different programs are running on the system.
[Figure: consecutive keys cycled across the disks.]
Stout and Jablonowski – p. 262/324
Data Partitioning continued
Hashing: Avoids systematic bottlenecks, allows for an expanding collection of keys (such as names), but complicates range queries.
[Figure: keys scattered across the disks by a hash function.]
Stout and Jablonowski – p. 263/324
Data Partitioning Parallels
Block allocation and round robin allocation are used in dividing loops in OpenMP.
Round robin allocation is used in memory systems of vector machines.
Block allocation is used in memory systems of commodity processors.
Hashed allocation used in memory system of Cydrome.
Stout and Jablonowski – p. 264/324
Data Mining
Sifting for information in a torrent of data is economically and scientifically important.
AT&T, WalMart, American Express, ... have used it for many years. Bioinformatics is an important new application area.
Many commercial data mining tools, often parallelized.
Warning: "data mining" means many different things to different people and applications.
Stout and Jablonowski – p. 265/324
Map-Reduce: New Form of Data Mining
Variations used by Google, Yahoo, IBM, etc.
Open source Hadoop: http://hadoop.apache.org/core/
Companies trying to get schools to teach this style of programming
Basic database operations, extended to less organized, far larger, systems.
Simple example: given records of (source page, link) for every company find # pages from outside the company that point to one of the company's pages.
Stout and Jablonowski – p. 266/324
Map: determine if the link record is from a page outside a company into it. If so, generate a new record (destination company, 1)
embarrassingly parallel, vast number of records, I/O bound
Reduce: combine records by company and sum the counts
requires communication, but far fewer records
Implementations: significant emphasis on locality, efficiency, fault tolerance
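A toy C sketch of the map and reduce steps for this link-count example; the record layouts, company names and function names are hypothetical, and the real frameworks distribute the key/value pairs across many machines and files:

#include <stdio.h>
#include <string.h>

typedef struct { char src_company[32]; char dst_company[32]; } LinkRecord;
typedef struct { char company[32]; int count; } Pair;

/* Map: emit (destination company, 1) for every cross-company link. */
static int map_link(const LinkRecord *r, Pair *out)
{
    if (strcmp(r->src_company, r->dst_company) == 0)
        return 0;                                /* internal link: emit nothing */
    strcpy(out->company, r->dst_company);
    out->count = 1;
    return 1;
}

/* Reduce: sum the counts of all pairs with the same company key. */
static int reduce_company(const char *company, const Pair *pairs, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        if (strcmp(pairs[i].company, company) == 0)
            total += pairs[i].count;
    return total;
}

int main(void)
{
    LinkRecord recs[3] = { {"acme", "globex"}, {"acme", "acme"}, {"initech", "globex"} };
    Pair pairs[3];
    int npairs = 0;
    for (int i = 0; i < 3; i++)
        npairs += map_link(&recs[i], &pairs[npairs]);
    printf("globex: %d\n", reduce_company("globex", pairs, npairs));   /* prints 2 */
    return 0;
}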
Stout and Jablonowski – p. 267/324
Sample Map-Reduce Execution
Source: http://code.google.com/edu/parallel/mapreduce-tutorial.html
Stout and Jablonowski – p. 268/324
PERFORMANCE
Developing large-scale scientific or commercial applications that make optimum use of the computational resources is a challenge.
Resources can easily be underutilized or used inefficiently.
The factors that determine the program's performance are often hidden from the developer.
Performance analysis tools are essential to optimizing the serial or parallel application.
Performance is typically measured in floating point operations per second: Mflop/s, Gflop/s or Tflop/s.
Stout and Jablonowski – p. 269/324
CPU Performance Measures
Performance
is compared via benchmarks like LINPACK
more relevant: benchmarks with user application
most often on scalar machines: cache-optimized programs reach ≈ 10% of the peak performance
Example: Weather prediction code IFS (ECMWF)
Stout and Jablonowski – p. 270/324
Application-System Interplay
System factors:
Chip architecture (e.g. # floating point units per CPU)
Memory hierarchy (register - cache - main memory - disk)
I/O configuration
Compiler
Operating System
Connecting network between processors
Stout and Jablonowski – p. 271/324
Application-System Interplay
Application factors:
Programming language
Algorithms and implementation
Data structures
Memory management
Libraries (e.g. math libraries)
Size and nature of data set
Compiler optimization flags
Use of I/O
Message passing library / OpenMP
Communication pattern
Task granularity
Load balancing
Stout and Jablonowski – p. 272/324
Performance Gains: Hardware
Factor ≈ 10^4 over the last 15 years
Stout and Jablonowski – p. 273/324
Performance Gains: Software
Gains expected from better algorithms, example:
[Figure: speedup factor (10^0 to 10^5) gained from better linear algebra algorithms between 1970 and 2000: Gauss-Seidel, Successive Over-Relaxation, Conjugate Gradient, Multi-Grid, Sparse Gaussian Elimination.]
Derived from Computational Methods (Linear Algebra)
Gains also expected from better load-balancing strategies, parallel I/O, etc.
Stout and Jablonowski – p. 274/324
Parallel Performance Analysis
Reliable performance analyses are the key to improving the performance of a parallelized program.
They reveal not only typical bottleneck situations but also determine the hotspots.
Key question: How efficient is the parallel code?
Important to consider: Time spent
communicating to other processors
waiting for a message to be received
wasted waiting for other processors
When selecting a performance tool consider:
How accurate is the technique?
Is the tool simple to use?
How intrusive is the tool?
Stout and Jablonowski – p. 275/324
Parallel and Serial Performance Analysis
Goal: reduce the program's wallclock execution time
Practical, iterative approach:
measure the code with a hardware performance monitor and profiler
analyze hotspots
optimize and parallelize hotspots and eliminate bottlenecks
evaluate performance results and improve optimization / parallelization
Analysis techniques
Timing
Counting
Profiling
Tracing
Stout and Jablonowski – p. 276/324
Timing of Parallel Programs
MPI / OpenMP provide compiler-independent timing functions for the wallclock time
MPI_Wtime / OMP_GET_WTIME
Requires source code changes: instrument the program
Typical sequence (MPI program):
real t1, t2, seconds
t1 = MPI_Wtime()
... code to be timed
t2 = MPI_Wtime()
seconds = t2 - t1   ! wallclock time
Evaluation of parallel speedup: Always measure the wallclock time. Measuring the CPU time would neglect the system overhead for the parallelization!
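The same instrumentation idea in C, using MPI_Wtime (or omp_get_wtime in a pure OpenMP code); work_section is a placeholder for the code to be timed:

#include <mpi.h>
#include <stdio.h>

static void work_section(void) { /* ... code to be timed ... */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double t1 = MPI_Wtime();
    work_section();
    double t2 = MPI_Wtime();
    printf("wallclock seconds: %f\n", t2 - t1);

    MPI_Finalize();
    return 0;
}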
Stout and Jablonowski – p. 277/324
Hardware Performance Monitors (HPM)
Hardware counters gather performance-relevant events of the microprocessor without affecting the performance of the analyzed program. Two classes:
Processor monitor:
non-intrusive counts
consists of a group of special purpose registers
registers keep track of events during runtime: general and floating point instructions, cache misses, branch misprediction
measures the Mflop/s rate fairly accurately
System level monitor (bus and network monitor):
bus monitor: memory traffic, cache coherency
network monitor: records network traffic
Stout and Jablonowski – p. 278/324
PAPI: The Portable Performance API
mature public-domain Hardware Performance Monitor
version Papi 3.6.1 released in 8/2008
vendor independent hardware counter tool
supports most current processors including the "Cell" processor
user needs to instrument the code ⇒ PAPI functions
Fortran and C/C++ user interfaces
easy-to-use and powerful high level API
Home page: http://icl.cs.utk.edu/papi/index.html
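A minimal instrumentation sketch using PAPI's low-level API with the preset events PAPI_FP_OPS and PAPI_TOT_CYC; error handling is omitted, the measured function is a placeholder, and the calls should be checked against the installed PAPI version:

#include <papi.h>
#include <stdio.h>

static void do_work(void) { /* ... code to be measured ... */ }

int main(void)
{
    int eventset = PAPI_NULL;
    long long values[2];

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_FP_OPS);     /* floating point operations */
    PAPI_add_event(eventset, PAPI_TOT_CYC);    /* total cycles */

    PAPI_start(eventset);
    do_work();
    PAPI_stop(eventset, values);

    printf("FP ops: %lld  cycles: %lld\n", values[0], values[1]);
    return 0;
}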
Stout and Jablonowski – p. 279/324
Profiling of Parallel Programs
simplest tool: UNIX profiler gprof
interrupts program execution at constant time intervals
counts the interruptions
the more interruptions, the more time spent in this part of the code
sum of all processors is displayed
Profilers identify hotspots, but limited use for parallel code:
they measure CPU time, not wallclock time
they sum over all invocations of each routine
profilers cannot show load imbalance
Stout and Jablonowski – p. 280/324
Profiling: Graphical User Interfaces
Commercial: allinea opt (http://www.allinea.com), optimization and profiling tool for multiple hardware platforms.
IBM AIX systems (built-in): xprofiler
Graphical user interface based upon the gprof profiling utility.
Displays: Timing and call graph profile, summary charts, source code displays, library clusters.
Filtering and zooming features allow focusing the displays on portions of the call tree.
Public domain Tuning and Analysis Tool TAU: http://www.cs.uoregon.edu/research/tau
Stout and Jablonowski – p. 281/324
GUI Example Xprofiler
Portions of the program which accumulate the most "ticks" (interrupts) reflect the area where the program spends the most time
Stout and Jablonowski – p. 282/324
Profiling: Pitfalls
Due to the periodic sampling of the program counter the output might be slightly different when the same program is profiled multiple times.
Measure the code over a representative time interval using typical data sets. Sampling should last at least several minutes.
Optimizing compiler flags are allowed: expect a different profile when using the -O option, try with and without optimization.
Different hardware / different compilers might lead to different profiles.
But: the most time consuming functions should be detected in any case, maybe in a different order.
Stout and Jablonowski – p. 283/324
MPI and OpenMP Trace Tools
Collect trace data at run time, display post-mortem
Assess performance, bottlenecks and load-balancing problems in MPI & OpenMP codes
Intel's trace visualization tool Trace Analyzer & Collector (only on Intel platforms)
Vampir and Vampirtrace (platform independent)
Trace analyzer developed and supported by the Center for Information Services and High Performance Computing, Dresden, Germany (http://vampir.eu)
Free evaluation keys for both available online.
Stout and Jablonowski – p. 284/324
Trace Analyzer & Collector / Vampir
Trace Analyzer / Vampir graphical user interface helps
understand the application behavior
evaluate load balancing
show barriers, locks, synchronization
analyze the performance of subroutines/code blocks
learn about communication and performance
identify communication hotspots
Trace Collector / Vampirtrace
Libraries that trace MPI and application events, generate a trace file (files can become big!)
Convenient: Re-link your code and run it
Provides API for more detailed analyses
Stout and Jablonowski – p. 285/324
Graphical User Interface
Trace Analyzer / Vampir provides graphical displays that visualize important aspects of the runtime behavior:
detailed timeline view of events and communication
statistical analysis of program execution
statistical analysis of communication operations
dynamic calling tree and source-code display
I/O statistics
Trace Analyzer / Vampir
provides powerful zooming and filtering features
can display source code references if recorded
Vampir supported on almost all HPC platforms
Stout and Jablonowski – p. 286/324
Vampir Analysis – Global Timeline
here: uninstrumented version of the program
therefore: the routines of the user code cannot be distinguished and are displayed as "Application"
Stout and Jablonowski – p. 287/324
Vampir Analysis – Zoom-in Timeline
Zoom-in: ⇒ Communication and synchronization
Stout and Jablonowski – p. 288/324
Vampir Analysis – Activity Chart
Global activity chart ⇒ Load-imbalance
Stout and Jablonowski – p. 289/324
Vampir Analysis – Summary Chart
Summary for the whole application: timing data
Summary for the whole application: timing data
Stout and Jablonowski – p. 290/324
Vampir Analysis – MPI Summary
Stout and Jablonowski – p. 291/324
Public Domain Trace Tools
Jumpshot-4 (http://www-unix.mcs.anl.gov/perfvis/)
Graphical displays of timelines, histograms, MPI overhead and more
Instant zoom in/out, search/scan facility
TAU – Tuning and Analysis Utilities (version 2.17.1)
Developed at the University of Oregon, mature
Free, portable, open-source profiling/tracing facility (http://www.cs.uoregon.edu/research/tau)
Performance instrumentation, measurement and analysis toolkit for distributed and shared memory applications (includes MPI, OpenMP)
Graphical displays for all or individual processes
Manual or automatic source code instrumentation
Stout and Jablonowski – p. 292/324
Performance Analysis: Strategy
Hardware counters provide information on Mflop/s rates: do you need to optimize?
Use profilers to identify hotspots
Focus the analysis/optimization efforts on the hotspots
Analyze trace information: gives detailed overview of the parallel performance and load-balance, and reveals bottlenecks
two different modes: the uninstrumented or instrumented mode (requires source code changes ⇒ Pitfall: can lead to huge trace files)
Recommendation: instrument only hotspots for a detailed view of the run time behavior
Stout and Jablonowski – p. 293/324
Debugging of Parallel Programs
Increased parallel complexity makes the debugging process more difficult.
Traditional sequential debugging technique is a cyclic approach where the program is repeatedly stopped at breakpoints and then continued or re-executed again.
Conventional style of debugging sometimes difficult with parallel programs: they do not always show reproducible behavior, e.g. race conditions.
Always: turn on compiler debugging options like array-bound checks
Most powerful commercial debuggers:
TotalView (http://www.totalviewtech.com)
allinea ddt (http://www.allinea.com)
Stout and Jablonowski – p. 294/324
Characteristics of Totalview
Very powerful and mature debugger, current version 8.6
Source-level, graphical debugger for C, C++, Fortran, High Performance Fortran (HPF) and assembler code
Multiprocess (MPI) and multithread (OpenMP) codes
Supports multi-platform applications
Intuitive, easy-to-learn graphical interface
Industry leader in MPI and OpenMP debugging
Control functions to run, step, breakpoint, interrupt or restart a process
Ability to control all parallel processes coherently
Good tutorial on TotalView with parallel debugging tips: http://www.llnl.gov/computing/tutorials/totalview/
Stout and Jablonowski – p. 295/324
TotalView: The Process Window
5 panes
[Screenshot annotations: zoom into code or variables; visualize variables; filter, sort or slice data; set breakpoints; scan parallel processes; step by step execution]
Stout and Jablonowski – p. 296/324
TotalView: Message Queue Graph
Graphical representation of the message queue state ⇒ Red = Unexpected, Blue = Receive, Green = Send
Stout and Jablonowski – p. 297/324
Boost the Performance: Practical Tips
Turn on compiler optimization flags
Search for better algorithms and data structures
For scientific codes: use optimized math libraries
Tune the program:
data locality and cache re-use within loops
avoid divisions, indirect addressing, IF statements, especially in loops
loop unrolling and function inlining (often a compiler option), minimize/optimize I/O, ...
Load-balance the code
Avoid synchronization/barriers whenever possible
Optimize partitioning to minimize communication
Identify inhibitors to parallelism: data dependencies, I/O
Stout and Jablonowski – p. 298/324
Parallel Scientific Math Libraries
Parallel math libraries are available on most hardware platforms. Highly optimized and recommended.
ScaLAPACK (Scalable LAPACK):
Public-domain, high-performance linear algebra routines for MPI applications
Promotes modularity via interfaces to the libraries BLAS, BLACS and PBLAS
NAG Parallel Libraries (commercial, often installed):
Mostly high speed linear algebra routines
In addition: random number generation and quadrature routines
PETSc (Portable, Extensible Toolkit for Scientific computation):
Designed with MPI for partial differential equations
Stout and Jablonowski – p. 299/324
Toolkits for Scientific Computing
ACTS toolkit — Advanced CompuTational Software (http://acts.nersc.gov):
Public domain tools mostly developed at US labs
Collection of tools that is interoperable, with API
General solutions to complex programming needs
Includes
Numerical solvers: PETSc, ScaLAPACK, Aztec, ...
Structural frameworks: Software that manages data & communication like Overture and Global Arrays
Runtime & support tools: CUMULUS, TAU
Eclipse: Parallel Tools Platform (PTP)
open-source project: wide variety of parallel tools
http://www.eclipse.org/ptp/
Stout and Jablonowski – p. 300/324
USING PARALLEL SYSTEMS
In addition to programming, there are many issues concerning the use of parallel systems.
For example, they are often a centralized resource that must be shared, much like the mainframes of olden days.
Your institution may decide to purchase a system, or buy time elsewhere.
Stout and Jablonowski – p. 301/324
Batch Queuing
A return to '60s-style computer usage.
Large parallel systems use batch queuing, but may allow small interactive jobs for debugging.
If there are multiple queues, learn how they are structured and serviced; it's you vs. them.
If you submit several jobs at once, you may be your own bottleneck. You might improve throughput by requesting fewer processors, and more time, per job (remember Amdahl's Law; a worked example follows).
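A hedged worked example (the numbers are illustrative, not from the tutorial): with a 5% serial fraction, Amdahl's Law gives speedup S(p) = 1/(0.05 + 0.95/p), so S(16) ≈ 9.1 but S(64) ≈ 15.4. Quadrupling the processors per job speeds each job up by less than a factor of 1.7, whereas four 16-processor jobs running side by side on the same 64 processors complete roughly 2.4 times as much work per hour as one 64-processor job.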
Stout and Jablonowski – p. 302/324
Access to Systems
Academics: Can apply for free time at NSF supercomputing centers, or perhaps at your own university. For modest amounts of time the NSF process is easy and quick, but thousands of hours require a more detailed application. You need to show that you can effectively utilize the machines (e.g., speedup curves) and are doing good research.
Grants from other agencies usually include access to their large systems.
Businesses: Can purchase time from hardware vendors, sometimes from university centers.
Stout and Jablonowski – p. 303/324
Purchasing Systems
Buying systems is very complicated. Some questions:
Can it run your major applications? This may depend on ISVs.
Will vendor be around in five years?
Is there an upgrade path if you need to expand soon?
Can you get (and afford) tools, compilers, and libraries for developing new applications?
Is the system reliable? Is the maintenance policy acceptable?
Do you have sufficient power and air conditioning?
Stout and Jablonowski – p. 304/324
What to Buy?
How much of the budget goes to processors vs. memory vs. communication?
Do you want more, or faster, processors, i.e., price-performance or performance?
Need to understand your major applications, and delivered versus peak performance.
Stout and Jablonowski – p. 305/324
Where are You on the Curve?
As a user or buyer:
[Plot: speedup and price/performance as functions of the number of processors.]
Stout and Jablonowski – p. 306/324
Cluster Systems
Some groups build their own; resources are available to help.
However, many users just look at machine cost.
Typically, total costs are at least twice the initial costs.
Maintenance:
Many little things, hardware and software, go wrong or need upgrading; who will keep fixing this?
Who does backups?
Maintenance is time-consuming and harmful to your career.
Stout and Jablonowski – p. 307/324
WRAP UP
We'll review some of the material learned, discuss some general problems with parallel computing, and point out some trends in the area.
Stout and Jablonowski – p. 308/324
Trends in Parallel Computing
It's useful to have a sense of where the field is going.
Stout and Jablonowski – p. 309/324
Trend: Power Critical
For given technology, typically power ≈ speed²
Heat limits density
Power In = Heat Out, so AC demands also increase
System speed requires closely packed components: systems such as BlueGene make a tradeoff, accepting a slower clock speed and smaller RAM in exchange for greater density and more total processing power.
This tradeoff runs opposite to programmer needs (Amdahl's Law).
Stout and Jablonowski – p. 310/324
Power Trend
Stout and Jablonowski – p. 311/324
Trend: Chip Density Still Increasing
Hardware designers are running out of old tricks, so they just replicate processors on chips: multi-core, many-core.
While potential chip performance continues to increase, the number of I/O wires per chip doesn't keep pace with the number of cores, putting more stress on cache locality.
GPUs (and the IBM Cell) have a large number of simple processors and very high FLOPS, but need locality for efficient vector-like operations.
Stout and Jablonowski – p. 312/324
Chip Trends
Sources: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Stout and Jablonowski – p. 313/324
Good Optimization still Bleeding Edge
Economics pushes for using commodity parts, especially since they have high potential. Unfortunately:
No useful GPU programming standards
Multicores differ on the caching provided
No good way to optimize for both GPU and multicore: portable optimization not yet attained
Need better compilers to exploit parallelism (e.g., much smarter OpenMP compilers)
Need better ways of expressing parallelism (ask DARPA, Intel, Microsoft, etc.!)
Stout and Jablonowski – p. 314/324
More Trends
Roadrunner grabs the headlines, but clusters and SMPs are the most important economically; "commodity" parts now include chips, boards, blades, ...
Increasing use of commercial parallelized software
Some parallel computing companies will fail.
Stout and Jablonowski – p. 315/324
Should You Parallelize?
Parallel programming is difficult; is it worthwhile? Pancake [1996] suggests first determining:
How often is the program used between changes?
How much time does it take (or is expected to take)?
How satisfied are users with current results?
Need more resolution
Need results faster
Will be flooded with data, ...
Stout and Jablonowski – p. 316/324
Degrees of Difficulty
Some problems are much easier to parallelize than others. Classes of problems range from:
Embarrassingly parallel Separate jobs with no interaction, easy to run on any system.
Static Important load-balancing parameters, such as size, are known in advance. Often the same configuration is run many times.
Data-dependent Dynamic Often quite difficult to achieve an efficient implementation.
Stout and Jablonowski – p. 317/324
Review: Software Engineering
Standard programming interfaces (e.g., MPI, OpenMP) and tools reduce the learning curve and preserve investment.
Start with an overview of data structures & time requirements; do profiling as needed.
Prioritize sections to be parallelized, and adapt as youlearn.
Parallelize at the outermost loop possible (see the OpenMP sketch at the end of this slide)
Proceed incrementally, constantly verifying correctness
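As a hedged sketch of the "outermost loop" advice (not from the tutorial; the array names and sizes are invented), the OpenMP fragment below parallelizes the outer i loop, so each thread gets a large contiguous chunk of iterations instead of paying fork/join overhead on every inner sweep.

/* outer_loop_omp.c -- illustrative only.  Compile with an OpenMP flag,
 * e.g.  cc -fopenmp outer_loop_omp.c
 */
#include <stdio.h>

#define NI 1000
#define NJ 1000

static double a[NI][NJ], b[NI][NJ];

int main(void)
{
    /* Parallelizing the OUTER i loop gives each thread whole rows;
     * parallelizing the inner j loop instead would start and stop the
     * thread team NI times per sweep. */
    #pragma omp parallel for
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            a[i][j] = 0.25 * (b[i][j] + 2.0) + 0.5 * a[i][j];

    printf("a[0][0] = %f\n", a[0][0]);
    return 0;
}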
Stout and Jablonowski – p. 318/324
Review: Efficiency
Reduce communication costs:
maximize data locality
eliminate false sharing in shared memory systems
combine messages to reduce overhead and synchronization
send data (distributed memory) or write data (shared memory) early, receive or read late (see the MPI sketch at the end of this slide).
Reduce load imbalance and synchronization.
Utilize compiler optimizations, optimized routines, etc.
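To illustrate "send early, receive late" in the distributed-memory case, here is a hedged MPI sketch (not from the tutorial; the buffer size and neighbor pattern are invented). Nonblocking sends and receives are posted as soon as the data is ready, local computation proceeds while the message is in flight, and the waits happen only where the received data is actually needed.

/* overlap_demo.c -- illustrative only.  Run with 2 ranks,
 * e.g.  mpirun -np 2 ./overlap_demo
 */
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    static double outbuf[N], inbuf[N];
    int rank, size;
    MPI_Request sreq, rreq;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int partner = (rank + 1) % size;          /* exchange with a neighbor */

    /* Post the receive and send as EARLY as possible ... */
    MPI_Irecv(inbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &rreq);
    MPI_Isend(outbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &sreq);

    /* ... overlap with computation that does not need the message ... */
    double local = 0.0;
    for (int i = 0; i < N; i++)
        local += outbuf[i];

    /* ... and wait as LATE as possible, just before inbuf is used. */
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);

    printf("rank %d: local sum %g, first received value %g\n",
           rank, local, inbuf[0]);

    MPI_Finalize();
    return 0;
}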
Stout and Jablonowski – p. 319/324
If It Isn’t Working Well . . .
The original program probably wasn't written with parallelism in mind
See if there is a more parallelizable approach
Sometimes parallelizable approaches aren't the most efficient ones available for serial computers, but that is OK if you are going to use many processors.
Remember Amdahl’s Law:
Efficient massive parallelism is difficult.
Stout and Jablonowski – p. 320/324
Finally • • •
Make sure your goals are realistic, and remember that your own time is valuable.
Stout and Jablonowski – p. 321/324
REFERENCES
Selected web resources for parallel computing are (occasionally) maintained at
http://www.eecs.umich.edu/~qstout/parlinks.html
Stout and Jablonowski – p. 322/324
References
[G. Amdahl 1967], “Validity of the single processor approach to achieving large scalecomputing capabilities”, AFIPS Conf. Proc. 30 (1967), pp. 483–485.
Co-Array Fortran: http://www.co-array.org.
[M.J. Flynn 1966], “Very high-speed computing systems”, Proc. IEEE 54 (1966),pp. 1901–1909.
[J.L. Gustafson 1988], “Reevaluating Amdahl’s Law”, Communications of the ACM 31(1988), pp. 532–533.
Hadoop: http://hadoop.apache.org/core/
Hilbert space-filling curve: see the routines available in Zoltan (listed below).
[MacNeice et al. 2000], "PARAMESH: A parallel adaptive mesh refinement community toolkit", Comp. Physics Commun. 128 (2000), pp. 330–354.
Metis and Parmetis: http://www.cs.umn.edu/~karypis/metis/
MPI: documentation at http://www.mpi-forum.org/
Free, portable versions at: http://www.mcs.anl.gov/research/projects/mpich2
http://www.open-mpi.org/
Stout and Jablonowski – p. 323/324
References continued
OpenMP: http://openmp.org/wp/.
[C.M. Pancake 1996], "Is parallelism for you?", IEEE Comp. Sci. & Engin., 3 (1996), pp. 18–37.
[Pancake, Simmons, and Yan 1995], "Performance evaluation tools for parallel and distributed systems", IEEE Computer, Vol. 28, No. 11 (1995), pp. 16–20.
Parallel computing, a slightly whimsical explanation: http://www.eecs.umich.edu/~qstout/parallel.html
Roadrunner: http://www.lanl.gov/roadrunner/index.shtml
[Stout et al. 1997], "Adaptive blocks: A high-performance data structure", Proc. SC'97. http://www.eecs.umich.edu/~qstout/abs/SC97.html
Top500. Website with extensive collection of references: http://www.Top500.org
UPC (Unified Parallel C): http://upc.gwu.edu.
Zoltan (collection of routines for load balancing et al.): http://www.cs.sandia.gov/Zoltan.
Stout and Jablonowski – p. 324/324