
Parallel Computing 101

Quentin F. Stout Christiane Jablonowski

University of Michigan

Copyright © 2008


Organization

Part I
Introduction, Terminology
Example (crash simulation)
Speedup and Efficiency, Amdahl’s Law
Architectures
Distributed Memory Communication, MPI
Parallelizing Serial Programs I
Load Balancing I
Shared Memory, OpenMP


Organization cont.

Part II
Hybrid Computing
Vector Computing, Climate Modeling
Parallelizing Serial Programs II
Load Balancing II
Data Intensive Computing
Performance Improvement, Tools
Using and Buying Parallel Systems
Review, Wrapup


INTRODUCTION

In this part we introduce parallel computing and some useful terminology. We examine many of the variations in system architecture, and how they affect the programming options.

We will look at a representative example of a large scientific/engineering code, and examine how it was parallelized. We also consider some additional examples.


Why use Parallel Computers?

Parallel computers can be the only way to achieve specific computational goals at a given time.

PetaFLOPS and Petabytes for Grand Challenge problems
Kilo-transactions per second for search engines, ATM networks, digital multimedia

Parallel computers can be the cheapest or easiest way to achieve a specific computational goal at a given time: e.g., cluster computers made from commodity parts.

Parallel computers can be made highly fault-tolerant:
Nonstop computing at nuclear reactors
Web search


Why Parallel Computing — continued

The universe is inherently parallel, so parallel models fit it best.

Physical processes occur in parallel: weather, galaxy formation, nuclear reactions, epidemics, ...

Social/work processes occur in parallel: ant colonies, wolf packs, assembly lines, stock exchange, tutorials, ...


Basic Terminology and Concepts

Caveats

The definitions are fuzzy, many terms are not standardized, and definitions often change over time.

Many algorithms, software, and hardware systems do not match the categories, often blending approaches.

No attempt is made to cover all models and aspects of parallel computing. For example, quantum computing is not included.


Parallel Computing Thesaurus

Parallel Computing Solving a task by simultaneous use of multiple processors, all components of a unified architecture.

Embarrassingly Parallel Solving many similar, but independent, tasks. E.g., parameter sweeps.

Symmetric Multiprocessing (SMP) Multiple processors sharing a single address space and access to all resources.

Multi-core Processors Multiple processors (cores) on a single chip. Aka many-core. Heterogeneous multi-core chips with GPUs are being developed.

Cluster Computing Hierarchical combination of commodity units (processors or SMPs) to build a parallel system.


Thesaurus continued

Supercomputing Use of the fastest, biggest machines to solve large problems. Historically vector computers, but now parallel or parallel/vector machines.

High Performance Computing Solving problems via supercomputers + fast networks + visualization.

Pipelining Breaking a task into steps performed by different units, with inputs streaming through, much like an assembly line.

Vector Computer An operation such as multiply is broken into several steps and applied to a stream of operands (pipelining with “vectors”).


Pipelining, Detroit Style


Who Uses Supercomputers?

Historically, the military (nuclear simulations, cryptography). Weather forecasting was the main civilian application.

These continue to be major users, but now there are many more civilian users.

The following charts are from the Top 500 list, showing the status as of June. The newest list has just been announced and is on the Top500 website:

http://www.top500.org

Top500: Performance

Top500: Application Systems

Top500: Architecture Systems

Top500: Vendor Systems

CRASH SIMULATION

A greatly simplified model, based on parallelizing crash simulation for Ford Motor Company. Such simulations save a significant amount of money and time compared to testing real cars.

This example illustrates various phenomena which are common to a great many simulations and other large-scale applications.


Finite Element Representation

The car is modeled by a triangulated surface (the elements).

The simulation consists of modeling the movement of the elements during each time step, incorporating the forces on them to determine their new position.

In each time step, the movement of each element depends on its interaction with the other elements that it is physically adjacent to.


The Car of the Future


Basic Serial Crash Simulation

1 For all elements
2   Read State(element), Properties(element), Neighbor_list(element)
3 For time=1 to end_of_simulation
4   For element = 1 to num_elements
5     Compute State(element) for next time step, based on previous state
      of element and its neighbors, and on properties of element

Periodically, State is stored on disk for later visualization.
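A minimal serial C sketch of this loop structure. The State and Properties types, MAX_NEIGHBORS, and the compute_new_state() routine are hypothetical placeholders, not part of the original code; the point is the double-buffered update, so that every element in a time step sees the previous state of its neighbors, as the pseudocode requires.

#include <stdlib.h>
#include <string.h>

#define MAX_NEIGHBORS 8

typedef struct { double pos[3]; double vel[3]; } State;
typedef struct { double mass; } Properties;

/* Placeholder physics: the real code would apply the forces from the
   adjacent elements; here the old state is simply returned unchanged. */
static State compute_new_state(int e, const State *state,
                               const int *nbrs, const Properties *prop)
{
    (void)nbrs; (void)prop;
    return state[e];
}

void simulate(State *state, const Properties *prop,
              const int (*neighbors)[MAX_NEIGHBORS],
              int num_elements, int num_steps)
{
    State *next = malloc(num_elements * sizeof *next);
    for (int t = 0; t < num_steps; t++) {
        for (int e = 0; e < num_elements; e++)
            next[e] = compute_new_state(e, state, neighbors[e], &prop[e]);
        memcpy(state, next, num_elements * sizeof *state);
        /* periodically: write state to disk for later visualization */
    }
    free(next);
}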


Simple approach to parallelization

Parallel computer based on PC-like processors linked with a fast network, where processors communicate via messages. Distributed memory or message-passing.

Cannot parallelize time, so parallelize space.

Distribute elements to processors, each processor updates the positions of the elements it contains: owner computes.

All machines run the same program: SPMD, single program multiple data.

SPMD is the dominant form of parallel computing.


A Distributed Car


Basic Parallel Version

Concurrently for all processors P
1 For all elements assigned to P
2   Read State(element), Properties(element), Neighbor-list(element)
3 For time=1 to end-of-simulation
4   For element = 1 to num-elements-in-P
5     Compute State(element) for next time step, based on previous state
      of element and its neighbors, and on properties of element


Software Engineering Aspects

Most parallel code is the same as, or similar to, the serial code, reducing parallel development and life-cycle costs, and helping keep parallel and serial versions compatible.

Life-cycle costs are often overlooked until it is too late!

Note that the high-level structure is the same as the serial version: a sequence of steps. The sequence is a serial construct, but the steps are performed in parallel.



Some Basic Questions: Allocation

How are elements assigned to processors?

Typically the element assignment is determined by serial preprocessing, using domain decomposition approaches (load balancing) described later.



Separation?

How does a processor keep track of adjacency info for neighbors in other processors?

Use ghost cells (halo) to copy remote neighbors, and add a translation table to keep track of their location and of which local elements are copied elsewhere.


Ghost Cells



Update?

How does a processor use State(neighbor) when it does not contain the neighbor element?

Could request state information from the processor containing the neighbor. However, it is more efficient if that processor sends it.



Coding and Correctness?

How does one manage the software engineering of the parallelization process?

Utilize an incremental parallelization approach.

Constantly check test cases to make sure the answers are correct.



Efficiency?

How do we evaluate the success of the parallelization, and if not successful, how do we improve it?

Evaluate via speedup or efficiency metrics; improve via profiling and iterative refinement.


Evaluating Parallel Programs

An important component of effective parallel computing is determining whether the program is performing well. If it is not running efficiently, or cannot be scaled to the target number of processors, then one needs to determine the causes of the problem and develop better approaches.


Definitions

For a given problem A, let

SerTime(n) = Time of best serial algorithm to solve A for input of size n.

ParTime(n,p) = Time of the parallel algorithm+architecture to solve A for input of size n, using p processors.

Note that SerTime(n) ≤ ParTime(n,1).

Speedup: SerTime(n) / ParTime(n,p)

Work (cost): p · ParTime(n,p)

Efficiency: SerTime(n) / [p · ParTime(n,p)]
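These metrics are simple ratios of measured times. A small C sketch, purely illustrative (the example times are made up):

#include <stdio.h>

/* Metrics from the definitions above; times are in seconds. */
double speedup(double ser_time, double par_time)          { return ser_time / par_time; }
double work(int p, double par_time)                        { return p * par_time; }
double efficiency(double ser_time, int p, double par_time) { return ser_time / (p * par_time); }

int main(void)
{
    double ser = 100.0, par = 15.0;   /* hypothetical measured times */
    int p = 8;
    printf("speedup %.2f  work %.1f  efficiency %.2f\n",
           speedup(ser, par), work(p, par), efficiency(ser, p, par));
    return 0;
}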


In general, expect:

0 < Speedup ≤ p

Serial Work ≤ Parallel Work < ∞

0 < Efficiency ≤ 1

Technically, speedup is linear if there is a constant c > 0 so that speedup is at least c · p. However, many use this term to mean c = 1.

Always involves some restriction on the relationship of p and n, e.g., p ≤ n, or p = √n.


Observed Speedup

[Figure: observed speedup vs. number of processors, with three curves labeled Perfect, Occasional, and Common]


Superlinear Speedup

Very rare. Some reasons for speedup > p (efficiency > 1)

The parallel computer has p times as much RAM, so a higher fraction of program memory is in RAM instead of on disk. An important reason for using parallel computers.

In developing the parallel program a better algorithm was discovered; the older serial algorithm was not the best possible. A useful side effect of parallelization.

The parallel computer is solving a slightly different, easier problem, or providing a slightly different answer. A questionable practice.


Amdahl’s Law

Amdahl [1967] noted: given a program, let f be the fraction of time spent on operations that must be performed serially. Then for p processors,

Speedup(p) ≤ 1 / (f + (1 − f)/p).

(The right-hand side assumes perfect parallelization of the (1 − f) part of the program.)

Thus no matter how many processors are used:

Speedup ≤ 1/f

Unfortunately, typically f was 10 – 20%
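A quick numeric check of this bound (an illustrative sketch; f = 0.1 is just an example value):

#include <stdio.h>

/* Amdahl bound: Speedup(p) <= 1 / (f + (1 - f)/p), approaching 1/f as p grows. */
double amdahl_bound(double f, int p) { return 1.0 / (f + (1.0 - f) / p); }

int main(void)
{
    double f = 0.1;                      /* 10% serial fraction */
    int procs[] = {1, 10, 100, 1000};
    for (int i = 0; i < 4; i++)
        printf("p = %4d  max speedup = %.2f\n", procs[i], amdahl_bound(f, procs[i]));
    printf("limit 1/f = %.1f\n", 1.0 / f);
    return 0;
}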


Useful rule of thumb:

If the maximal possible speedup is S = 1/f, then S processors run at about 50% efficiency: substituting p = S into Amdahl’s bound gives Speedup ≤ 1/(f + (1 − f)f) ≈ S/2 for small f.


Maximal Possible Speedup

[Figure: maximal possible speedup vs. number of processors (1–1024), for f = 0.1, 0.01, and 0.001]


Maximal Possible Efficiency

[Figure: maximal possible efficiency vs. number of processors (1–1024), for f = 0.1, 0.01, and 0.001]


Amdahl Was an Optimist

Parallelization usually adds work, typically communication, which reduces speedup.

For example, crash simulation typically runs for a fixed simulated time interval. Due to the physics of the situation, if one uses n finite elements, the number of time steps grows like √n, so serial processor time grows like

C1 · n^1.5

for some C1 > 0.


Additional Parallel Communication

Suppose we use p processors. Every time step, processors receive and send information about border elements. There is also periodic global communication of total energy, contact, etc.

For simple approaches, communication time grows like

√n · (C2 · p + C3 · √(n/p)),   with C2, C3 > 0


Effect of Communication

Suppose C2 = C1 = 10 and C3 = 1. Then for n = 1000 we get the following speedup.

[Figure: speedup vs. number of processors (1–1024) for n = 1000 with these constants]


Amdahl was a Pessimist

Amdahl convinced many that general-purpose parallel computing was not viable. Fortunately, we can skirt the law.

Algorithm: there may be new algorithms with much smaller values of f. Necessity is the mother of invention.

Memory hierarchy: possibly more time is spent in RAM than on disk. Superlinear speedup.

Scaling: usually the time spent in the serial portion of the code is a decreasing fraction of the total time as the problem size increases.


Common Program Structure

[Diagram: typical program structure, with sections labeled “Serial, grows slowly with n” (twice), “Parallelizable loop, grows with n”, “Parallelizable loop within loop, grows very rapidly with n”, and “Serial, fixed time”]

Sometimes serial portions grow with problem size, but much slower than the total time.

I.e., Amdahl’s “f” decreases as n increases.


Scaling

For such programs, one can often exploit large parallel machines by scaling the problems to larger instances.

To illustrate, use a model like the crash simulation:

SerTime(n) = 10 · n^1.5

and the time for p parallel processors grows like

ParTime(n, p) = 10 · n^1.5/p + 10 · p · √n + n/√p
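A small C sketch that evaluates this model for a fixed problem size (these are the same constants as the C1 = C2 = 10, C3 = 1 example above; purely illustrative):

#include <math.h>
#include <stdio.h>

/* Crash-simulation-like model from the slide. */
double ser_time(double n)           { return 10.0 * pow(n, 1.5); }
double par_time(double n, double p) { return 10.0 * pow(n, 1.5) / p
                                           + 10.0 * p * sqrt(n)
                                           + n / sqrt(p); }

int main(void)
{
    double n = 1000.0;                      /* fixed problem size */
    for (int p = 1; p <= 1024; p *= 2) {
        double s = ser_time(n) / par_time(n, p);
        printf("p = %4d  speedup = %6.2f  efficiency = %.3f\n", p, s, s / p);
    }
    return 0;
}

Changing how n grows with p (fixed size per processor, fixed time, fixed efficiency) gives the scaling regimes discussed on the following slides.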


Fixed Size per Processor

Fixing the amount of data per processor usually gives the highest efficiency possible, hence it is commonly cited. Called weak scaling.

Suppose each processor can hold 1000 elements.

[Figure: Constant Size per Processor; efficiency vs. number of processors (1–1024)]


Fixed Time

Fix time, find largest problem solvable. Commonly used in evaluating database servers, transactions per second. [Gustafson 1988] considered this for general computing.

Fix time to be SerTime(1000).

[Figure: Constant Time; largest solvable problem size n (4000–16000) vs. number of processors (1–1024)]


Fixed Efficiency

Fix efficiency, find smallest problem needed to achieve that efficiency (isoefficiency analysis).

For example, for 90% efficiency:

[Figure: Constant Efficiency of 0.9; required problem size n (100 to 10,000,000) vs. number of processors (1–1024)]


Scalability

Linear speedup is very rare, due to communication overhead, load imbalance, algorithm/architecture mismatch, etc.

Several attempts have been made to give definitions for scalable architectures, algorithms, or algorithm-architecture combinations. However, for most users, the important question is:

Have I achieved acceptable performance on my software/hardware system for a suitable range of data and machine sizes?


ARCHITECTURAL TAXONOMIES

These classifications provide ways to think about problems and their solution.

The classifications are in terms of hardware, but there are natural software analogues.

Note: many systems blend approaches, and do not exactly correspond to the classifications.


Flynn’s Instruction/Data Taxonomy

[Flynn, 1966] At any point in time a machine can have {S,M}I {S,M}D:

SI Single Instruction: All processors execute the same instruction. Usually involves a central controller.

MI Multiple Instruction: Different processors may be executing different instructions.

SD Single Data: All processors are operating on the same data.

MD Multiple Data: Different processors may be operating on different data.


SISD: standard serial computer and program.

MISD is rare: some extreme fault-tolerance schemes, using different computers and programs to operate on the same input data, are of this type.

Almost all parallel computers are MIMD.

SIMD: there used to be companies that made such systems (Thinking Machines’ Connection Machine was the most famous).

Vector computing is a form of SIMD.


A SIMD System

[Figure: a controller holding the program broadcasts instructions to a set of processors, each holding its own data]


SIMD Software

Data parallel software: do the same thing to all elements of a structure (e.g., many matrix algorithms). Easy to write and understand. Unfortunately, difficult to apply to complex problems (as were the SIMD machines).

SPMD, Single Program Multiple Data: can be viewed as an extension of the SIMD approach to programming for MIMD systems.


Memory Systems: Distributed Memory

All memory is associated with processors.

To retrieve information from another processor’s memory a message must be sent over the network to the home processor. Usually organize the program so that the owner sends it to the requestor before being asked.

Advantages:
Memory is scalable with the number of processors
Each processor has rapid access to its own memory without interference or cache coherency problems
Cost effective and easier to build: can use commodity parts


Disadvantages

Programmer is responsible for many of the details of the communication, easy to make mistakes.

May be difficult to distribute the data structures, often need to revise them to add additional pointers.


Memory Systems: Shared Memory

Global memory space, accessible by all processors

Processors may have local memory to hold copies of some global memory.

Consistency of these copies is usually maintained by hardware.

Advantages:
Global address space is user-friendly, program may be able to use global data structures efficiently and with little modification.
Data sharing between tasks is fast


Disadvantages

System may suffer from lack of scalability between memory and CPUs. Adding CPUs increases traffic on the shared memory-to-CPU path. This is especially true for cache coherent systems.

Programmer is responsible for correct synchronization

Needs some special-purpose components.


Shared vs. Distributed

[Figure: shared memory (processors with caches connected through a network to a common memory) vs. distributed memory (nodes, each with processor + cache + memory, connected by a network)]


Shared Memory Access Time

Two classes of SM systems based on memory access time:

Uniform Memory Access (UMA):

Most commonly represented by Symmetric Multiprocessor machines (SMP), identical processors

Equal access times to memory

Some systems are CC-UMA (cache coherent UMA): if one processor updates a variable in shared memory, all the other processors know about the update.


SM Access Time continued

Non-Uniform Memory Access (NUMA):

Often made by physically linking two or more SMPs

One SMP can directly access memory of another SMP (not message-passing)

Memory access times are not uniform, memory access across a link is slower

Cache coherent systems: CC-NUMA


Shared Memory on Distributed Memory

As we’ll see later, it is usually easier to parallelize a program on a shared memory system.

However, most systems are distributed memory because of the cost advantages.

To gain both advantages people have investigated virtual shared memory, or global address space (GAS), using software to simulate shared memory access.

Current projects include Unified Parallel C (UPC) and Co-Array Fortran.



Virtual Shared Memory Performance

Communication time in distributed memory machines is quite high. Thus virtual shared memory access is highly nonuniform, being vastly faster if the data is stored with the processor requesting it.

Because of these access delays, the performance of these systems is not good, even if reasonable care is taken, but may be justified by greatly reduced programmer time.

Software and hardware models need not match, though there are often performance problems when they don’t.


Communication Network

There are many ways that the processors can be interconnected, but for the user the differences are usually minor. Two main classes that do have some impact:

Bus Processors (and memory) connected to a common bus or busses, much like a local Ethernet.

Memory access fairly uniform, but not very scalable due to contention.

Switching Network Processors (and memory) connected to routing switches, as in the telephone system.

Usually NUMA, blocking, though a cross-bar is non-blocking (but a cross-bar is not scalable).


Networks

[Figure: processors connected by a bus vs. processors connected through a multistage interconnect of switches]


Example: Symmetric Multiprocessors

Shared memory system, processors share work.

When a processor reads or writes to RAM, data is transported over a bus, with a local copy kept in the processor cache.

Rules are needed to ensure that different caches don’t contain different values for the same memory locations (cache coherency). This is easier on bus-based systems than on more general interconnection networks.

Because all processors use the same memory bus, there is limited scalability due to bus contention.

Multicore processors, which are SMPs, are becoming the standard processors in all systems.


Low-Cost Parallel Systems

Systems built from commodity parts are becoming widespread due to low cost and acceptable performance.

Clusters (NOW, Beowulfs, etc.): commodity processor boards with multicore processors and commodity interconnects (e.g., Gigabit Ethernet). Often rack mounted.

SMPs: quite common as departmental servers.

Clusters of SMP nodes: rapidly gaining in importance, small SMPs available rack-mounted. Sometimes called clumps.


However, communication on low-cost clusters is often slow, typically due to software which relies on the basic networking stack. Some companies (Myrinet, Force10, etc.) market high-speed networks and special software to reduce this.

Constellations use much larger, much more expensive, shared memory units as nodes in a distributed memory system. Usually a high-performance interconnect is used between the nodes.

Note: Many clusters are primarily used for embarrassingly parallel computation and do not need high-performance networking.


The Memory Hierarchy

The mismatch of processor speed and memory speed causes a bottleneck. There is an inverse relationship between memory speed and $/byte, and there are physical constraints on the size of memory. Thus memory is arranged in a hierarchy:

registers

cache (perhaps itself hierarchical)

RAM (“primary memory”)

disk (“secondary memory”)

tapes or CDs (“tertiary memory”)


Speed-Size Tradeoff

Cache: ~nanosec, ~MByte
RAM: ~100 nanosec, ~GByte
Disk: ~10 millisec, ~100 GByte
Tape: ~minute, ~100 TByte

When moving between levels beyond the registers, an entire block is moved at once (cache lines, pages). Effective high-performance computing (serial or parallel) includes arranging data and program so that the entire block is used while resident in the faster memory.
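For example, in C a 2-d array is stored row by row, so traversing it in row order uses every element of each cache line that is fetched, while column order touches a different line on nearly every access. A small illustrative sketch (the array size is arbitrary):

#include <stdio.h>

#define N 2048
static double a[N][N];

/* Good: the inner loop walks memory contiguously, so each cache line
   fetched from RAM is fully used before it is evicted. */
double sum_row_order(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Poor: the inner loop strides N*sizeof(double) bytes between accesses,
   touching a different cache line on every iteration. */
double sum_col_order(void)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) { printf("%f %f\n", sum_row_order(), sum_col_order()); return 0; }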


Multiprocessor Caching

Parallel computing compounds the memory hierarchy: remote memory is far slower to access than local memory.

Caching is widely used, fetching blocks of data instead of individual items. Data is fetched when referenced, and sometimes prefetched before it is needed.

If data locality is high then effective memory access time is decreased

Reduces network traffic.

However, it creates a cache coherence problem.

False sharing caused by cache lines can significantly degrade performance.


MESSAGE PASSING

On distributed memory systems, also called message passing systems, communication is often an important aspect of performance and correctness.


Communication Speed

On most distributed memory systems, messages are relatively slow, with startup (latency) times taking thousands of cycles (and far more for many clusters).

Typically, once the message has started, the additional time per byte (bandwidth) is relatively small.


Measured Performance

For example, a 4.7 GHz IBM Power 6 (p575) processor, best case MPI messages (discussed later):

processor speed: 4700 cycles per microsecond (µsec), 4 flops/cycle, 18800 flops per µsec.

MPI message latency, caused by software: ≈ 1.3 µsec = 24,400 flops

message bandwidth, usually limited by hardware: ≈ 2500 bytes per µsec = 7.5 flops/byte

Your performance may vary!


Reducing Latency

Reducing the effect of high latency is often important for performance. Some useful approaches:

Reduce the number of messages by mapping communicating entities onto the same processor.

Combine messages having the same sender and destination.

If processor P has data needed by processor Q, have P send to Q, rather than Q first requesting it. P should send as soon as the data is ready, and Q should read as late as possible to increase the probability that the data has arrived.

Send Early, Receive Late, Don’t Ask but Tell.


Messages and Computations

Even when data is sent far in advance of its use, message passing can cause performance degradation. Can try to overlap communication and calculation.

Unfortunately:

Many systems are incapable of doing this.

Latency is dominantly due to software; initiating a message ties up the processor.

Even with a co-processor, the memory bus may be tied up, interfering with the main processor’s use of it.

Expensive communication systems try to overcome these problems.


Deadlock

If messages are blocking, i.e., if a processor can’t proceed until the message is finished, then one can reach deadlock, where no processor can proceed.

Example: Processor A sends a message to B while B sends to A. With blocking sends, neither finishes until the other finishes receiving, but neither starts receiving until its send is finished.

This can be avoided by A doing send then receive, while B does receive then send. However, this is often difficult to coordinate when there are many processors.


Often it is easiest to prevent deadlock by non-blocking communication, where a processor can send and proceed before the receive is finished.

However, this requires receiver buffer space, which may fill, reducing to the blocking case, and extra copying of messages, reducing performance.


Message Passing Interface — MPI

An important communication standard. We will show some snippets of MPI to illustrate some of the issues, but MPI is a major topic that we cannot address in detail. Fortunately, many programs need only a few MPI features. There are many implementations of MPI:

MPICH homepage http://www-unix.mcs.anl.gov/mpi

Open MPI homepage http://www.open-mpi.org/


Some Reasons for Using MPI

Standardized, with process to keep it evolving.

Available on almost all parallel systems (free MPICH, Open MPI used on many clusters), with interfaces for C and Fortran.

Supplies many communication variations and optimized functions for a wide range of needs.

Supports large program development and integration of multiple modules.

Many powerful packages and tools based on MPI.



While MPI is large (> 100 functions), one usually needs very few functions (6-10), giving a gentle learning curve.

Various training materials, tools and aids for MPI.

Good introductory MPI tutorial: http://www.llnl.gov/computing/tutorials/mpi/

Basic and advanced MPI tutorials, e.g. on I/O and one-sided communication: http://www-unix.mcs.anl.gov/mpi/tutorial/

Writing MPI-based parallel codes helps preserve your investment as systems change.


MPI Basics

By far the most frequently used MPI commands are variants of

MPI_SEND() to send data, and MPI_RECV() to receive it.

These function very much like write & read statements.

Point-to-point communication

MPI_SEND() and MPI_RECV() are blocking operations.

Blocking communication can be unsafe and may lead to deadlocks.
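A minimal sketch of the send/receive ordering described earlier for avoiding deadlock in a pairwise exchange (assumes exactly two processes; error handling omitted):

#include "mpi.h"
#include <stdio.h>

/* Ranks 0 and 1 exchange one double. If both called MPI_Send first,
   blocking sends could deadlock; ordering the calls avoids this. */
int main(int argc, char **argv)
{
    int rank;
    double mine, theirs;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    mine = (double)rank;
    int partner = 1 - rank;           /* assumes exactly 2 processes */

    if (rank == 0) {                  /* A: send, then receive */
        MPI_Send(&mine, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
        MPI_Recv(&theirs, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &status);
    } else {                          /* B: receive, then send */
        MPI_Recv(&theirs, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&mine, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
    }
    printf("rank %d got %f\n", rank, theirs);
    MPI_Finalize();
    return 0;
}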


Blocking MPI Communication

MPI_SEND() does not complete until the communication buffer is empty

MPI_RECV() does not complete until the communication buffer is full

Send-recv handshake works for small messages, but might fail for large messages

Allowable size of the message depends on the MPI implementation (buffer sizes), could also be hardware-dependent

Even if it works, the data usually get copied into a memory buffer

Copies are slow (avoid), poor performance


Non-Blocking MPI Communication

Better solution: use non-blocking operations

MPI_ISEND()
MPI_IRECV()
MPI_WAIT()

The user can also check for the data at a later stage in the program without waiting:

MPI_TEST()

Non-blocking operations boost performance.

Other non-blocking send and receive operations are available.

Possible overlap of communication with computation.

However, few systems can provide the overlap, often already limited by the memory bandwidth.
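A minimal sketch of the non-blocking pattern for the same two-rank exchange (MPI_Waitall is used here instead of individual MPI_Wait calls; the computation step is just a placeholder):

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double mine, theirs;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    mine = (double)rank;
    int partner = 1 - rank;                     /* assumes exactly 2 processes */

    /* Post both operations; neither blocks, so there is no deadlock risk. */
    MPI_Irecv(&theirs, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&mine, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... independent computation could go here, possibly overlapping ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* complete both requests */
    printf("rank %d got %f\n", rank, theirs);
    MPI_Finalize();
    return 0;
}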


MPI Initialization

Near the beginning of the program, include

#include "mpi.h"
MPI_Init(&argc, &argv)
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank)
MPI_Comm_size(MPI_COMM_WORLD, &num_processors)

These help each processor determine its role in the overall scheme.

There is MPI_Finalize() at the end.

These 4 MPI functions, together with MPI send and receiveoperations, are already sufficient for simple applications.
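Putting these together, a minimal complete program might look like the following sketch (a real code would also read input, distribute data, and so on):

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv)
{
    int my_rank, num_processors;

    MPI_Init(&argc, &argv);                          /* start MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);         /* which process am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &num_processors);  /* how many processes? */

    printf("Process %d of %d\n", my_rank, num_processors);

    /* ... computation and communication go here ... */

    MPI_Finalize();                                  /* shut MPI down */
    return 0;
}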


MPI Example

Each processor sends value to proc. 0, which adds them.

[Figure: nine processors, numbered 0 through 8, each sending its value to processor 0]


Basic Program

initialize
if (my_rank == 0) {
    sum = 0.0;
    for (source = 1; source < num_procs; source++) {
        MPI_Recv(&value, 1, MPI_FLOAT, source, tag,
                 MPI_COMM_WORLD, &status);
        sum += value;
    }
} else {
    MPI_Send(&value, 1, MPI_FLOAT, 0, tag, MPI_COMM_WORLD);
}
finalize


Improving Performance

In the initial version, processor 0 received the messages in processor order. However, if processor 1 delayed sending its message, then processor 0 would also be delayed.

For a more efficient version: modify MPI_RECV to

MPI_Recv(&value, 1, MPI_FLOAT, MPI_ANY_SOURCE, tag,
         MPI_COMM_WORLD, &status);

Now processor 0 can start processing messages as soon as any arrives.


Reduction Operations

Operations such as summing are common, combining data from every processor into a single value. These reduction operations are so important that MPI provides direct support for them, and parallelizing compilers recognize them and generate efficient code.

Could replace all communication with

MPI_Reduce(&value, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD)

Examples of Collective Operations:

MPI_SUM, MPI_MAX, MPI_MIN, MPI_PROD

MPI_LAND (logical and), MPI_LOR (logical or)
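Note that MPI_Reduce is a collective call: every process in the communicator calls it, and the result appears only on the root (here rank 0). A sketch of the summation example rewritten this way:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv)
{
    int my_rank;
    float value, sum = 0.0f;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    value = (float)my_rank;                 /* each process contributes its own value */

    /* All processes call MPI_Reduce; the sum is delivered only to root 0. */
    MPI_Reduce(&value, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (my_rank == 0)
        printf("sum = %f\n", sum);
    MPI_Finalize();
    return 0;
}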


Collective Communication

The opposite of reduction is broadcast: one processor sends to all others.

Reduction, broadcast, and others are collective communication operations, the next most frequently invoked MPI routines after send and receive.

MPI collective communication routines improve clarity, run faster, and reduce chance of programmer error.


Collective Communication

[Figure: broadcast (the value A on P0 is copied to P0–P3), scatter (the values A, B, C, D on P0 are distributed one per processor), and gather (the reverse of scatter)]


Collective Communication

[Figure: all gather (each processor’s value is gathered onto every processor) and all to all (processor i sends its j-th block to processor j, transposing the blocks across processors)]


MPI Synchronization

Synchronization is provided

implicitly by
Blocking communication
Collective communication

explicitly by
MPI_Wait, MPI_Waitany operations for non-blocking communication:
May be used to synchronize a few or all processors
MPI_Barrier statement:
Blocks until all MPI processes have reached the barrier

Avoid synchronizations as much as possible to boost performance.


MPI Datatypes

Predefined basic datatypes, corresponding to the underlying programming language, examples are

Fortran
MPI_INTEGER
MPI_REAL, MPI_DOUBLE_PRECISION

C
MPI_INT
MPI_FLOAT, MPI_DOUBLE

Derived data types:
Vector: data separated by constant stride
Contiguous: vector with stride 1
Struct: general mixed types (e.g. for C struct)
Indexed: Array of indices


MPI Datatype: Vector

Consider a block of memory (e.g. a matrix with integer numbers):

[Figure: a 4 × 6 matrix holding the integers 1–24 in column-major (Fortran) order, with one row shaded gray]

To specify the gray row (in Fortran order), use

MPI_Type_vector(count, blocklen, stride, old_datatype, new_datatype, ierr)
MPI_Type_commit(new_datatype, ierr)


MPI Datatype: Vector

In the example, we get

MPI_Type_vector(6, 1, 4, MPI_INTEGER, my_vector, ierr)
MPI_Type_commit(my_vector, ierr)

The new datatype my_vector is a vector that contains 6 blocks, each of 1 integer number, with a stride of 4 integers between blocks.

Here, we introduce the Fortran notation of the MPI routines (with additional error flag "ierr").

Fortran, C and C++ notations are very similar.
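For comparison, a sketch of the same derived type in the C binding (in C the error code is the return value rather than an ierr argument):

#include "mpi.h"

/* Same layout as the Fortran example: 6 blocks of 1 int, stride of 4 ints. */
void build_row_type(MPI_Datatype *my_vector)
{
    MPI_Type_vector(6, 1, 4, MPI_INT, my_vector);
    MPI_Type_commit(my_vector);
    /* The committed type can then be used, e.g., to send the gray row
       as a single element of this datatype in MPI_Send. */
}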


Some Additional MPI Features

Procedures for creating virtual topologies, e.g., indexing processors as a 2-dimensional grid.

User-created communicators (e.g., replace MPI_COMM_WORLD), useful for selective collective communication (e.g., summing along rows of a matrix), incorporating software developed separately.

Support for heterogeneous systems, MPI converts basic datatypes.

Additional user-specified derived datatypes


MPI-2

The MPI-2.1 standard was just approved by the MPI Forum on September 4, 2008; it updates the MPI-2.0 standard from 1997. Important added features in MPI-2.x include

Parallel I/O Critical for scalability of I/O-intensive problems.

One-sided communication Essentially “put” and “get” operations that can greatly improve efficiency on some codes. Conceptually these are the same as directly accessing remote memory.

However, these are risky and can easily introduce race conditions.


One-Sided Communication

[Figure: one-sided communication; a processor puts data directly into, or gets data directly from, another processor’s memory]


MPI Summary

The MPI standard includes

point-to-point message-passing

collective communications

group and communicator concepts

process topologies (e.g. graphs)

environmental management (e.g. timers, error handling)

process creation and management

one-sided communications

external interfaces

parallel I/O routines

profiling interface


PARALLELIZATION I

Real code is long, complex. How do we engineer the parallelization process?

Usually there is a (perhaps vague) performance goal, not, per se, a parallelization goal.

Stout and Jablonowski – p. 100/324


Overview of Approach

Incremental approach: tackle a bit of the problem at a time so that one can recover from mistakes and poor attempts.

Verify: Develop test cases and constantly check results.

Profile: to determine where time is being spent. May be coupled with modeling of the code to determine where effort will yield most reward.

Check-point/restart: Aids testing and debugging since some problems only occur late in the program execution.

Stout and Jablonowski – p. 101/324


Serial Performance

Often profiling reveals serial performance problems; eliminating these may be critical to attaining performance goals.

Doubling serial performance is far more useful than doubling the number of processors.

If possible, exploit parallel (or serial) libraries, since they are usually highly tuned for the target machine.

Stout and Jablonowski – p. 102/324


Incremental Parallelization

In shared-memory machines, one can often incrementally parallelize and increase efficiency. Portions not parallelized will slow the program but will at least be correct. This is a major advantage of shared memory over distributed memory.

Some benefits to this approach:

Smaller changes make it easier to locate mistakes.

It is easier to determine where efficiency is poor.

Should have test cases available and constantly verify correctness.

Stout and Jablonowski – p. 103/324


One continues incrementally until the desired speedup is attained or it has been determined that the original goal is impractical. Straightforward effort/reward tradeoff, but rarely carefully considered.

If performance is critical, then often the final shared memory code is very similar to the distributed memory code.

Stout and Jablonowski – p. 104/324


Parallelization Process

[Flowchart: Set goals → Analyze, profile → Prioritize changes → Incrementally change → Verify correctness → Performance acceptable? No: return to Analyze, profile. Yes: Code ready for use.]

Stout and Jablonowski – p. 105/324


Process for Distributed Memory

While more complicated, an incremental approach can also be utilized for distributed memory machines.

It is harder to get started, but the basic approaches are similar. The first things one needs to do are

Do coarse-grained profiling, to determine the time consumed in the different sections of the program.

Develop maps of the major data structures and where they are used.

The profiling is used to prioritize the areas that need to be parallelized.

Stout and Jablonowski – p. 106/324


Parallelization Steps

Once the parallelization plan is ready, start parallelizing sections of code and data structures.

Initially, all processors have the complete standard serial data structures (global data structures).

As code and data structures are parallelized (local data structures), develop serial-parallel & parallel-serial conversion routines (scaffolding).

Verify correctness on test cases by showing serial–parallel–serial = serial for global data structures.

Profile to see if efficiency of this piece is acceptable. If not, then develop a better alternative.

Stout and Jablonowski – p. 107/324


Incremental DM Parallelization

[Diagram: the serial code is parallelized piece by piece; each parallelized piece is bracketed by serial-parallel and parallel-serial conversion routines, while the remaining pieces continue to run as serial code.]

Stout and Jablonowski – p. 108/324



Useful to retain the serial-parallel scaffolding (normally turned off), to help maintain the correspondence between the serial and parallel codes as they evolve.

This is probably a complex, important program, since it is worth parallelization effort. Therefore software engineering concerns, such as life-cycle maintenance, are very important.

Stout and Jablonowski – p. 109/324


LOAD-BALANCING I

Here we address the question of how one goes about subdividing the computational domain among the processors. We introduce the basic techniques that are applicable to most programs, with some more advanced techniques appearing later.

Stout and Jablonowski – p. 110/324


Unbalanced Load

[Bar chart: workload per processor for processors 0-7, with the average workload marked; the loads are uneven.]

Which processor is the most important for parallel performance?

Stout and Jablonowski – p. 111/324


Domain and Functional Decomposition

Domain decomposition Partition a (perhaps conceptual) space. Different processors do similar work on different pieces (quilting bee, teaching assistants for discussion sections, etc.)

Functional decomposition Different processors work on different types of tasks (workers on an assembly line, sub-contractors on a project, etc.)

Functional decomposition rarely scales to many processors, so we'll concentrate on domain decomposition.

Stout and Jablonowski – p. 112/324


Dependency Analysis

There is a dependency between A and B if the value of B depends upon A. B cannot be computed before A.

Dependencies control parallelization options.

Stout and Jablonowski – p. 113/324


[Figure: computational dependencies, with space on one axis and time on the other.]

Stout and Jablonowski – p. 114/324


Space and Time

Almost always

Time or time-like variables and operations (signals, non-commutative operations, etc.) cannot be parallelized.

Space or space-like variables and operations (names, objects, etc.) can be parallelized.

Some operations can have both time-like and space-like properties. E.g., ATM transactions are usually to independent accounts (space-like), but ones to the same account must be done in order (time-like).

Stout and Jablonowski – p. 115/324


Load-Balancing Variety

Many different types of load-balancing problems:

static or dynamic,

parameterized or data dependent,

homogeneous or inhomogeneous,

low or high dimensional,

graph oriented, geometric, lexicographic, etc.

Because of this diversity, need many different approachesand tools.

Stout and Jablonowski – p. 116/324


Complicating Factors

Objects being computed may not have a simple dependency pattern among themselves, making communication load-balancing difficult to achieve.

Objects may not have uniform computational requirements, and it may not initially be clear which ones need more time.

If objects are repeatedly updated (such as the elements in the crash simulation), the computational load of an object may vary over iterations.

Objects may be created dynamically and in an unpredictable manner, complicating both computational and communicational load balance.

Stout and Jablonowski – p. 117/324


Static Decompositions

Here we will consider only static decompositions of the work, with dynamic decompositions discussed later. A variety of basic techniques are available, each suitable for a different range of problems.

Often just evenly dividing space among the processors yields acceptable load balance, with acceptable performance if communication is minimized. This approach works even if the objects have varying computational requirements, as long as there are enough objects so that the worst processor is likely to be close to the average (law of large numbers). A small sketch of such an even division appears below.
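As referenced above, a minimal sketch of dividing n items as evenly as possible among p processors (0-based ranks; the routine name is illustrative; the first mod(n,p) ranks get one extra item):

    ! returns the index range [lo, hi] owned by 'rank' out of p ranks
    subroutine block_range(n, p, rank, lo, hi)
      integer, intent(in)  :: n, p, rank
      integer, intent(out) :: lo, hi
      integer :: base, extra
      base  = n / p
      extra = mod(n, p)
      lo = rank*base + min(rank, extra) + 1
      hi = lo + base - 1
      if (rank < extra) hi = hi + 1
    end subroutine block_range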

Stout and Jablonowski – p. 118/324


Which Matrix Decomposition is Best?

Suppose work at each position only depends on the value there and nearby ones, with equivalent work at each position.

Minimizing Boundary: a 4 x 4 arrangement of square blocks,
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

Minimizing Number of Neighbors: 16 horizontal strips, processors 0 1 2 3 ... 15.

Stout and Jablonowski – p. 119/324


Matrix Decomposition Analysis

Computation is proportional to area, so both are load balanced.

Squares minimize bytes communicated (parallelization overhead), so they are generally better.

However: Recall, there is significant overhead in starting a message, especially on clusters, so for smaller matrices one may need to concentrate on the number, not the size, of messages, i.e., use strips.

Stout and Jablonowski – p. 120/324


Local vs. Global Matrices

If the serial code has matrix A[0 : n−1], and there are p DM processors with ranks 0 . . . p−1, then

each processor has a matrix A[0 : nlocal−1], where nlocal = n/p

A[i] on the processor with rank r corresponds to A[i + r ∗ nlocal] in the original array

if A[i+1] and A[i−1] are used in the calculation of A[i], (i ≠ 0, n−1), then one would declare A[−1 : nlocal] to add ghost cells (see the sketch below).
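A minimal Fortran sketch of this local layout (the helper name is illustrative; assumes p divides n): each rank owns nlocal interior elements plus one ghost cell on each side.

    ! Hypothetical helper: set up the local slice of a global array A(0:n-1)
    ! distributed over p ranks, with one ghost cell on each side.
    subroutine setup_local(n, p, rank, a, nlocal)
      integer, intent(in)  :: n, p, rank
      integer, intent(out) :: nlocal
      real, allocatable, intent(out) :: a(:)
      nlocal = n / p                      ! assume p divides n
      allocate(a(-1:nlocal))              ! a(-1) and a(nlocal) are ghost cells
      ! local index i (0..nlocal-1) corresponds to global index i + rank*nlocal;
      ! after exchanging ghost cells with ranks rank-1 and rank+1, an update
      ! using a(i-1) and a(i+1) is valid for all local i
    end subroutine setup_local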

Stout and Jablonowski – p. 121/324


Linear Rank vs. 2-D Indices

To map processor ranks 0..15 to rows 0..3 and columns 0..3:

 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

For processor rank i, row_i = ⌊i/√p⌋ and col_i = i − row_i ∗ √p

Right: (row_i, col_i + 1), rank i + 1

Left: (row_i, col_i − 1), rank i − 1

Up: (row_i − 1, col_i), rank (row_i − 1) ∗ √p + col_i

Down: (row_i + 1, col_i), rank (row_i + 1) ∗ √p + col_i

MPI “virtual topologies” can do this for you (see the sketch below).
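For illustration, a sketch of the same mapping using MPI's virtual topologies (assumes p is a perfect square, and that p and myrank hold the process count and this process's rank; assumes "use mpi"):

    integer :: comm2d, dims(2), coords(2), left, right, up, down, ierr
    logical :: periods(2)
    periods = .false.
    dims    = int(sqrt(real(p)))         ! sqrt(p) x sqrt(p) process grid
    call MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, .true., comm2d, ierr)
    call MPI_Cart_coords(comm2d, myrank, 2, coords, ierr)   ! this rank's (row, col)
    call MPI_Cart_shift(comm2d, 0, 1, up, down, ierr)       ! neighbors in the first grid dimension
    call MPI_Cart_shift(comm2d, 1, 1, left, right, ierr)    ! neighbors in the second grid dimension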

Stout and Jablonowski – p. 122/324


Graph Decompositions

Very general graph decomposition techniques can be used when communication patterns are less regular.

Objects (calculations) represented as vertices (with weights if calculation requirements are uneven)

Communication represented as edges (with weights if communication requirements are uneven).

Goals:

1. assign vertices to processors to evenly distribute the number/weight of vertices, and

2. minimize and balance the number/weight of edges between processors.

Stout and Jablonowski – p. 123/324


What is Best Decomposition?

[Figure: a graph whose vertices and edges carry weights (1-6); the question is which assignment of vertices to processors best balances the vertex weight while minimizing and balancing the weight of the cut edges.]

Stout and Jablonowski – p. 124/324


Graph Decomposition Tools

Unfortunately, optimal graph decomposition is NP-hard.

Fortunately, various heuristics work well, and high-quality decomposition tools are available, such as Metis.

To use a serial tool such as Metis, convert data into the format it requires, run Metis to partition the graph vertices, then convert to the format your program requires.

Scripts (Perl, Python, etc.) useful to convert formats.

Parallel version, ParMetis, also available.

http://www.cs.umn.edu/~karypis/metis/metis.html

Stout and Jablonowski – p. 125/324


Using Serial Decomposition Tool

[Diagram: Problem → Convert to graph as sparse matrix → Metis → partitioned graph → Convert to program input format → Parallel Program.]

Stout and Jablonowski – p. 126/324


Where Do Weights Come From?

If weights are static and objects of the same type have about the same requirements, and if types are known in advance, then:

Sometimes all the same.

Sometimes easy to deduce a priori.

May use simple measurements on small test cases.

May use statistical curve fitting on sample problems.

If types aren’t known in advance, this won’t be useful.

Stout and Jablonowski – p. 127/324


Static Geometric Decompositions

When the objects have an underlying geometrical basis, such as the finite elements representing surfaces of car parts, stars in a galaxy, wires in a VLSI layout, or polygons representing census blocks in a geographical information system, then the geometry can often be exploited if communication predominantly involves nearby objects.

Geometric decompositions can be based on k-D trees, quad- or oct-trees, ham sandwich theorems, space-filling curves, etc., and can incorporate weights.

Stout and Jablonowski – p. 128/324


Recursive Bisectioning

Magenta points require twice as much work as cyan ones.

Stout and Jablonowski – p. 129/324


Recursive Bisectioning

Split work evenly along the x-axis (weighted median).

Stout and Jablonowski – p. 130/324


Recursive Bisectioning cont.

Split each side along y-axis, using median on that side.

Stout and Jablonowski – p. 131/324


Recursive Bisectioning cont.

Now split along x-axis (or z-axis if data 3-dimensional).

Stout and Jablonowski – p. 132/324


Recursive Bisectioning cont.

Cycle through axes until # pieces = # processors.

Stout and Jablonowski – p. 133/324


Recursive Bisectioning cont.

May decide to use only 1 or 2 dimensions to split along, similar to the strip partitioning for matrices.

Closely related to the k-D tree serial data structure.
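A minimal sketch of the recursive bisectioning of the preceding slides, under simplifying assumptions (2-D points, uniform weights, number of parts a power of two; partition_about_median is an assumed helper that reorders the points about the median of the current axis, e.g. by sorting):

    recursive subroutine bisect(x, y, owner, lo, hi, axis, nparts, first)
      ! Assign points lo..hi to processors first .. first+nparts-1 by
      ! recursively splitting at the median of the current axis.
      real,    intent(inout) :: x(:), y(:)
      integer, intent(inout) :: owner(:)
      integer, intent(in)    :: lo, hi, axis, nparts, first
      integer :: mid
      if (nparts == 1) then
         owner(lo:hi) = first
         return
      end if
      mid = (lo + hi) / 2
      call partition_about_median(x, y, lo, hi, axis, mid)   ! assumed helper
      ! cycle the splitting axis (1 <-> 2) on each level of recursion
      call bisect(x, y, owner, lo,    mid, 3-axis, nparts/2, first)
      call bisect(x, y, owner, mid+1, hi,  3-axis, nparts/2, first+nparts/2)
    end subroutine bisect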

Stout and Jablonowski – p. 134/324


Space-Filling Curves

The best general-purpose geometric load-balancing comes from space-filling curves.

The order in which points are visited in the space-filling curve determines how the geometric objects are grouped together to be assigned to the processors.

Stout and Jablonowski – p. 135/324


The Hilbert Space-Filling Curve

[Figure: the Hilbert space-filling curve visiting the 64 cells of an 8 x 8 grid in order 0-63.]

For an implementation, see the references.

Stout and Jablonowski – p. 136/324


Using A Space-Filling Curve

Letters represent work, boldface twice as much work.

[Figure: 26 objects labeled A-Z scattered over a 2-D domain; boldface letters represent objects with twice as much work.]

Stout and Jablonowski – p. 137/324


Step 1: Determine Space-Filling Coordinates

[Figure: each object A-Z is assigned the index of its cell along the space-filling curve (its space-filling coordinate).]

Stout and Jablonowski – p. 138/324


Step 2: Sort by Space-Filling Coordinates

[Figure: the objects listed in increasing order of their space-filling coordinates.]

Stout and Jablonowski – p. 139/324


Step 3: Divide Work Evenly Based on Sorted Order

[Figure: the sorted list of objects is cut into consecutive pieces of approximately equal total work, one piece per processor.]

Stout and Jablonowski – p. 140/324


Z- Ordering

Aka Morton or shuffled bit ordering. For 2-D, the point (x2x1x0, y2y1y0) is mapped to y2x2y1x1y0x0.

[Figure: the Z-order curve visiting the 64 cells of an 8 x 8 grid in order 0-63.]

For 3-D, (xk . . . x1x0, yk . . . y1y0, zk . . . z1z0) → zkykxk . . . z1y1x1z0y0x0
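A small sketch of computing this 2-D Z-order (Morton) index by bit interleaving (the function name is illustrative; 3 bits per coordinate cover an 8 x 8 grid):

    integer function morton2d(x, y, nbits)
      integer, intent(in) :: x, y, nbits
      integer :: b
      morton2d = 0
      do b = 0, nbits-1
         if (btest(x, b)) morton2d = ibset(morton2d, 2*b)     ! x bits go to even positions
         if (btest(y, b)) morton2d = ibset(morton2d, 2*b+1)   ! y bits go to odd positions
      end do
    end function morton2d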

Stout and Jablonowski – p. 141/324


Hilbert vs. Z

Both extend to arbitrary dimensions.

Both give regions with boundary (communications) within a constant factor of optimal.

Hilbert ordering assigns only 1 contiguous region to a processor, Z-ordering may assign 2.

Z slightly easier to compute than Hilbert.

Hilbert can be used for the surface of a cube or sphere, Z doesn't seem to be as useful.

In practice, little difference in performance.

Stout and Jablonowski – p. 142/324


High-Dimensional Data

For high dimensions Hilbert ordering requires extensive memory to store the tables used to compute the index.

However, this is often not relevant since

geometric approaches are not nearly as useful on high-dimensional data.

Stout and Jablonowski – p. 143/324


Shared Memory Parallelization

Parallel programming on shared memory (SM) machines has always been important in high performance computing.

All processors can access all the memory in the parallelsystem (access time can be different).

In the past: Utilization of such platforms has never beenstraightforward for the programmer.

Vendor-specific solutions via directive-based compilerextensions dominated until the mid 90’s.

Also: data parallel extensions to Fortran 90, High Performance Fortran (HPF), but these lacked efficiency.

Stout and Jablonowski – p. 144/324


Parallelization Techniques: OpenMP

Since 1997: OpenMP is the new industry standard for shared memory programming.

In 2008: The OpenMP Version 3.0 specification was released (new feature: task parallelism).

OpenMP is an Application Program Interface (API): directs multi-threaded shared memory parallelism ⇒ thread-based parallelism

Explicit (not automatic) programming model: the programmer has full control over the parallelization, the compiler interprets the parallel constructs.

Based on a combination of compiler directives, library routines and environment variables.

OpenMP uses the fork-join model of parallel execution.

Stout and Jablonowski – p. 145/324


OpenMP

OpenMP can be interpreted by most commercial Fortran and C/C++ compilers, supports all shared-memory architectures including Unix and Windows platforms, and hence

should be your programming system of choice for shared memory platforms

OpenMP home page and recommended online tutorial:
http://www.openmp.org

http://www.llnl.gov/computing/tutorials/openMP/

Stout and Jablonowski – p. 146/324


Goals of OpenMP

Standardization: standard among all shared memory architectures and hardware platforms

Lean: simple and limited set of compiler directives for shared memory machines. Often significant parallelism by using just 3-4 directives.

Ease of use: supports incremental parallelization of a serial program, unlike MPI which typically requires an all or nothing approach.

Portability: supports Fortran (77, 90, 95), C (C90, C99) and C++

Stout and Jablonowski – p. 147/324


OpenMP: 3 Building Blocks

Compiler directives (embedded in user code) for
   parallel regions (PARALLEL)
   parallel loops (PARALLEL DO)
   parallel sections (PARALLEL SECTIONS)
   parallel tasks (PARALLEL TASK)
   sections to be done by only one processor (SINGLE)
   synchronization (BARRIER, CRITICAL, ATOMIC, locks, etc.)
   data structures (PRIVATE, SHARED, REDUCTION)

Run-time library routines (called in the user code) like OMP_SET_NUM_THREADS, OMP_GET_NUM_THREADS, etc.

UNIX environment variables (set before program execution) like OMP_NUM_THREADS, etc.

Stout and Jablonowski – p. 148/324


OpenMP: The Fork-Join Model

Parallel execution is achieved by generating threads which are executed in parallel (multi-threaded parallelism):

[Diagram: the master thread FORKs a team of threads at each parallel region and JOINs them at the end of the region, then continues alone.]

Stout and Jablonowski – p. 149/324


OpenMP: The Fork-Join Model

Master thread executes sequentially until the first parallel region is encountered.

FORK: The master thread creates a team of threads which are executed in parallel.

JOIN: When the team members complete the work, they synchronize and terminate. The master thread continues sequentially.

Number of threads is independent of the number of processors.

Quiz: What happens if
   # threads or tasks > # processors
   # threads or tasks < # processors

Stout and Jablonowski – p. 150/324


OpenMP: Work-sharing Constructs

DO/for loops: type of “data parallelism”

SECTION: breaks work into independent sections that are executed concurrently by a thread (“functional parallelism”); units of work are statically defined at compile time

TASK: breaks work into independent tasks that are executed asynchronously in the form of dynamically generated units of work (“irregular parallelism”)

SINGLE: serializes a section of the code. Useful for sections of the code that are not threadsafe (I/O).

OpenMP recognizes compiler directives that start with
   !$OMP (in Fortran)
   #pragma omp (in C/C++)

Stout and Jablonowski – p. 151/324


OpenMP: Work-sharing Constructs

[Diagram: fork-join pattern of the work-sharing constructs. The master thread forks a team for a DO/for loop, for SECTIONS, or around a SINGLE block, and the team joins at the end of the construct.]

No barrier upon entry to these constructs, but an implied barrier (synchronization) at the end of each ⇒ functionality of the OpenMP directive !$OMP BARRIER

Stout and Jablonowski – p. 152/324


Parallel Loops (1)

⇒ in Fortran notation

!$OMP PARALLEL DO
DO i = 1, n
   a(i) = b(i) + c(i)
END DO
!$OMP END PARALLEL DO

Stout and Jablonowski – p. 153/324


Parallel Loops (2)

Each thread executes a part of the loop.

By default, the work is evenly and continuously divided among the threads ⇒ e.g. 2 threads:
   thread 1 works on i = 1 . . . n/2
   thread 2 works on i = (n/2 + 1) . . . n

The work (number of iterations) is statically assigned to the threads upon entry to the loop.

Number of iterations cannot be changed during the execution.

Implicit synchronization at the end, unless the “NOWAIT” clause is specified (see the sketch below).

Highly efficient, low overhead.
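A minimal sketch of the NOWAIT clause mentioned above (array names are illustrative, and the second loop is assumed independent of the first, so skipping the barrier is safe):

!$OMP PARALLEL
!$OMP DO
DO i = 1, n
   a(i) = b(i) + c(i)
END DO
!$OMP END DO NOWAIT        ! no barrier here; threads proceed to the next loop immediately
!$OMP DO
DO i = 1, n
   d(i) = 2.0*e(i)
END DO
!$OMP END DO
!$OMP END PARALLEL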

Stout and Jablonowski – p. 154/324


Parallel Sections (1)

⇒ in Fortran notation

!$OMP PARALLEL SECTIONS

!$OMP SECTION
DO i = 1, n
   a(i) = b(i) + c(i)
END DO

!$OMP SECTION
DO i = 1, k
   d(i) = e(i) + e(i-1)
END DO

!$OMP END PARALLEL SECTIONS

Stout and Jablonowski – p. 155/324


Parallel Sections (2)

The two independent sections can be executed concurrently by two threads.

Units of work are statically defined at compile time.

Each parallel section is assigned to a specific thread, which executes the work from start to finish.

Thread cannot suspend the work.

Implicit synchronization unless the “NOWAIT” clause is specified.

Nested parallel sections are possible, but
   can be costly due to the high overhead of parallel region creation,
   are difficult to load balance, possibly unneeded sync.,
   therefore: impractical

Stout and Jablonowski – p. 156/324


Parallel Tasks (1)

Main change in OpenMP 3.0 (May 2008)

Allows one to parallelize irregular problems like
   unbounded loops (e.g. while loops)
   recursive algorithms

Unstructured parallelism

Dynamically generated units of work

Task can be executed by any thread in the team, in parallel with others

Execution can be immediate or deferred until later

Execution might be suspended and continued later by the same or a different thread

Stout and Jablonowski – p. 157/324


Parallel Tasks (2)

Example: Pointer chasing in C notation

#pragma omp parallel
{
  #pragma omp single
  {
    p = listhead;
    while (p) {
      /* create a task for each element of the list */
      #pragma omp task
      process(p);    /* process the list element p */
      p = next(p);
    }
  }
}

Stout and Jablonowski – p. 158/324


Parallel Tasks (3)

Single construct ensures that only one thread traverses the list

Single thread encounters the task directive and invokes the independent tasks

“Task” construct gives more freedom for scheduling, can replace loops with if statements that are not well load-balanced

Parallel tasks can be nested within parallel loops or sections

Stout and Jablonowski – p. 159/324


Parallel Loops and Scope of Variables

Parallel DO loops (“for” loops in C/C++) are often the most important parallel construct.

The iterations of a loop are shared across the team (threads).

A parallel DO construct can have different clauses like REDUCTION.

sum = 0.0
!$OMP PARALLEL DO REDUCTION(+:sum)
DO i = 1, n
   sum = sum + a(i)
END DO
!$OMP END PARALLEL DO

Stout and Jablonowski – p. 160/324


Parallel Loops and Load Balancing

Example of a parallel loop with dynamic load-balancing:

!$OMP PARALLEL DO PRIVATE(i,j), SHARED(X,N),
!$OMP& SCHEDULE (DYNAMIC,chunk)
DO i = 1, n
   DO j = 1, i
      x(i) = x(i) + j
   END DO
END DO
!$OMP END PARALLEL DO

Stout and Jablonowski – p. 161/324


Parallel Loops and Load Balancing

Iterations are divided into pieces of size chunk.

When a thread finishes a piece, it dynamically obtains the next set of iterations.

DYNAMIC scheduling improves the load balancing, default: STATIC.

Tradeoff: Load Balancing and Overhead
   The larger the chunk, the lower the overhead.
   The smaller the size (granularity), the better the dynamically scheduled load balancing.

Stout and Jablonowski – p. 162/324


New in OpenMP 3.0: Loop Collapsing

Loops can be collapsed via the clause COLLAPSE

!$OMP PARALLEL DO COLLAPSE(2)
DO k = 1, p
   DO j = 1, m
      DO i = 1, n
         x(i,j,k) = i*j + k
      END DO
   END DO
END DO
!$OMP END PARALLEL DO

Stout and Jablonowski – p. 163/324


Loop Collapsing

Iteration space from the two loops is collapsed into a single one

Good if
   loops k and j do not depend on each other (no recursions)
   execution order can be interchanged
   loop limits p and m are small, #processors is large

Rules:
   perfectly nested loops (j loop immediately follows k loop)
   rectangular iteration space (m independent of p)

Stout and Jablonowski – p. 164/324


Quiz: Is there something wrong ?

Assume: 4 parallel shared memory threads, all arrays and variables are initialized.

! start the parallel region
!$OMP PARALLEL PRIVATE(pid), SHARED(a,b,n)
! get the thread number (0..3)
pid = OMP_GET_THREAD_NUM()
! parallel loop
!$OMP DO PRIVATE(i)
DO i = 1, n
   A(pid) = A(pid) + B(i)   ! compute
END DO
!$OMP END DO
! end the parallel region
!$OMP END PARALLEL

Stout and Jablonowski – p. 165/324


False Sharing Example

Suppose you have P shared memory processors, with pid = 0 . . . P-1

Each processor runs the Fortran code:
DO i = 1, n
   A(pid) = A(pid) + B(i)
END DO

No read nor write (load and store) conflicts, since no two processors read or write the same element, but:

Performance is horrible!

Stout and Jablonowski – p. 166/324


False Sharing Example

Reason:

Several consecutive elements of A are stored in the same cache line.

In each iteration, each processor gets an exclusive copy of the entire cache line to write to, all other processors must wait.

B is read-only, so sharing it is not a problem.

⇒ Can be avoided by declaring A(c,0:P-1), where c elements equal 1 cache line, and using A(1,pid), as in the sketch below.

False sharing is usually obvious once pointed out, but very easy to write in and overlook. Avoid!
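A minimal sketch of that padding fix (the cache-line size c = 16 reals is an assumed example value; P, pid, n and B are as on the previous slides and must be known at this point):

    integer, parameter :: c = 16        ! assumed: c reals fill one cache line
    real :: A(c, 0:P-1)                 ! one padded column per processor
    ! each processor pid now updates A(1,pid); the rest of its column is padding,
    ! so no two processors write into the same cache line
    DO i = 1, n
       A(1,pid) = A(1,pid) + B(i)
    END DO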

Stout and Jablonowski – p. 167/324


False Sharing Example

[Figure: with the 1-D layout A(0:P−1) all the elements lie in the same cache line, causing cache conflicts; with the 2-D layout A(c,0:P−1) each processor's element A(1,pid) lies in a different cache line (not shared).]

Stout and Jablonowski – p. 168/324


Race Conditions

In a shared memory system, one common cause of errors is when a processor reads a value from a memory location that has not yet been updated.

This is a race condition, where correctness depends on which processor performed its action first.

Often hard to debug because the debugger often runs the program in a serialized, deterministic ordering.

To ensure that “readers” do not get ahead of “writers”, process synchronization is needed.

DM systems: messages are often used to synchronize, with readers blocking until the message arrives.

Shared memory systems: barriers, software semaphores, locks or other schemes are used.

Stout and Jablonowski – p. 169/324


Race Condition Example

Two PARALLEL SECTIONS:

!$OMP PARALLEL SECTIONS
!$OMP SECTION
A = B + C
!$OMP SECTION
B = A + C
!$OMP END PARALLEL SECTIONS

Unpredictable results since the execution order matters.

Program will not fail: Wrong answers without a warning signal!

Stout and Jablonowski – p. 170/324


OpenMP: Traps

OpenMP is a great way of writing fast executing code, and also your gateway to special painful errors.

OpenMP threads communicate by sharing variables.

Variable Scoping: Most difficult part of shared memory parallelization
   Which variables are shared
   Which variables are private

If using libraries: Use the threadsafe library versions.

Avoid sequential I/O (especially when using a single file) in a parallel region: Unpredictable order.

Stout and Jablonowski – p. 171/324


OpenMP: Traps

Common problems are:

False sharing: Two or more processors access different variables that are located in the same cache line. At least one of the accesses is a “write”, which invalidates the entire cache line.

Race condition: The program’s result changes when threads are scheduled differently.

Deadlock: Threads lock up waiting for a locked resource that will never become available.

Stout and Jablonowski – p. 172/324


Something to think about over the break

Question: How would you distribute the work in a climate model?

[Figure: the latitude-longitude grid of a climate model, with longitudes around the globe and latitudes from the South Pole through the Equator to the North Pole.]

Stout and Jablonowski – p. 173/324


HYBRID COMPUTING

Many of today’s most powerful computers employ both shared memory (SM) and distributed memory (DM) architectures.

These machines are so-called hybrid computers.

The corresponding hybrid programming model is a combination of shared and distributed memory programming (e.g. OpenMP and MPI).

Today: hybrid architectures are dominant at the high end of computing.

In the future: the hybrid memory architecture is likely to prevail despite popular DM machines like IBM’s “Blue Gene”.

Stout and Jablonowski – p. 174/324


Memory Systems: Distributed Memory

All memory is associated with processors.

To retrieve information from another processor’s memory a message must be sent over the network.

Advantages:
   Memory is scalable with the number of processors
   Each processor has rapid access to its own memory without interference or cache coherency problems
   Cost effective: can use commodity parts

Disadvantages:
   Programmer is responsible for many of the details of the communication
   May be difficult to map the data structure
   Non-uniform memory access (NUMA)

Stout and Jablonowski – p. 175/324


Memory Systems: Shared Memory

Global memory space, accessible by all processors

Memory space may be all real or may be virtual

Consistency maintained by hardware, software or user

Advantages:
   Global address space is user-friendly, algorithm may use global data structures efficiently
   Data sharing between tasks is fast

Disadvantages:
   Maybe lack of scalability between memory and CPUs. Adding more CPUs increases traffic on the shared memory - CPU path
   User is responsible for correct synchronization

Stout and Jablonowski – p. 176/324


Hybrid Memory Architecture

The shared memory component is usually a cache coherent (CC) SMP node with either uniform (CC-UMA) or non-uniform memory access (CC-NUMA)

CC: If one processor updates a variable in shared memory, all the other processors on the SMP node know about the update.

The distributed memory component is a cluster of multiple SMP nodes.

SMP nodes can only access their own memory, not the memory on other SMPs.

Network communication is required to move data from one SMP node to another.

Stout and Jablonowski – p. 177/324


Hybrid Memory Architecture

[Diagram: four SMP nodes, each with several CPUs sharing one memory, connected by a network. CPU: single-core or multi-core technology possible.]

Multi-core (dual- or quad-core) chips common, even in laptops

Typical: Several multi-core chips form an SMP node.

Stout and Jablonowski – p. 178/324


Multi-Cores and Many-Cores

General trend in processor development: multi-core to many-core with tens or even hundreds of cores

Advantages:
   Cost advantage.
   Proximity of multiple CPU cores on the same die, signal travels less, high CC clock rate.

Disadvantages:
   More difficult to manage thermally than lower-density single-chip design.
   Needs software (e.g. OS, commercial) support.
   Multi-cores share system bus and memory bandwidth: limits performance gain. E.g. if the single-core is bandwidth-limited, the dual core is only 30%-70% more efficient.

Stout and Jablonowski – p. 179/324


Dual Level Parallelism

Often: Applications have two natural levels of parallelism. Take advantage of it and exploit the shared memory parallelism by using OpenMP on an SMP node. Why?

MPI performance degrades when
   domains become too small
   message latency dominates computation
   parallelism is exhausted

OpenMP
   typically has lower latency
   can maintain speedup at finer granularity

Drawback:

Programmer must know MPI and OpenMP

Code might be harder to debug, analyze and maintain

Stout and Jablonowski – p. 180/324


Hybrid Programming Model

Combination of distributed and shared memory programming models, e.g.:
   MPI and OpenMP
   MPI and High Performance Fortran (HPF)
   MPI and POSIX Threads

Most important: MPI and OpenMP
   Many MPI processes
   Each MPI process is assigned to a different SMP node
   Explicit message passing between the nodes
   Shared memory parallelization within an SMP node
   Each MPI process is therefore a multithreaded OpenMP process
   Can give better scalability than pure MPI or OpenMP

Stout and Jablonowski – p. 181/324


Hybrid Programming Strategy

Decompose the computational domain
   Most often: Domain decomposition
   Alternatively: Functional decomposition

Distribute the partitions among the SMP nodes (coarse grain parallelism).

Use MPI to communicate the ghost regions or interfaces of each partition.

Add OpenMP for loop-level parallelism within a partition on the SMP node (fine grain parallelism).

Let one OpenMP thread speak for all.

Stout and Jablonowski – p. 182/324


Hybrid Programming Strategy

Recommended:
   Limit MPI communication to the serial OpenMP part (outside a parallel region)
   Let the master thread (serial OpenMP part) communicate via MPI messages.

A sketch of this strategy follows.
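A minimal hybrid sketch under these recommendations (halo-exchange details omitted; exchange_ghost_cells_mpi is an assumed helper called by the master thread outside the parallel region; u, unew, nlocal and nsteps are assumed to exist):

    do step = 1, nsteps
       call exchange_ghost_cells_mpi(u)     ! MPI communication in the serial OpenMP part
    !$OMP PARALLEL DO PRIVATE(i)
       do i = 1, nlocal
          unew(i) = 0.5*(u(i-1) + u(i+1))   ! fine-grain OpenMP work within the partition
       end do
    !$OMP END PARALLEL DO
       u(1:nlocal) = unew(1:nlocal)
    end do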

Stout and Jablonowski – p. 183/324


VECTOR PARALLEL COMPUTING

Principles behind vector parallel computing

Vector pipeline

Pipelining and modern scalar processors

Characteristics of vector computers

Load Balancing and Grid Partitioning Strategies

Graphics Processing Units (GPUs)

Stout and Jablonowski – p. 184/324


Vector Computers: Trend

What are the trends in high performance computing ?

Stout and Jablonowski – p. 185/324


Vector Computers: Trend

Worldwide: vector computers became less and less common over the last 15 years

In 2008: NEC and Cray remain in this market

Powerful vector architectures:
   41 TFlop/s NEC SX-6 system (peak performance): Earth Simulator (Japan, #49 TOP500 list in 6/2008, #20 in 6/2007, #1 from 2002-2004)
   NEC SX-9, theoretical peak performance 839 TFlop/s
   Cray XT5h (newest installation in Edinburgh), hybrid architecture with X2 vector processing node

Extreme sustained performance: Earth Simulator system reaches approx. 90% of its peak performance (Linpack benchmark)

Stout and Jablonowski – p. 186/324


Vector Processing - Pipelining Principle

Principle: Split an operation into independent parts & execute them concurrently in specialized pipelines

Example: Add pipeline

DO I = 1, 1000
   C(I) = A(I) + B(I)
ENDDO

Independent steps:
   compare and normalize exponents
   add mantissae
   normalize result
   error handling (overflow/underflow)

Stout and Jablonowski – p. 187/324


Vector Pipelines: Example (cont.)

[Figure: five array elements streaming through the four pipeline stages (compare exponents, add mantissae, normalize results, check errors); during the startup phase the pipeline fills, after which it delivers one result per clock cycle.]

Two phases:
   Startup phase (fill the pipeline)
   Streaming phase (1 result per clock cycle)

Stout and Jablonowski – p. 188/324


Vector Processing - Principles

SIMD principle: One instruction works on a data stream (vector).

Vector: A vector consists of
   data that lie consecutively in memory (ideal case)
   data with constant stride
   data with random access (gather & scatter operations)

Pitfall: Non-consecutive memory accesses can lead to memory bank conflicts and performance losses.

Stout and Jablonowski – p. 189/324


Principles (cont.)

Pipelining: The functional units are divided into independent segments which work simultaneously.

   Add pipeline
   Multiply pipeline
   Multifunctional pipeline, e.g. multiply and add
   Logic pipeline
   Load/Store pipeline
   Instruction pipeline

Stout and Jablonowski – p. 190/324


Pipelines and Modern Scalar Processors

The pipelining principle: basis for all vector machines and GPUs.

But pipelines are also used in modern scalar processors ⇒ speed up execution

Examples:

IBM Power6 CPU: Floating point units (FPU) which can issue a combined multiply/add

   a = b*c + c

   Multi-functional hardware unit
   In addition: data prefetch capabilities (“load pipeline”)

Stout and Jablonowski – p. 191/324


Vector processing - Hardware differences

Scalar:

[Diagram: scalar architecture, a CPU exchanging scalar data and addresses with memory.]

Stout and Jablonowski – p. 192/324


Vector processing - Hardware differences

Vector:

[Diagram: vector architecture, a CPU exchanging vector and scalar data and addresses with a memory divided into banks 0 . . . n−1, where bank = mod(address, n).]

Stout and Jablonowski – p. 193/324


Vector Processing - Features

The new hardware/software features are:

Vector unit: “co-processor” to scalar unit

Pipeline sets

Vector registers that provide data streams

Interleaving memory banks: quick memory access

(Often) no data cache for vector unit

Software & hardware interface: vector instructions

Vectorizing compiler

“Break Even Point” is hardware dependent (the vector length that lets the vector unit outperform the scalar unit)

Stout and Jablonowski – p. 194/324


Vector Processing - Features

The performance of the vector unit depends on the vector length (number of operations):

[Figure: performance as a function of the number of operations, rising and then leveling off, with n1/2 and n marked on the axis.]

In general: long vectors boost the performance

Startup time becomes negligible with increasing n

Stout and Jablonowski – p. 195/324


Load-Balancing & Grid Partitioning

Left: Fragmented 2-D grid partitioning, good for load-balancing, but short vectors.
Right: good vectorization (long vectors), but possibly bad load-balancing properties.

Stout and Jablonowski – p. 196/324


Load-Balancing & Grid Partitioning

The more processors run the simulation, the smaller the partitions and the smaller the vector length on each processor.

From a computational standpoint: the partitioning strategy on the left is well load-balanced (e.g. day/night sides in a weather model have different workloads and are well-distributed).

From a numerical performance standpoint: the distribution on the right is more efficient (longer vectors), but suffers from load imbalances.

⇒ In case of uneven workloads a balance must be found between long vectors and a fragmented load balancing strategy.

Stout and Jablonowski – p. 197/324


Parallel Vector Computing

Parallel vector computers are powerful for scientific applications:

   sustained performance can reach more than 30% of the peak performance
   compare: on MPP machines approx. 10-20% of the peak performance is reached (optimistic)

Single processor performance on vector machines is a multiple of any scalar processor.

Computations need a smaller number of parallel vector processors
   Advantageous if the application does not scale well to a large number of parallel CPUs

Stout and Jablonowski – p. 198/324


Parallel Vector Computing (cont.)

Parallel vector machines become most effective for large applications that require identical (arithmetic) instructions on streams of data.

The vector performance strongly depends on (see the sketch below):

Vector length: the longer, the more effective!

Data access: consecutive data access outperforms indirect addressing and data with constant stride.

Number of operations: the more arithmetic operations can be performed at once, the more effective the vector unit (enables chaining).
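A minimal Fortran sketch of the three access patterns, assuming the arrays x, y and the index array idx are declared and filled elsewhere:

   do i = 1, n
      y(i) = y(i) + a*x(i)              ! unit stride: ideal data stream for the vector unit
   enddo

   do i = 1, n
      y(i) = y(i) + a*x(1 + (i-1)*m)    ! constant stride m: usable, but slower memory streams
   enddo

   do i = 1, n
      y(i) = y(i) + a*x(idx(i))         ! indirect addressing (gather): hardest to vectorize efficiently
   enddo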

Stout and Jablonowski – p. 199/324


Graphics Processing Units (GPUs)

Newest trend in high-performance computing.

Traditionally: the GPU is a dedicated graphics rendering device for a personal computer, workstation or game console.

GPUs have a parallel many-core architecture, each core capable of running thousands of threads simultaneously; they exploit SIMD fine-grain parallelism.

The highly parallel structure makes them more effective than general-purpose CPUs for a range of complex (highly specialized) algorithms.

Stout and Jablonowski – p. 200/324


Graphics Processing Units

Trend: Highly diverse computing platforms can include multi-cores, SMP nodes, graphics accelerators or classical vector units as co-processors for both thread-based and process-based parallelism.

GPUs are cheap: commodity co-processors produced in the millions.

Very fast: the first 1 TFlop/s GPU came out in February 2008.

#1 computer on the Top 500 list: “Roadrunner” utilizes GPUs as accelerators: IBM’s Cell processor was originally designed for the Sony Playstation 3.

Stout and Jablonowski – p. 201/324



GPUs — Future?

Extremely difficult to use the hardware effectively.

For example: NVIDIA’s GeForce GPU series is programmed in CUDA (Compute Unified Device Architecture): a compiler and set of development tools (variation of C).

Big question: What is the lifetime of these systems? Is it worth investing in user software?

Need robust hardware: error trapping, IEEE compliance, hardware performance counters, circuit support for synchronizations.

Need robust compilers and programming standards.

Will it attract new sources of talent to supercomputing?

Stout and Jablonowski – p. 203/324


PARALLELIZATION II

Here we examine some of the more complicated aspects ofsuccessfully parallelizing large programs.

Stout and Jablonowski – p. 204/324


Problems Verifying Correctness

Proving parallel and serial programs equivalent is typically only possible if the parallelization is automated (such as by a parallelizing compiler).

Thus usually resort to testing on selected inputs.

Sensitivity & efficiency at discovering errors can be magnified by examining intermediate results, rather than just final results.

Stout and Jablonowski – p. 205/324


However, problems remain:

Coverage: Need to test all program options.

Time: Some conditions only appear after a significantamount of computation.

Detection: Often a simple “diff” won’t work; hard to differentiate between errors and roundoff caused by a changed order of arithmetic operations. Some users are uncomfortable with slight machine variations.

Stout and Jablonowski – p. 206/324


Some Solutions

Coverage: Typically requires coordination with an application expert, and careful analysis. There are tools to check coverage.

Time: Checkpoint/restart can help. Also very useful for long-term maintenance and for production runs, so that work is not lost if the system fails during a long run.

Detection: Use of IEEE arithmetic helps cross-platform comparisons. Also, by being careful one can ensure that the parallel and serial programs perform all calculations (such as summations) in the same order, but usually this lowers efficiency.

Stout and Jablonowski – p. 207/324


Performance Problems

Detailed profiling of the crash code showed that there were many places where efficiency was unacceptable.

Often cache utilization was very poor.

Load balance was difficult due to heterogeneous elements with time-varying requirements.

Contact adds dynamic computational and communication imbalance.

Some of the collective communication routines were too slow.

I/O was substantial, and was often inefficient.

Stout and Jablonowski – p. 208/324


Profiling

Profiling proceeded in stages, identifying where efficiency was too low.

For each targeted section, profiled uniprocessor performance, such as cache misses.

Also profiled load imbalance and communication overhead, proceeding from smaller systems to larger ones (when needed).

Incremental approaches kept the amount of data collected at manageable levels.

Unfortunately, when we were doing this there were no standard tools, we had to build several. The situation is much better now, discussed later.

Stout and Jablonowski – p. 209/324


Utilizing the Memory Hierarchy

Effective use of cache and locality is often critical for achieving high performance.

Often uniprocessor performance can be doubled by restructuring data structures and computations to exploit cache.

Unfortunately, many data structures and algorithms use pointers and indirect addressing, diminishing the ability of the compiler to optimize cache usage.

Later we’ll describe a data structure (adaptive blocks) that addressed this.

Stout and Jablonowski – p. 210/324


Cache Misses

Many programs have excessive loads and stores, causing cache misses which slow the program. These can often be reduced by rearranging the code and/or data structure.

For example, in Fortran

   do i=1,n                      do j=1,n
      do j=1,n                      do i=1,n
         A(i,j) = A(i,j)+1   vs        A(i,j) = A(i,j)+1
      enddo                         enddo
   enddo                         enddo

For large arrays, which is faster, and why?

Stout and Jablonowski – p. 211/324


Utilizing the Compiler

For a well-structured program it should be possible for the compiler to generate good code — optimizing cache utilization, reducing instruction counts, etc. However, extensive optimization is not the default. Thus

Turn on appropriate compiler optimization options.

Usually the “-O” option is important, but often others are needed as well. These affect data placement as well as code generation.

May need a guru to get the best combination of options for your program+machine combination.

Stout and Jablonowski – p. 212/324


LOAD-BALANCING REVISITED

We’ll continue the discussion of load-balancing, looking at some more complicated problems.

Stout and Jablonowski – p. 213/324


Loop Dependencies

Recall that if the value of variable B depends upon the value of variable A, then there is a dependency between A and B.

Loops often introduce real, or apparent, dependencies.

For example,

   do i=1,n
      V(i) = V(i) - 2*V(i-1)
   enddo

The loop cannot be vectorized nor parallelized, because each value depends upon the value from the previous iteration.

Stout and Jablonowski – p. 214/324


To parallelize

   do i=1,n
      V(i) = V(i) - 2*V(i+1)
   enddo

need to copy V and use the copy to compute the new values.

   W = V
   do i=1,n
      V(i) = W(i) - 2*W(i+1)
   enddo
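With the copy, the iterations are independent, so the loop can also be handed directly to OpenMP or to the vectorizer (a sketch, assuming V and W are dimensioned at least n+1 as in the example above):

   W = V
!$omp parallel do
   do i = 1, n
      V(i) = W(i) - 2*W(i+1)
   enddo
!$omp end parallel do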

Stout and Jablonowski – p. 215/324


To parallelize

   do i=1,n
      V(J(i)) = i
   enddo

need to know whether J is 1-1.

Some automatic parallelizers can handle the previous loop, but none can do this one without programmer assistance.

Stout and Jablonowski – p. 216/324


Time Troubles

Parallelization problems with time-like variables include:

Partial Differential Equations: Time is explicit.

Divide and Conquer: “Size” is similar to time. Subproblems may not be known in advance, and need to be generated in order.

Branch and Bound: Branching control is often serialized.

Discrete Event Simulation: Time is usually explicit, may be incremented adaptively, and subproblems are often not known in advance.

Depth-First Search: Search decisions are made sequentially. Theoretical computer science: some versions of DFS are P-complete, hence believed inherently sequential.

Stout and Jablonowski – p. 217/324


Not Everything is As Bad As It Seems

Some things look serial but can be easily parallelized.

Reduction

   x ← 0
   do i ← 0, n-1
      x ← x + a[i]
   enddo

Scan or Parallel Prefix

   y[0] ← a[0]
   do i ← 1, n-1
      y[i] ← y[i-1] + a[i]
   enddo

Stout and Jablonowski – p. 218/324


Parallelized Reduction Operations

Reduction and scan operations are extremely common.

They are recognized by parallelizing compilers and implemented in MPI and OpenMP (see the sketch below).

They can be parallelized by using associativity of the combining operator (+ in this case), i.e.,

a + (b + c) = (a + b) + c

In some situations one also uses commutativity, i.e., a + b = b + a
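For example, the summation above might be written as follows (a sketch; x, a, n, nlocal, local_sum, global_sum and ierr are assumed to be declared appropriately):

   ! OpenMP: each thread keeps a private partial sum, combined at the end
   x = 0.0
!$omp parallel do reduction(+:x)
   do i = 1, n
      x = x + a(i)
   enddo
!$omp end parallel do

   ! MPI: each process sums its own portion, the partial sums are then combined
   local_sum = sum(a(1:nlocal))
   call MPI_Reduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                   MPI_SUM, 0, MPI_COMM_WORLD, ierr)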

Stout and Jablonowski – p. 219/324


Calculation Tree

(figure: balanced binary calculation tree: a[0] ... a[7] are added pairwise, the four partial sums are combined into two, and those into the final sum)
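A sketch of the same tree written as a loop (assuming n values a(1), ..., a(n) with n a power of two):

   stride = 1
   do while (stride < n)
      do i = 1, n-stride, 2*stride     ! all additions on one level are independent
         a(i) = a(i) + a(i+stride)
      enddo
      stride = 2*stride
   enddo
   ! a(1) now holds the sum, after log2(n) levels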

Stout and Jablonowski – p. 220/324


Static Load Imbalance — Correlation

Suppose we have a digital image and need to determine the types of vegetation on the island. Easy load-balance:

    0  1  2  3
    4  5  6  7
    8  9 10 11
   12 13 14 15

(the image is divided into a 4 x 4 array of equal pieces, one per processor)

Stout and Jablonowski – p. 221/324


However ...

If a pixel is water we can quickly dismiss it, otherwise we need to carefully analyze the pixel and its neighbors.

Drat! We know the weights, but don’t know where the easy or hard pixels are until we’ve started processing the image.

Especially problematic because large regions will be of one type or the other. Thus some processors will take much longer than others.

Stout and Jablonowski – p. 222/324


Scattered Decomposition

Used when there is a structured domain space (e.g., an image) and the processing requirements are clustered, such as modeling a crash or processing an image with only a few items of interest.

Suppose there are P processors. Cover the problem domain with non-overlapping copies of a grid of size P and assign each processor a cell in each of the grids.
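For the 16-processor example above, the owner of pixel (i,j) could be computed as follows (a sketch, using 0-based pixel and processor numbering; nrows and ncols are the image dimensions):

   owner_scattered = mod(i,4)*4 + mod(j,4)             ! scattered: the 4 x 4 grid is tiled over the image
   owner_block     = (i/(nrows/4))*4 + (j/(ncols/4))   ! contrast: the block decomposition of the earlier figure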

Stout and Jablonowski – p. 223/324


Scattered Work

(figure: the problem domain covered by four non-overlapping copies of the 4 x 4 processor grid, so each of the 16 processors owns one cell in each copy)

Stout and Jablonowski – p. 224/324


How Much Scattering?

More pieces ⇒

⇓ load imbalance, i.e., ⇓ calculation time

⇑ overhead and/or communication time

Deciding a good tradeoff may require some timing measurements.

However, if nearby objects have uncorrelated computational requirements then this method is no better than standard decomposition, and adds overhead.

Stout and Jablonowski – p. 225/324


Overdecomposition

Scattered decomposition and its close relatives striping and round robin allocation are examples of a general principle:

Overdecomposition: break task into more pieces than processors, assign many pieces to each processor.

Overdecomposition underlies several load-balancing and parallel computing paradigms.

However, there can be difficulties when synchronization is involved.

Stout and Jablonowski – p. 226/324


The (Teaching) Value of Coins

Task times are random variables, where the time is generated by flipping a coin until a head appears.

Your task times:

Class task times:

Your total:

Class total:

Slowest person’s total:

Stout and Jablonowski – p. 227/324


Synchronization and Imbalance

Suppose we have p processors and n ≥ p tasks. Suppose tasks take time i with probability 2^(-i), and there is no way to tell in advance how long a task will take.

If each processor does 1 task and then waits for all processors to complete before going on to the next, the efficiency is low: the expected time of the slowest task grows with the log of the number of processors, so the efficiency decreases as processors are added.

To improve efficiency, each processor needs to complete several tasks before synchronizing.
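The numbers in the tables on the following slides can be estimated with a small Monte Carlo experiment; a sketch, using the coin-flip task times described above:

   program sync_efficiency
     implicit none
     integer, parameter :: p = 64, nrounds = 100000
     integer :: r, j, t, tmax
     real :: u
     double precision :: work, total_time
     work = 0.0d0
     total_time = 0.0d0
     do r = 1, nrounds
        tmax = 0
        do j = 1, p
           t = 1
           call random_number(u)
           do while (u < 0.5)              ! flip until a head: time i with probability 2**(-i)
              t = t + 1
              call random_number(u)
           enddo
           work = work + t
           tmax = max(tmax, t)
        enddo
        total_time = total_time + tmax     ! everyone waits for the slowest task
     enddo
     print *, 'estimated efficiency, 1 task/processor:', work/(p*total_time)
   end program sync_efficiency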

Stout and Jablonowski – p. 228/324


Geometric Task Times

No.       Efficiency with       Tasks/Proc. to achieve efficiency
Proc.     1 task per proc.           0.8          0.9
   4         0.57065                  10           46
  16         0.37193                  30          137
  64         0.27233                  53          243
 256         0.21423                  78          355
1024         0.17647                 103          468

Stout and Jablonowski – p. 229/324


Another Example

Tasks: 1 time unit with prob 0.9, 10 units with prob. 0.1

No.       Efficiency with       Tasks/Proc. to achieve efficiency
Proc.     1 task per proc.           0.8          0.9
   4         0.46397                  36          179
  16         0.22803                 112          536
  64         0.19020                 199          949
 256         0.19000                 291         1384
1024         0.19000                 385         1824

Stout and Jablonowski – p. 230/324


Note that one can keep the efficiency high by assigning many tasks per processor before synchronizing, but the number required grows with the number of processors.

Later we’ll see a technique to improve this situation.

Stout and Jablonowski – p. 231/324


Dynamic Data-Driven

For many data dependent problems dynamic versions also occur, such as

For PDEs an adaptive grid can be used instead of a fixed grid, allowing one to focus computations on regions of interest.

A simulation may track objects through a region.

Computational requirements of objects may change over time.

In such situations, some processors may become overloaded.

Stout and Jablonowski – p. 232/324


Must balance load and need to take locality of communication into account. Some options:

Locally adjust partitioning, such as moving a small region on the boundary of an overloaded processor to the processor containing the neighboring region.

Use a parallel rebalancing algorithm that takes current location into account (not standard).

Rerun the static load-balancing algorithm and redistribute work (ignores locality, but easier).

Warning: Need more complex data structures which can move pieces and keep track of neighbors, etc. These are difficult to program and debug.

Stout and Jablonowski – p. 233/324


Dynamic Graph Decomposition

One could rerun Metis at periodic intervals, or periodically measure some metric to determine if the processor loads are too uneven, and if so then call Metis.

However, it is more efficient to use the ParMetis package, which runs in parallel.

Stout and Jablonowski – p. 234/324


Example: Dynamic Geometry

Adaptive blocks, useful for adaptive mesh refinement (AMR) and dynamic geometric modeling. Grids are broken into blocks of fixed extents; when needed, blocks are refined into children with the same extents. [Stout 1997, MacNeice et al. 2000]

(figure: a block is refined into children, and children can be coarsened back into their parent)

Stout and Jablonowski – p. 235/324


Adaptive Block Properties

Whenever refine/coarsen occurs, must adjust pointers on all neighbors, no matter what processor they are on.

Using blocks, instead of cells, reduces the number of changes.

Same work per block, good work/communication ratio, so often just balancing blocks per processor suffices. If communication is excessive, use a space-filling curve.

In either case, rebalancing requires only simple collective communication operations to decide where blocks go.

Stout and Jablonowski – p. 236/324


Load-balancing Strategies

Example: Tracer transport problems with adaptive mesh refinement (AMR) techniques

Simple load-balancing algorithm:
   Equal workload regardless of the location of the data

Advanced load-balancing algorithms:
   Load-balancing with METIS
   Load-balancing with a Space Filling Curve (SFC)

⇒ In the examples:

Each color represents a processor.

The amount of work in each box is the same.

Stout and Jablonowski – p. 237/324


Simple Load-balancing Strategy

(figure: latitude-longitude map, latitude -90 to 90, longitude 0 to 360, showing the distribution of the grid over the processors)

Movie

Stout and Jablonowski – p. 238/324


Simple Load-balancing Strategy cont.

Data distribution at model day 3:

(figure: latitude-longitude map of the data distribution over the processors at model day 3)

Stout and Jablonowski – p. 239/324


Simple Load-balancing Strategy cont.

Data distribution at model day 12:

(figure: latitude-longitude map of the data distribution over the processors at model day 12)

Stout and Jablonowski – p. 240/324


Dynamic Load-balancing with METIS

Movie
Courtesy of Dr. Joern Behrens, Alfred-Wegener-Institute, Bremerhaven, Germany

Stout and Jablonowski – p. 241/324


Dynamic Load-balancing with SFC

Movie
Courtesy of Dr. Joern Behrens, Alfred-Wegener-Institute, Bremerhaven, Germany

Stout and Jablonowski – p. 242/324


Comparison of Strategies

Relative behavior is similar to the static load-balancing behavior. Very important that rebalance operations have low overhead since they will be done often.

Easiest strategy — just balance work/processor
   might be sufficient if the application is dominated by computation, but not if communication is important

Load-balancing with METIS or ParMETIS
   good load-balancing, decent comm. reduction, applicable to many problems

Load-balancing with Space Filling Curves
   for geometric problems usually the best choice

Stout and Jablonowski – p. 243/324


Dynamic, Data Driven, Min. Comm.

Sometimes work is created on the fly with little advance knowledge of the tasks.

E.g., branch-and-bound generates dynamic partial solution trees where subproblem communication consists of maintaining a current best solution and seeing if a subproblem is already solved.

In such situations one can maintain a queue of tasks (objects, subproblems) and assign them to processors as they finish previous tasks (e.g., overdecomposition).

Stout and Jablonowski – p. 244/324


Example: Work Preassigned

Each processor is assigned 4 tasks.

Processor    Task Label/Time          Total
    1        a/5  b/1  c/1  d/4         11
    2        e/1  f/4  g/2  h/1          8
    3        i/2  j/1  k/5  l/1          9
    4        m/1  n/3  o/1  p/1          6
    5        q/1  r/1  s/2  t/2          6
    6        u/3  v/4  w/2  x/3         12
                                  Max    12

Time required: 12.

Stout and Jablonowski – p. 245/324


Manager/Worker (Master/Slave) (prof/grad student)

(figure: a Manager process holds the Task Queue and assigns tasks to the Workers; each Worker reports “task done” and requests another)
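A compact sketch of this pattern in Fortran with MPI (assuming one number per task and per result; ntasks and the actual computation are placeholders):

   program manager_worker
     use mpi
     implicit none
     integer, parameter :: ntasks = 1000, WORKTAG = 1, STOPTAG = 2
     integer :: rank, nproc, ierr, nstopped, next, status(MPI_STATUS_SIZE)
     double precision :: task, result

     call MPI_Init(ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
     call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

     if (rank == 0) then                      ! manager: owns the task queue
        next = 0
        nstopped = 0
        do while (nstopped < nproc-1)         ! reply to every request until all workers are stopped
           call MPI_Recv(result, 1, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE, &
                         MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
           if (next < ntasks) then
              next = next + 1
              task = next
              call MPI_Send(task, 1, MPI_DOUBLE_PRECISION, status(MPI_SOURCE), &
                            WORKTAG, MPI_COMM_WORLD, ierr)
           else
              call MPI_Send(task, 1, MPI_DOUBLE_PRECISION, status(MPI_SOURCE), &
                            STOPTAG, MPI_COMM_WORLD, ierr)
              nstopped = nstopped + 1
           endif
        enddo
     else                                     ! worker: request, compute, repeat
        result = 0.0d0
        call MPI_Send(result, 1, MPI_DOUBLE_PRECISION, 0, WORKTAG, MPI_COMM_WORLD, ierr)
        do
           call MPI_Recv(task, 1, MPI_DOUBLE_PRECISION, 0, MPI_ANY_TAG, &
                         MPI_COMM_WORLD, status, ierr)
           if (status(MPI_TAG) == STOPTAG) exit
           result = task*task                 ! stand-in for the real computation
           call MPI_Send(result, 1, MPI_DOUBLE_PRECISION, 0, WORKTAG, MPI_COMM_WORLD, ierr)
        enddo
     endif
     call MPI_Finalize(ierr)
   end program manager_worker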

Stout and Jablonowski – p. 246/324


Work Assigned via Queue

Assign tasks a, b, c, ... to processors as the processor becomes available:

Processor    Time / task assigned
              1  2  3  4  5  6  7  8  9  10
    1         a  a  a  a  a  r  v  v  v  v
    2         b  g  g  k  k  k  k  k
    3         c  h  j  l  n  n  n  w  w
    4         d  d  d  d  o  s  s  x  x  x
    5         e  i  i  m  p  t  t
    6         f  f  f  f  q  u  u  u

Time: 10. Adaptive allocation can improve performance.

Stout and Jablonowski – p. 247/324


Work Assigned via Ordered Queue

Sort in decreasing order, assign to processors as they become available: a k d f v n u x g i s t w b c e h j l m o p q r

Processor    Time / task assigned
              1  2  3  4  5  6  7  8  9
    1         a  a  a  a  a  s  s  e  o
    2         k  k  k  k  k  t  t  h  p
    3         d  d  d  d  x  x  x  j  q
    4         f  f  f  f  g  g  w  w  r
    5         v  v  v  v  i  i  b  l
    6         n  n  n  u  u  u  c  m

Time: 9. The more you know, the better you can do. Unfortunately, one rarely has this information.

Stout and Jablonowski – p. 248/324


Queueing Costs

Single-queue multiple-servers (manager/workers) is the most efficient queue structure (e.g., airline check-in lines).

However, queuing imposes communication overhead; yet another tradeoff, now the cost of moving a task versus the cost of solving it where it is generated.

Parallel computing has too many “however”s!

However, if it was too easy, you wouldn’t need this tutorial

Stout and Jablonowski – p. 249/324


Queueing Bottleneck

Sometimes the manager is a bottleneck. Can ameliorate:

“Chunk” tasks to reduce overhead. May use large chunks initially, then decrease them near the end to fine-tune the load balance.

Use distributed queues, perhaps with
   multiple manager/worker subteams, with some communication between managers
   every worker is also a manager, keeping some tasks and sending extras to others. Many variations on deciding when/where to send work.

Stout and Jablonowski – p. 250/324


OpenMP Load-Balancing

The previous descriptions had a distributed memory flavor, though they also work well for shared memory.

However, shared memory has additional options. OpenMP loop work-sharing constructs require little programmer effort. With the SCHEDULE option one can specify (see the sketch after this list):

STATIC: simple, suitable if loop iterations take the same amount of time and there are enough per processor. For scattered decomposition, specify the chunk size.

DYNAMIC: a queue of work, each processor gets chunksize iterations when ready.

GUIDED: dynamic queue with chunks of exponentially decreasing size.
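A small sketch of the clause (process_item is a hypothetical routine whose cost varies from iteration to iteration):

!$omp parallel do schedule(dynamic, 4)     ! or schedule(static, 4), or schedule(guided)
   do i = 1, n
      call process_item(i)
   enddo
!$omp end parallel do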

Stout and Jablonowski – p. 251/324


Load-Balancing Summary

Load-balancing is critical for high performance.

Depending on the application, it can range from trivial to nearly impossible. A wide range of approaches are needed, and new ones are constantly being developed.

Load-balancing needs to be approached as part of a systematic effort to improve performance.

Try simple approaches first.

Stout and Jablonowski – p. 252/324


DATA INTENSIVE COMPUTING

Databases are an important commercial application of parallel computers, providing a base which helps keep commercial parallel computing viable.

Massive data collections are becoming important in scientific fields such as bioinformatics, astronomy, physics, . . .

Many of the ideas are used elsewhere, though sometimes obscured by different terminology. We’ll just briefly examine some aspects.

Stout and Jablonowski – p. 253/324


Application Areas

Web browsing

Real-time applications: air traffic, stock trading, streaming multimedia

Data Warehouse: organize massive amounts of commercial, scientific data
   CERN Large Hadron Collider: ≈ 30 TB/day, ≈ 10 PB/year

Data Mining: extract useful information from vast collections of text, photographs, web pages, etc.

Stout and Jablonowski – p. 254/324


Some Terminology

Often data intensive systems use terminology that is somewhat different, though often the ideas are similar to ones already touched on. Some examples:

skew: load imbalance
scaleup: speedup
transactions per second (TPS): throughput

TPS is often used to measure performance, instead of flops

Stout and Jablonowski – p. 255/324


Characteristics

Disk access and bandwidth dominate performance. Organizing the information to match the access patterns is often critical.

Systems for scientific applications are somewhat newer, complicated by factors such as being dispersed among sites, people trying to combine or mine information in new ways, billions of files (e.g., a constant stream of images), etc.

Sample science collections include the Large Hadron Collider, Digital Sky, Earth Observation System. Many provide specialized tools to access the information.

Stout and Jablonowski – p. 256/324


Parallel Disk Architectures

Shared Everything (SE): All disks are directly accessible from all processors and all memory is shared, i.e., a standard shared memory system.

Shared Nothing (SN): Each disk is connected to a single processor or SMP, each has its own private memory. Most common option in clusters.

Shared Disks (SD): Any processor can access any disk, but each processor has its own private memory, e.g., storage networks.

Stout and Jablonowski – p. 257/324


Shared Everything

(figure: processors P1, P2, P3, ..., Pn connected by an interconnection network to a global shared memory and shared disks)

Stout and Jablonowski – p. 258/324


Shared Nothing

(figure: processors P1, P2, P3, ..., Pn, each with its own private memory and private disk, connected by an interconnection network)

Stout and Jablonowski – p. 259/324


Shared Disk

(figure: processors P1, P2, P3, ..., Pn, each with private memory, all accessing shared disks through an interconnection network)

Stout and Jablonowski – p. 260/324


Data Partitioning Strategies

Range Partitioning (block allocation): Easy to locate records, related data can be clustered, but danger of skew.

(figure: consecutive key ranges mapped to consecutive disks)

Stout and Jablonowski – p. 261/324


Data Partitioning continued

Round Robin (cyclic, striping): Allows parallelism in accessing consecutive records, but ties up many disks if different programs are running on the system.

(figure: consecutive keys striped cyclically across the disks)

Stout and Jablonowski – p. 262/324


Data Partitioning continued

Hashing: Avoids systematic bottlenecks, allows for an expanding collection of keys (such as names), but complicates range queries.

(figure: keys scattered across the disks by a hash function)
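The three strategies amount to three ways of mapping a key to a disk; a sketch, assuming integer keys 1..nkeys and disks 1..ndisks (the hash function is only illustrative):

   idisk_range  = (key-1)*ndisks/nkeys + 1       ! range partitioning: consecutive keys on the same disk
   idisk_cyclic = mod(key-1, ndisks) + 1         ! round robin: consecutive keys on consecutive disks
   idisk_hash   = mod(37*key + 11, ndisks) + 1   ! hashing: keys scattered pseudo-randomly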

Stout and Jablonowski – p. 263/324


Data Partitioning Parallels

Block allocation and round robin allocation are used in dividing loops in OpenMP.

Round robin allocation is used in the memory systems of vector machines.

Block allocation is used in the memory systems of commodity processors.

Hashed allocation used in memory system of Cydrome.

Stout and Jablonowski – p. 264/324


Data Mining

Sifting for information in a torrent of data is economically and scientifically important.

AT&T, WalMart, American Express, . . . have used it for many years. Bioinformatics is an important new application area.

Many commercial data mining tools, often parallelized.

Warning: “data mining” means many different things to different people and applications.

Stout and Jablonowski – p. 265/324


Map-Reduce: New Form of Data Mining

Variations used by Google, Yahoo, IBM, etc.
Open source Hadoop: http://hadoop.apache.org/core/

Companies are trying to get schools to teach this style of programming.

Basic database operations, extended to less organized, far larger, systems.

Simple example: given records of (source page, link), for every company find the # of pages from outside the company that point to one of the company’s pages.

Stout and Jablonowski – p. 266/324


Map: determine if a link record is from a page outside a company into it. If so, generate a new record (destination company, 1)

   embarrassingly parallel, vast number of records, I/O bound

Reduce: combine records by company and sum the counts

   requires communication, but far fewer records

Implementations: significant emphasis on locality, efficiency, fault tolerance

Stout and Jablonowski – p. 267/324


Sample Map-Reduce Execution

Source: http://code.google.com/edu/parallel/mapreduce-tutorial.html

Stout and Jablonowski – p. 268/324


PERFORMANCE

Developing large-scale scientific or commercial applications that make optimum use of the computational resources is a challenge.

Resources can easily be underutilized or used inefficiently.

The factors that determine the program’s performance are often hidden from the developer.

Performance analysis tools are essential to optimizing the serial or parallel application.

Typically measured in “floating point operations per second”, like Mflop/s, Gflop/s or Tflop/s.

Stout and Jablonowski – p. 269/324


CPU Performance Measures

Performance is compared via benchmarks like LINPACK

more relevant: benchmarks with user application

most often on scalar machines: cache-optimized programs reach ≈ 10% of the peak performance

Example: Weather prediction code IFS (ECMWF)

Stout and Jablonowski – p. 270/324


Application-System Interplay

System factors:
   Chip architecture (e.g. # floating point units per CPU)
   Memory hierarchy (register - cache - main memory - disk)
   I/O configuration
   Compiler
   Operating System
   Connecting network between processors

Stout and Jablonowski – p. 271/324


Application-System Interplay

Application factors:
   Programming language
   Algorithms and implementation
   Data structures
   Memory management
   Libraries (e.g. math libraries)
   Size and nature of data set
   Compiler optimization flags
   Use of I/O
   Message passing library / OpenMP
   Communication pattern
   Task granularity
   Load balancing

Stout and Jablonowski – p. 272/324


Performance Gains: Hardware

Factor ≈ 10^4 over the last 15 years

Stout and Jablonowski – p. 273/324


Performance Gains: Software

Gains expected from better algorithms, example:

(figure: speedup factor, from 10^0 to about 10^5, gained from better linear algebra algorithms between 1970 and 2000: Sparse Gaussian Elimination, Gauss-Seidel, Successive Over-Relaxation, Conjugate Gradient, Multi-Grid)

Gains also expected from better load-balancing strategies, parallel I/O, etc.

Stout and Jablonowski – p. 274/324


Parallel Performance Analysis

Reliable performance analyses are the key to improving the performance of a parallelized program.

They reveal not only typical bottleneck situations but also determine the hotspots.

Key question: How efficient is the parallel code?

Important to consider: Time spent
   communicating with other processors
   waiting for a message to be received
   wasted waiting for other processors

When selecting a performance tool consider:

How accurate is the technique?
Is the tool simple to use?
How intrusive is the tool?

Stout and Jablonowski – p. 275/324


Parallel and Serial Performance Analysis

Goal: reduce the program’s wallclock execution time

Practical, iterative approach:

measure the code with a hardware performance monitor and profiler

analyze hotspots

optimize and parallelize hotspots and eliminate bottlenecks

evaluate performance results and improve optimization / parallelization

Analysis techniques:

Timing

Counting

Profiling

Tracing

Stout and Jablonowski – p. 276/324


Timing of Parallel Programs

MPI / OpenMP provide the compiler-independent timing functions for the wallclock time

MPI_Wtime / OMP_GET_WTIME

Requires source code changes: instrument the program

Typical sequence (MPI program):
   double precision :: t1, t2, seconds
   t1 = MPI_Wtime()
   ... code to be timed ...
   t2 = MPI_Wtime()
   seconds = t2 - t1    ! wallclock time

Evaluation of parallel speedup: Always measure the wallclock time. Measuring the CPU time would neglect the system overhead for the parallelization!

Stout and Jablonowski – p. 277/324


Hardware Performance Monitors (HPM)

Hardware counters gather performance-relevant events of the microprocessor without affecting the performance of the analyzed program. Two classes:

Processor monitor:

   non-intrusive counts
   consists of a group of special purpose registers
   registers keep track of events during runtime: general and floating point instructions, cache misses, branch mispredictions
   measures the Mflop/s rate fairly accurately

System level monitor (bus and network monitor):

   bus monitor: memory traffic, cache coherency
   network monitor: records network traffic

Stout and Jablonowski – p. 278/324


PAPI: The Portable Performance API

mature public-domain Hardware Performance Monitor

version PAPI 3.6.1 released in 8/2008

vendor independent hardware counter tool

supports most current processors including the “Cell” processor

user needs to instrument the code ⇒ PAPI functions

Fortran and C/C++ user interfaces

easy-to-use and powerful high level API

Home page: http://icl.cs.utk.edu/papi/index.html

Stout and Jablonowski – p. 279/324


Profiling of Parallel Programs

simplest tool: UNIX profiler gprof

interrupts program execution at constant time intervals
   counts the interruptions
   the more interruptions, the more time spent in this part of the code
   the sum over all processors is displayed

Profilers identify hotspots, but are of limited use for parallel code:

they measure CPU time, not wallclock time

they sum over all invocations of each routine

profilers cannot show load imbalance

Stout and Jablonowski – p. 280/324


Profiling: Graphical User Interfaces

Commercial: allinea opt (http://www.allinea.com), an optimization and profiling tool for multiple hardware platforms.

IBM AIX systems (built-in): xprofiler

   Graphical user interface based upon the gprof profiling utility.
   Displays: timing and call graph profile, summary charts, source code displays, library clusters.
   Filtering and zooming features allow focusing the displays on portions of the call tree.

Public domain Tuning and Analysis Tool TAU: http://www.cs.uoregon.edu/research/tau

Stout and Jablonowski – p. 281/324


GUI Example Xprofiler

Portions of the program which accumulate the most “ticks” (interrupts) reflect the areas where the program spends the most time.

Stout and Jablonowski – p. 282/324


Profiling: Pitfalls

Due to the periodic sampling of the program counter, the output might be slightly different when the same program is profiled multiple times.

Measure the code over a representative time interval using typical data sets. Sampling should last at least several minutes.

Optimizing compiler flags are allowed: expect a different profile when using the -O option; try with and without optimization.

Different hardware / different compilers might lead to different profiles.

But: the most time consuming functions should be detected in any case, maybe in a different order.

Stout and Jablonowski – p. 283/324


MPI and OpenMP Trace Tools

Collect trace data at run time, display post-mortem

Assess performance, bottlenecks and load-balancing problems in MPI & OpenMP codes

Intel’s trace visualization tool Trace Analyzer & Collector (only on Intel platforms)

Vampir and Vampirtrace (platform independent)

Trace analyzer developed and supported by the Center for Information Services and High Performance Computing, Dresden, Germany (http://vampir.eu)

Free evaluation keys for both available online.

Stout and Jablonowski – p. 284/324

Page 299: Quentin F. Stout Christiane Jablonowski · ATM networks, digital multimedia Parallel computers can be the cheapest or easiest way to achieve a specific computational goal at a given

Trace Analyzer & Collector / Vampir

Trace Analyzer / Vampir graphical user interface helps

understand the application behavior

evaluate load balancing

show barriers, locks, synchronization

analyze the performance of subroutines/code blocks

learn about communication and performance

identify communication hotspots

Trace Collector / Vampirtrace

Libraries that trace MPI and application events, generate a trace file (files can become big!)

Convenient: Re-link your code and run it

Provides an API for more detailed analyses

Stout and Jablonowski – p. 285/324


Graphical User Interface

Trace Analyzer / Vampir provides graphical displays that visualize important aspects of the runtime behavior:

detailed timeline view of events and communication

statistical analysis of program execution

statistical analysis of communication operations

dynamic calling tree and source-code display

I/O statistics

Trace Analyzer / Vampir

provides powerful zooming and filtering features

can display source code references if recorded

Vampir supported on almost all HPC platforms

Stout and Jablonowski – p. 286/324


Vampir Analysis – Global Timeline

(screenshot)

here: uninstrumented version of the program

therefore: the routines of the user code cannot be distinguished and are displayed as “Application”

Stout and Jablonowski – p. 287/324


Vampir Analysis – Zoom-in Timeline

(screenshot)

Zoom-in: ⇒ Communication and synchronization

Stout and Jablonowski – p. 288/324


Vampir Analysis – Activity Chart

(screenshot)

Global activity chart ⇒ Load-imbalance

Stout and Jablonowski – p. 289/324


Vampir Analysis – Summary Chart

(screenshot)

Summary for the whole application: timing data

Stout and Jablonowski – p. 290/324


Vampir Analysis – MPI Summary

(screenshot)

Stout and Jablonowski – p. 291/324


Public Domain Trace Tools

Jumpshot-4 (http://www-unix.mcs.anl.gov/perfvis/)

Graphical displays of timelines, histograms, MPI overhead and more

Instant zoom in/out, search/scan facility

TAU – Tuning and Analysis Utilities (version 2.17.1)

Developed at the University of Oregon, mature

Free, portable, open-source profiling/tracing facility (http://www.cs.uoregon.edu/research/tau)

Performance instrumentation, measurement and analysis toolkit for distributed and shared memory applications (includes MPI, OpenMP)

Graphical displays for all or individual processes

Manual or automatic source code instrumentation

Stout and Jablonowski – p. 292/324


Performance Analysis: Strategy

Hardware counters provide information on Mflop/srates, do you need to optimize?

Use profilers to identify hotspots

Focus the analysis/optimization efforts on the hotspots

Analyze trace information: gives a detailed overview of the parallel performance and load balance, and reveals bottlenecks

Two different modes: uninstrumented, or instrumented (requires source code changes)
⇒ Pitfall: instrumentation can lead to huge trace files
Recommendation: instrument only the hotspots for a detailed view of the run-time behavior (see the timing sketch below)
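As a first, tool-free step you can bracket a suspected hotspot with wall-clock timers. The sketch below is a minimal example using MPI_Wtime; the compute_step routine and its loop bound are invented stand-ins for your real hotspot.

#include <stdio.h>
#include <mpi.h>

/* Stand-in for the real hotspot; replace with your own routine. */
static void compute_step(void)
{
    double s = 0.0;
    for (long i = 1; i <= 10000000L; i++)
        s += 1.0 / (double) i;
    if (s < 0.0) printf("unreachable\n");   /* keeps the loop from being optimized away */
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();                /* wall-clock time before the hotspot */
    compute_step();
    double t1 = MPI_Wtime();                /* wall-clock time after the hotspot  */

    printf("rank %d: compute_step took %f seconds\n", rank, t1 - t0);
    MPI_Finalize();
    return 0;
}

Comparing these per-rank times across processes also gives a quick, rough view of load imbalance before turning to Vampir or TAU.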

Stout and Jablonowski – p. 293/324


Debugging of Parallel Programs

Increased parallel complexity makes the debugging process more difficult.

The traditional sequential debugging technique is a cyclic approach in which the program is repeatedly stopped at breakpoints and then continued or re-executed.

This conventional style of debugging is sometimes difficult with parallel programs: they do not always show reproducible behavior, e.g. in the presence of a race condition (see the sketch after this list).

Always: turn on compiler debugging options like array-bound checks

Most powerful commercial debuggers:

TotalView (http://www.totalviewtech.com )

Allinea DDT (http://www.allinea.com)

Stout and Jablonowski – p. 294/324
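To see why parallel bugs may not be reproducible, here is a minimal OpenMP sketch of a race condition (a deliberately broken example, not taken from the slides): every thread updates sum without synchronization, so the printed value can differ from run to run; the reduction clause shown in the comment is one way to fix it.

#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;

    /* RACE: all threads update sum concurrently without synchronization,
       so the result is wrong and may change on every run.               */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        sum += 1.0;

    /* FIX: declare the accumulation as a reduction instead:
       #pragma omp parallel for reduction(+:sum)                         */

    printf("sum = %f (expected %d)\n", sum, n);
    return 0;
}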


Characteristics of TotalView

Very powerful and mature debugger, current version 8.6

Source-level, graphical debugger for C, C++, Fortran, High Performance Fortran (HPF) and assembler code

Multiprocess (MPI) and multithread (OpenMP) codes

Supports multi-platform applications

Intuitive, easy-to-learn graphical interface

Industry leader in MPI and OpenMP debugging

Control functions to run, step, breakpoint, interrupt or restart a process

Ability to control all parallel processes coherently

Good tutorial on TotalView with parallel debugging tips: http://www.llnl.gov/computing/tutorials/totalview/

Stout and Jablonowski – p. 295/324


TotalView: The Process Window

The Process Window consists of 5 panes, which let you:

zoom into code or variables

visualize variables

filter, sort or slice data

set breakpoints

scan parallel processes

step through the execution

Stout and Jablonowski – p. 296/324


TotalView: Message Queue Graph

Graphical representation of the message queue state ⇒ Red = Unexpected, Blue = Receive, Green = Send

Stout and Jablonowski – p. 297/324


Boost the Performance: Practical Tips

Turn on compiler optimization flags

Search for better algorithms and data structures

For scientific codes: use optimized math libraries

Tune the program (see the sketch after this list):
data locality and cache re-use within loops
avoid divisions, indirect addressing and IF statements, especially in loops
loop unrolling and function inlining (often a compiler option), minimize/optimize I/O, ...

Load-balance the code

Avoid synchronization/barriers whenever possible

Optimize partitioning to minimize communication

Identify inhibitors to parallelism: data dependencies, I/O

Stout and Jablonowski – p. 298/324
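The loop-tuning advice above can be made concrete with a small sketch (hypothetical array size, illustrative only): the second routine hoists an invariant division out of the loop and walks the matrix row by row, improving cache re-use.

#define N 1024

/* Before: a division inside the loop and a column-by-column walk
   through a row-major C array (poor cache re-use).                */
void scale_before(double a[N][N], double divisor)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = a[i][j] / divisor;
}

/* After: divide once outside the loop, multiply inside it, and
   traverse the array row by row so accesses are contiguous.       */
void scale_after(double a[N][N], double divisor)
{
    double inv = 1.0 / divisor;          /* hoisted out of the loop */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] *= inv;
}

Whether such hand tuning pays off depends on the compiler and hardware, so measure before and after.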


Parallel Scientific Math Libraries

Parallel math libraries are available on most hardware platforms. Highly optimized and recommended.

ScaLAPACK (Scalable LAPACK):
Public-domain, high-performance linear algebra routines for MPI applications
Promotes modularity via interfaces to the libraries BLAS, BLACS and PBLAS

NAG Parallel Libraries (commercial, often installed):
Mostly high-speed linear algebra routines
In addition: random number generation and quadrature routines

PETSc (Portable, Extensible Toolkit for Scientific Computation):

Designed with MPI for partial differential equations

Stout and Jablonowski – p. 299/324


Toolkits for Scientific Computing

ACTS toolkit (Advanced CompuTational Software, http://acts.nersc.gov):

Public domain tools mostly developed at US labs

Collection of interoperable tools with APIs

General solutions to complex programming needs

Includes:
Numerical solvers: PETSc, ScaLAPACK, Aztec, ...
Structural frameworks: software that manages data & communication, like Overture and Global Arrays
Runtime & support tools: CUMULVS, TAU

Eclipse: Parallel Tools Platform (PTP)

open-source project: wide variety of parallel tools

http://www.eclipse.org/ptp/

Stout and Jablonowski – p. 300/324


USING PARALLEL SYSTEMS

In addition to programming, there are many issues concerning the use of parallel systems.

For example, they are often a centralized resource that must be shared, much like mainframes of olden days.

Your institution may decide to purchase a system, or buy time elsewhere.

Stout and Jablonowski – p. 301/324


Batch Queuing

A return to 1960s-style computer usage.

Large parallel systems use batch queuing; they may allow small interactive jobs for debugging.

If there are multiple queues, learn how they are structured and serviced: it’s you vs. them.

If you submit several jobs at once, you may be your own bottleneck. You might improve throughput by requesting fewer processors, and more time, per job (remember Amdahl’s Law).

Stout and Jablonowski – p. 302/324


Access to Systems

Academics: can apply for free time at NSF supercomputing centers, or perhaps at your own university. For modest amounts of time the NSF process is easy and quick, but thousands of hours require a more detailed application. You need to show that you can effectively utilize the machines (e.g., speedup curves) and are doing good research.

Grants from other agencies usually include access to their large systems.

Businesses: can purchase time from hardware vendors, sometimes from university centers.

Stout and Jablonowski – p. 303/324


Purchasing Systems

Buying systems is very complicated. Some questions:

Can it run your major applications? This may depend on ISVs (independent software vendors).

Will the vendor be around in five years?

Is there an upgrade path if you need to expand soon?

Can you get (and afford) tools, compilers, libraries for developing new applications?

Is the system reliable? Is the maintenance policy acceptable?

Do you have sufficient power and air conditioning?

Stout and Jablonowski – p. 304/324


What to Buy?

How much of the budget goes to processors vs. memory vs. communication?

Do you want more, or faster, processors, i.e., price-performance or performance?

You need to understand your major applications, and delivered versus peak performance.

Stout and Jablonowski – p. 305/324


Where are You on the Curve?

As a user or buyer:

[Figure: sketched curves of speedup vs. number of processors, and price/performance vs. performance]

Stout and Jablonowski – p. 306/324


Cluster Systems

Some groups build their own; resources are available to help.

However, many users just look at machine cost.

Typically, total costs are at least twice the initial costs.

Maintenance:

Many little things, hardware and software, go wrong or need upgrading: who will keep fixing this?

Who does backups?

Maintenance is time-consuming and can be harmful to your career.

Stout and Jablonowski – p. 307/324


WRAP UP

We’ll review some of the material covered, discuss some general problems with parallel computing, and point out some trends in the area.

Stout and Jablonowski – p. 308/324


Trends in Parallel Computing

It’s useful to have a sense of where parallel computing is going.

Stout and Jablonowski – p. 309/324


Trend: Power Critical

For a given technology, typically power ≈ speed² (a rough worked example follows below)

Heat limits density

Power In = Heat Out, so AC demands also increase

System speed requires closely packed components: systems such as BlueGene make a tradeoff, accepting a slower clock speed and smaller RAM in exchange for greater density and more total processing power.

This tradeoff is the opposite of what programmers want; recall Amdahl’s Law.
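A rough worked example of the power ≈ speed² rule of thumb (illustrative numbers only): halving the clock speed cuts a core's power to about (1/2)² = 1/4. Two half-speed cores then deliver the same nominal throughput for roughly half the power, but only if the application parallelizes almost perfectly; by Amdahl's Law, any serial fraction erodes that gain.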

Stout and Jablonowski – p. 310/324


Power Trend

Stout and Jablonowski – p. 311/324


Trend: Chip Density Still Increasing

Hardware designers are running out of old tricks, so they just replicate processors on chips: multi-core, many-core.

While potential chip performance continues to increase, the number of I/O wires per chip doesn’t keep pace with the number of cores, putting more stress on cache locality.

GPUs (and the IBM Cell) have a large number of simple processors and very high FLOPs, but need locality for efficient vector-like operations.

Stout and Jablonowski – p. 312/324


Chip Trends

Sources: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

Stout and Jablonowski – p. 313/324


Good Optimization still Bleeding Edge

Economics pushes for using commodity parts, especially since they have high potential. Unfortunately:

No useful GPU programming standards
Multicores differ on the caching provided
No good way to optimize for both GPU and multicore; portable optimization not yet attained

Need better compilers to exploit parallelism (e.g., much smarter OpenMP compilers)

Need better ways of expressing parallelism (ask DARPA, Intel, Microsoft, etc.!)

Stout and Jablonowski – p. 314/324


More Trends

Roadrunner grabs the headlines, but clusters and SMPs are the most important economically; “commodity” parts now include chips, boards, blades, ...

Increasing use of commercial parallelized software

Some parallel computing companies will fail.

Stout and Jablonowski – p. 315/324


Should You Parallelize?

Parallel programming is difficult; is it worthwhile? Pancake [1996] suggests first determining:

How often is the program used between changes?

How much time does it take (or is expected to take)?

How satisfied are users with current results?
Need more resolution
Need results faster
Will be flooded with data, ...

Stout and Jablonowski – p. 316/324


Degrees of Difficulty

Some problems are much easier to parallelize than others. Classes of problems range from:

Embarrassingly parallel: separate jobs with no interaction, easy to run on any system.

Static: important load-balancing parameters, such as size, are known in advance. Often the same configuration is run many times.

Data-dependent / dynamic: often quite difficult to achieve an efficient implementation.

Stout and Jablonowski – p. 317/324


Review: Software Engineering

Standard languages and libraries (e.g., MPI, OpenMP) and tools reduce the learning curve and preserve your investment.

Start with an overview of the data structures & time requirements; do profiling as needed.

Prioritize sections to be parallelized, and adapt as you learn.

Parallelize at the outermost loop possible (see the OpenMP sketch below)

Proceed incrementally, constantly verifying correctness
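A minimal OpenMP sketch of parallelizing at the outermost loop (hypothetical matrix size): the directive is placed on the outer i loop, so each thread works on whole rows and the parallel region is entered only once, instead of paying the fork/join overhead on every iteration of an inner loop.

#define N 2000

/* c = a + b for N x N matrices, parallelized over the outermost loop. */
void matrix_add(double a[N][N], double b[N][N], double c[N][N])
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)          /* outer loop: rows are split among threads   */
        for (int j = 0; j < N; j++)      /* inner loop stays serial within each thread */
            c[i][j] = a[i][j] + b[i][j];
}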

Stout and Jablonowski – p. 318/324


Review: Efficiency

Reduce communication costs:
maximize data locality
eliminate false sharing in shared memory systems
combine messages to reduce overhead and synchronization
send data (distributed memory) or write data (shared memory) early, receive or read late (see the MPI sketch after this list)

Reduce load imbalance and synchronization.

Utilize compiler optimizations, optimized routines, etc.
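A minimal sketch of "send early, receive late" with non-blocking MPI (the neighbor ranks, buffer size, and update loop are invented for illustration): the receive and send are posted before the local computation and completed only afterwards, so communication can overlap with useful work.

#include <mpi.h>

#define NB 1000                          /* halo size, invented for this sketch */

void exchange_and_compute(double *u, int n, int left, int right)
{
    double send_buf[NB], recv_buf[NB];
    MPI_Request reqs[2];

    /* (fill send_buf with the boundary data of u here; omitted) */

    /* Post the communication early ... */
    MPI_Irecv(recv_buf, NB, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, NB, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... overlap it with local work that does not need the halo ... */
    for (int i = 1; i < n - 1; i++)
        u[i] = 0.5 * (u[i - 1] + u[i + 1]);

    /* ... and complete the communication as late as possible. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}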

Stout and Jablonowski – p. 319/324


If It Isn’t Working Well . . .

The original program probably wasn’t written with parallelism in mind

See if there is a more parallelizable approach

Sometimes parallelizable approaches aren’t the most efficient ones available for serial computers, but that is OK if you are going to use many processors.

Remember Amdahl’s Law:

Efficient massive parallelism is difficult.

Stout and Jablonowski – p. 320/324


Finally • • •

Make sure your goals are realistic, and remember that your own time is valuable.

Stout and Jablonowski – p. 321/324


REFERENCES

Selected web resources for parallel computing are (occasionally) maintained at

http://www.eecs.umich.edu/~qstout/parlinks.html

Stout and Jablonowski – p. 322/324


References

[G. Amdahl 1967], “Validity of the single processor approach to achieving large scale computing capabilities”, AFIPS Conf. Proc. 30 (1967), pp. 483–485.

Co-Array Fortran: http://www.co-array.org .

[M.J. Flynn 1966], “Very high-speed computing systems”, Proc. IEEE 54 (1966), pp. 1901–1909.

[J.L. Gustafson 1988], “Reevaluating Amdahl’s Law”, Communications of the ACM 31 (1988), pp. 532–533.

Hadoop: http://hadoop.apache.org/core/

Hilbert space-filling curve: see the routines available in Zoltan (listed below).

[MacNeice et al. 2002], “PARAMESH: A parallel adaptive mesh refinement community toolkit”, Comp. Physics Commun. 128 (2000), pp. 330–354.

Metis and Parmetis: http://www.cs.umn.edu/~karypis/metis/

MPI: documentation at http://www.mpi-forum.org/

Free, portable versions at: http://www.mcs.anl.gov/research/projects/mpich2

http://www.open-mpi.org/

Stout and Jablonowski – p. 323/324


References continued

OpenMP: http://openmp.org/wp/ .

[C.M. Pancake 1996], “Is parallelism for you?”, IEEE Comp. Sci. & Engin., 3 (1996), pp. 18–37.

[Pancake, Simmons, and Yan 1995], “Performance evaluation tools for parallel and distributed systems”, IEEE Computer, Vol. 28, No. 11 (1995), pp. 16–20.

Parallel computing, a slightly whimsical explanation: http://www.eecs.umich.edu/~qstout/parallel.html

Roadrunner: http://www.lanl.gov/roadrunner/index.shtml

[Stout et al. 1997], “Adaptive blocks: A high-performance data structure”, Proc. SC’97. http://www.eecs.umich.edu/~qstout/abs/SC97.html

Top500. Website with extensive collection of references: http://www.Top500.org

UPC (Unified Parallel C): http://upc.gwu.edu .

Zoltan (collection of routines for load balancing et al.): http://www.cs.sandia.gov/Zoltan

Stout and Jablonowski – p. 324/324