parallel computer architecture - rochester institute of...

EECC722 - ShaabanEECC722 - Shaaban#1 lec # 3 Fall 2000 9-18-2000

Parallel Computer ArchitectureParallel Computer Architecture• A parallel computer is a collection of processing elements

that cooperate to solve large problems fast

• Broad issues involved:– Resource Allocation:

• Number of processing elements (PEs).

• Computing power of each element.

• Amount of physical memory used.

– Data access, Communication and Synchronization• How the elements cooperate and communicate.

• How data is transmitted between processors.

• Abstractions and primitives for cooperation.

– Performance and Scalability• Performance enhancement of parallelism: Speedup.

• Scalabilty of performance to larger systems/problems.


Exploiting Program ParallelismExploiting Program Parallelism

Instruction

Loop

Thread

Process

Leve

ls o

f P

aral

lelis

m

Grain Size (instructions)

1 10 100 1K 10K 100K 1M


The Need And Feasibility of The Need And Feasibility of Parallel ComputingParallel Computing• Application demands: More computing cycles:

– Scientific computing: CFD, Biology, Chemistry, Physics, ...– General-purpose computing: Video, Graphics, CAD, Databases,

Transaction Processing, Gaming…– Mainstream multithreaded programs, are similar to parallel programs

• Technology Trends– Number of transistors on chip growing rapidly– Clock rates expected to go up but only slowly

• Architecture Trends– Instruction-level parallelism is valuable but limited

– Coarser-level parallelism, as in MPs, the most viable approach

• Economics:– Today’s microprocessors have multiprocessor support eliminating the

need for designing expensive custom PEs– Lower parallel system cost.– Multiprocessor systems to offer a cost-effective replacement of

uniprocessor systems in mainstream computing.


Scientific Computing DemandScientific Computing Demand


Scientific Supercomputing TrendsScientific Supercomputing Trends• Proving ground and driver for innovative architecture

and advanced techniques:

– Market is much smaller relative to commercial segment

– Dominated by vector machines starting in 70s

– Meanwhile, microprocessors have made huge gains infloating-point performance

• High clock rates.

• Pipelined floating point units.

• Instruction-level parallelism.

• Effective use of caches.

• Large-scale multiprocessors replace vector supercomputers– Well under way already


Raw Uniprocessor Performance:Raw Uniprocessor Performance:LINPACKLINPACK

LIN

PA

CK

(M

FLO

PS

)

s

s

s

s

s

s

u

u

u

uu

u

uu

u

u u

1

10

100

1,000

10,000

1975 1980 1985 1990 1995 2000

sCRAYn = 100n CRAYn = 1,000

uMicro n = 100l Micro n = 1,000

CRAY 1s

Xmp/14se

Xmp/416Ymp

C90

T94

DEC 8200

IBM Power2/990MIPS R4400

HP9000/735DEC Alpha

DEC Alpha AXPHP 9000/750

IBM RS6000/540

MIPS M/2000

MIPS M/120

Sun 4/260

n

nn

n

n

n

l

l

l

l l

l ll

ll

l


Raw Parallel Performance:Raw Parallel Performance:LINPACKLINPACK

LIN

PA

CK

(GFL

OP

S)

nCRAY peaklMPP peak

Xmp /416(4)

Ymp/832(8) nCUBE/2(1024)iPSC/860

CM-2CM-200

Delta

Paragon XP/S

C90(16)

CM-5

ASCI Red

T932(32)

T3D

Paragon XP/S MP (1024)

Paragon XP/S MP (6768)

n

n

n

n

l

l

nl

l

l

ll

l

ll

0.1

1

10

100

1,000

10,000

1985 1987 1989 1991 1993 1995 1996


Tran

sist

ors

uuuuuuu

uu

u

uu

u

u

u uu

u

u

u

u

uu

uuu u

uu

u

u

u u

u

u

u

u

u

uu

u u

uuu

uuu u

uu uu u

u

uuu u

uuu

u

u uuu

1,000

10,000

100,000

1,000,000

10,000,000

100,000,000

1970 1975 1980 1985 1990 1995 2000 2005

Bit-level parallelism Instruction-level Thread-level (?)

i4004

i8008i8080

i8086

i80286

i80386

R2000

Pentium

R10000

R3000

Parallelism in Microprocessor VLSI GenerationsParallelism in Microprocessor VLSI Generations


The Goal of Parallel ComputingThe Goal of Parallel Computing• Goal of applications in using parallel machines: Speedup

Speedup (p processors) =

• For a fixed problem size (input data set),performance = 1/time

Speedup fixed problem (p processors) =

Performance (p processors)

Performance (1 processor)

Time (1 processor)

Time (p processors)


Elements of Modern ComputersElements of Modern Computers

HardwareHardwareArchitectureArchitecture

Operating SystemOperating System

Applications SoftwareApplications Software

ComputingComputing Problems Problems

AlgorithmsAlgorithmsand Dataand DataStructuresStructures

High-levelHigh-levelLanguagesLanguages

Performance Performance Evaluation Evaluation

MappingMapping

ProgrammingProgramming

BindingBinding(Compile, (Compile, Load)Load)


Elements of Modern ComputersElements of Modern Computers1 Computing Problems:

– Numerical Computing: Science and technology numericalproblems demand intensive integer and floating pointcomputations.

– Logical Reasoning: Artificial intelligence (AI) demand logicinferences and symbolic manipulations and large space searches.

2 Algorithms and Data Structures– Special algorithms and data structures are needed to specify the

computations and communication present in computingproblems.

– Most numerical algorithms are deterministic using regular datastructures.

– Symbolic processing may use heuristics or non-deterministicsearches.

– Parallel algorithm development requires interdisciplinaryinteraction.


Elements of Modern ComputersElements of Modern Computers3 Hardware Resources

– Processors, memory, and peripheral devices form thehardware core of a computer system.

– Processor instruction set, processor connectivity, memoryorganization, influence the system architecture.

4 Operating Systems– Manages the allocation of resources to running processes.

– Mapping to match algorithmic structures with hardwarearchitecture and vice versa: processor scheduling, memorymapping, interprocessor communication.

– Parallelism exploitation at: algorithm design, programwriting, compilation, and run time.


Elements of Modern ComputersElements of Modern Computers5 System Software Support

– Needed for the development of efficient programs in high-level languages (HLLs.)

– Assemblers, loaders.

– Portable parallel programming languages

– User interfaces and tools.

6 Compiler Support– Preprocessor compiler: Sequential compiler and low-level

library of the target parallel computer.

– Precompiler: Some program flow analysis, dependencechecking, limited optimizations for parallelism detection.

– Parallelizing compiler: Can automatically detectparallelism in source code and transform sequential codeinto parallel constructs.


Approaches to Parallel ProgrammingApproaches to Parallel Programming

Source code written in Source code written inconcurrent dialects of C, C++concurrent dialects of C, C++ FORTRAN, LISP FORTRAN, LISP ..

ProgrammerProgrammer

ConcurrencyConcurrencypreserving compilerpreserving compiler

ConcurrentConcurrentobject codeobject code

Execution by Execution byruntime systemruntime system

Source code written in Source code written insequential languages C, C++sequential languages C, C++ FORTRAN, LISP FORTRAN, LISP ..

ProgrammerProgrammer

ParallelizingParallelizing compiler compiler

Parallel Parallelobject codeobject code

Execution by Execution byruntime systemruntime system

(a) Implicit (a) Implicit Parallelism Parallelism

(b) Explicit (b) Explicit Parallelism Parallelism

EECC722 - ShaabanEECC722 - Shaaban

Evolution of ComputerEvolution of ComputerArchitectureArchitecture

Scalar

Sequential Lookahead

I/E Overlap FunctionalParallelism

MultipleFunc. Units Pipeline

Implicit Vector

Explicit Vector

MIMDSIMD

MultiprocessorMulticomputer

Register-to -Register

Memory-to -Memory

Processor Array

Associative Processor

Massively Parallel Processors (MPPs)

I/E: Instruction Fetch and Execute

SIMD: Single Instruction stream over Multiple Data streams

MIMD: Multiple Instruction streams over Multiple Data streams


Parallel Architectures HistoryParallel Architectures History

Application Software

System Software SIMD

Message Passing

Shared MemoryDataflow

SystolicArrays Architecture

Historically, parallel architectures tied to programming models

• Divergent architectures, with no predictable pattern of growth.


Programming ModelsProgramming Models• Programming methodology used in coding applications

• Specifies communication and synchronization

• Examples:

– Multiprogramming:

No communication or synchronization at program level

– Shared memory address space:

– Message passing:

Explicit point to point communication

– Data parallel: More regimented, global actions on data

• Implemented with shared address space or message passing


Flynn’s 1972 Classification ofFlynn’s 1972 Classification ofComputer ArchitectureComputer Architecture

• Single Instruction stream over a Single Data stream (SISD): Conventional sequential machines.

• Single Instruction stream over Multiple Data streams(SIMD): Vector computers, array of synchronized processing elements.

• Multiple Instruction streams and a Single Data stream(MISD): Systolic arrays for pipelined execution.

• Multiple Instruction streams over Multiple Data streams(MIMD): Parallel computers:

• Shared memory multiprocessors.

• Multicomputers: Unshared distributed memory,message-passing used instead.


Flynn’s Classification of Computer Architecture

Fig. 1.3 page 12 in

Advanced Computer Architecture: Parallelism,Scalability, Programmability, Hwang, 1993.


Current Trends In Current Trends In Parallel ArchitecturesParallel Architectures

• The extension of “computer architecture” to supportcommunication and cooperation:

– OLD: Instruction Set Architecture

– NEW: Communication Architecture

• Defines:– Critical abstractions, boundaries, and primitives

(interfaces)

– Organizational structures that implement interfaces(hardware or software)

• Compilers, libraries and OS are important bridges today


Modern Parallel ArchitectureModern Parallel ArchitectureLayered FrameworkLayered Framework

CAD

Multiprogramming Sharedaddress

Messagepassing

Dataparallel

Database Scientific modeling Parallel applications

Programming models

Communication abstractionUser/system boundary

Compilationor library

Operating systems support

Communication hardware

Physical communication medium

Hardware/software boundary


Shared Address Space Parallel ArchitecturesShared Address Space Parallel Architectures

• Any processor can directly reference any memory location

– Communication occurs implicitly as result of loads and stores

• Convenient:– Location transparency

– Similar programming model to time-sharing in uniprocessors• Except processes run on different processors

• Good throughput on multiprogrammed workloads

• Naturally provided on a wide range of platforms– Wide range of scale: few to hundreds of processors

• Popularly known as shared memory machines or model– Ambiguous: Memory may be physically distributed among

processors


Shared Address Space (SAS) Model• Process: virtual address space plus one or more threads of control

• Portions of address spaces of processes are shared

• Writes to shared address visible to other threads (in other processes too)

• Natural extension of the uniprocessor model:• Conventional memory operations used for communication

• Special atomic operations needed for synchronization

• OS uses shared memory to coordinate processes

St or e

P1

P2

Pn

P0

Load

P0 pr i vat e

P1 pr i vat e

P2 pr i vat e

Pn pr i vat e

Virtual address spaces for acollection of processes communicatingvia shared addresses

Machine physical address space

Shared portionof address space

Private portionof address space

Common physicaladdresses


Models of Shared-Memory MultiprocessorsModels of Shared-Memory Multiprocessors• The Uniform Memory Access (UMA) Model:

– The physical memory is shared by all processors.

– All processors have equal access to all memory addresses.

• Distributed memory or Nonuniform Memory Access(NUMA) Model:

– Shared memory is physically distributed locally amongprocessors.

• The Cache-Only Memory Architecture (COMA) Model:

– A special case of a NUMA machine where all distributedmain memory is converted to caches.

– No memory hierarchy at each processor.


Models of Shared-Memory MultiprocessorsModels of Shared-Memory Multiprocessors

I/O ctrlMem Mem Mem

Interconnect

Mem I/O ctrl

Processor Processor

Interconnect

I/Odevices

M ° ° °M M

Network

P

$

P

$

P

$

° ° °

Network

D

P

C

D

P

C

D

P

C

Distributed memory or Nonuniform Memory Access (NUMA) Model

Uniform Memory Access (UMA) ModelInterconnect: Bus, Crossbar, Multistage networkP: ProcessorM: MemoryC: CacheD: Cache directory

Cache-Only Memory Architecture (COMA)


Uniform Memory Access Example:Uniform Memory Access Example:Intel Pentium Pro QuadIntel Pentium Pro Quad

P-Pr o bus (64-bit data, 36-bit addr ess, 66 MHz)

CPU

Bus interface

MIU

P-Pr omodule

P-Pr omodule

P-Promodule256-KB

L2 $Interruptcontroller

PCIbridge

PCIbridge

Memorycontroller

1-, 2-, or 4-wayinterleaved

DRAM

PC

I bus

PC

I busPCI

I/Ocards

• All coherence and multiprocessingglue in processor module

• Highly integrated, targeted at highvolume

• Low latency and bandwidth


Uniform Memory Access Example:Uniform Memory Access Example:SUN EnterpriseSUN Enterprise

– 16 cards of either type: processors + memory, or I/O

– All memory accessed over bus, so symmetric

– Higher bandwidth, higher latency bus

Gigaplane bus (256 data, 41 addr ess, 83 MHz)

SB

US

SB

US

SB

US

2 Fi

berC

hann

el

100b

T, S

CS

I

Bus interface

CPU/memcardsP

$2

$

P

$2

$

Mem ctrl

Bus interface/switch

I/O car ds


Distributed Shared-MemoryDistributed Shared-MemoryMultiprocessor System Example:Multiprocessor System Example:

Cray T3ECray T3E

Switch

P

$

XY

Z

Exter nal I/O

Memctrl

and NI

Mem

• Scale up to 1024 processors, 480MB/s links

• Memory controller generates communication requests for nonlocalreferences

• No hardware mechanism for coherence (SGI Origin etc. provide this)


Message-Passing MulticomputersMessage-Passing Multicomputers• Comprised of multiple autonomous computers (nodes).

• Each node consists of a processor, local memory, attachedstorage and I/O peripherals.

• Programming model more removed from basic hardwareoperations

• Local memory is only accessible by local processors.

• A message passing network provides point-to-point staticconnections among the nodes.

• Inter-node communication is carried out by message passingthrough the static connection network

• Process communication achieved using a message-passingprogramming environment.


Message-Passing AbstractionMessage-Passing Abstraction

• Send specifies buffer to be transmitted and receiving process

• Recv specifies sending process and application storage to receive into

• Memory to memory copy, but need to name processes

• Optional tag on send and matching rule on receive

• User process names local data and entities in process/tag space too

• In simplest form, the send/recv match achieves pairwise synch event

• Many overheads: copying, buffer management, protection

Process P Process Q

Address Y

Address X

Send X, Q, t

Receive Y, P, tMatch

Local pr ocessaddr ess space

Local pr ocessaddr ess space


Message-Passing Example: IBM SP-2Message-Passing Example: IBM SP-2

Memory bus

MicroChannel bus

I/O

i860 NI

DMA

DR

AM

IBM SP-2 node

L2 $

Power 2CPU

Memorycontroller

4-wayinterleaved

DRAM

General interconnectionnetwork formed from8-port switches

NIC• Made out of essentially

complete RS6000workstations

• Network interfaceintegrated in I/O bus(bandwidth limited by I/Obus)


Message-Passing Example:Message-Passing Example:Intel ParagonIntel Paragon

Memory bus (64-bit, 50 MHz)

i860

L1 $

NI

DMA

i860

L1 $

Driver

Memctrl

4-wayinterleaved

DRAM

IntelParagonnode

8 bits,175 MHz,bidirectional2D grid network

with processing nodeattached to every switch

Sandia’ s Intel Paragon XP/S-based Supercomputer


Message-Passing Programming ToolsMessage-Passing Programming Tools• Message-passing programming libraries include:

– Message Passing Interface (MPI):• Provides a standard for writing concurrent message-passing

programs.

• MPI implementations include parallel libraries used byexisting programming languages.

– Parallel Virtual Machine (PVM):• Enables a collection of heterogeneous computers to used as a

coherent and flexible concurrent computational resource.

• PVM support software executes on each machine in a user-configurable pool, and provides a computationalenvironment of concurrent applications.

• User programs written for example in C, Fortran or Java areprovided access to PVM through the use of calls to PVMlibrary routines.


Data Parallel Systems Data Parallel Systems SIMD in Flynn taxonomySIMD in Flynn taxonomy• Programming model

– Operations performed in parallel on eachelement of data structure

– Logically single thread of control, performssequential or parallel steps

– Conceptually, a processor is associated witheach data element

• Architectural model– Array of many simple, cheap processors each

with little memory

• Processors don’t sequence throughinstructions

– Attached to a control processor that issuesinstructions

– Specialized and general communication, cheapglobal synchronization

• Some recent machines:– Thinking Machines CM-1, CM-2 (and CM-5)

– Maspar MP-1 and MP-2,

PE PE PE° ° °

PE PE PE° ° °

PE PE PE° ° °

° ° ° ° ° ° ° ° °

Controlprocessor


DataflowDataflow Architectures Architectures• Represent computation as a graph of essential dependences

– Logical processor at each node, activated by availability of operands

– Message (tokens) carrying tag of next instruction sent to next processor

– Tag compared with others in matching store; match fires execution

1 b

a

+ − ×

×

×

c e

d

f

Dataflow graph

f = a × d

Network

Token store

Waiting Matching

Instruction fetch

Execute

Token queue

Form token

Network

Network

Program store

a = (b +1) × (b − c) d = c × e


Systolic ArchitecturesSystolic Architectures

M

PE

M

PE PE PE

• Replace single processor with an array of regular processing elements

• Orchestrate data flow for high throughput with less memory access

• Different from pipelining

– Nonlinear array structure, multidirection data flow, each PEmay have (small) local instruction and data memory

• Different from SIMD: each PE may do something different

• Initial motivation: VLSI enables inexpensive special-purpose chips

• Represent algorithms directly by chips connected in regular pattern


Parallel ProgramsParallel Programs•• Conditions of Parallelism:Conditions of Parallelism:

–– Data DependenceData Dependence–– Control DependenceControl Dependence–– Resource DependenceResource Dependence–– Bernstein’s ConditionsBernstein’s Conditions

•• Asymptotic Notations for Algorithm AnalysisAsymptotic Notations for Algorithm Analysis• Parallel Random-Access Machine (PRAM)

–– Example: sum algorithm on P processor PRAMExample: sum algorithm on P processor PRAM•• Network Model of Message-Passing Network Model of Message-Passing MulticomputersMulticomputers

– Example: Asynchronous Matrix Vector Product on a Ring•• Levels of Parallelism in Program ExecutionLevels of Parallelism in Program Execution•• Hardware Vs. Software ParallelismHardware Vs. Software Parallelism•• Parallel Task Grain SizeParallel Task Grain Size•• Example Motivating Problems With high levels of concurrencyExample Motivating Problems With high levels of concurrency•• Limited Concurrency: Amdahl’s LawLimited Concurrency: Amdahl’s Law•• Parallel Performance Metrics: Degree of Parallelism (DOP)Parallel Performance Metrics: Degree of Parallelism (DOP)•• Concurrency ProfileConcurrency Profile

•• Steps in Creating a Parallel Program:Steps in Creating a Parallel Program:–– Decomposition, Assignment, Orchestration, MappingDecomposition, Assignment, Orchestration, Mapping–– Program Partitioning ExampleProgram Partitioning Example–– Static Multiprocessor Scheduling ExampleStatic Multiprocessor Scheduling Example


Conditions of Parallelism:Conditions of Parallelism:Data DependenceData Dependence

1 True Data or Flow Dependence: A statement S2 is datadependent on statement S1 if an execution path existsfrom S1 to S2 and if at least one output variable of S1feeds in as an input operand used by S2

denoted by S1 → → S2

2 Antidependence: Statement S2 is antidependent on S1if S2 follows S1 in program order and if the output ofS2 overlaps the input of S1

denoted by S1 +→+→ S2

3 Output dependence: Two statements are outputdependent if they produce the same output variable

denoted by S1 ο→→ S2


Conditions of Parallelism: Data DependenceConditions of Parallelism: Data Dependence

4 I/O dependence: Read and write are I/O statements.I/O dependence occurs not because the same variable isinvolved but because the same file is referenced by bothI/O statements.

5 Unknown dependence:

• Subscript of a variable is subscribed (indirectaddressing)

• The subscript does not contain the loop index.

• A variable appears more than once with subscriptshaving different coefficients of the loop variable.

• The subscript is nonlinear in the loop indexvariable.


Data and I/O Dependence: ExamplesData and I/O Dependence: Examples A -

B -

S1: Load R1,AS2: Add R2, R1S3: Move R1, R3

S4: Store B, R1

S1: Read (4),A(I) /Read array A from tape unit 4/S2: Rewind (4) /Rewind tape unit 4/S3: Write (4), B(I) /Write array B into tape unit 4/S4: Rewind (4) /Rewind tape unit 4/

S1

S3

S4 S2

Dependence graph

S1 S3I/O

I/O dependence caused by accessing thesame file by the read and write statements


Conditions of ParallelismConditions of Parallelism• Control Dependence:

– Order of execution cannot be determined before runtimedue to conditional statements.

• Resource Dependence:– Concerned with conflicts in using shared resources

including functional units (integer, floating point), memoryareas, among parallel tasks.

• Bernstein’s Conditions:

Two processes P1 , P2 with input sets I1, I2 and output setsO1, O2 can execute in parallel (denoted by P1 || P2) if:

I1 ∩ O2 = ∅I2 ∩ O1 = ∅O1 ∩ O2 = ∅


Bernstein’s Conditions: An ExampleBernstein’s Conditions: An Example• For the following instructions P1, P2, P3, P4, P5 in program order and

– Instructions are in program order– Each instruction requires one step to execute– Two adders are available

P1 : C = D x E

P2 : M = G + C

P3 : A = B + C

P4 : C = L + M

P5 : F = G ÷÷ E

Using Bernstein’s Conditions after checking statement pairs: P1 || P5 , P2 || P3 , P2 || P5 , P5 || P3 , P4 || P5

X P1

D E

+3 P4

+2 P3+1 P2

C

BG

L ÷ P5

G E

FAC

X P1

D E

+1 P2

+3 P4

÷ P5

G

B

F

C

+2 P3

AL

E GC

M

Parallel execution in three stepsassuming two adders are available per step

Sequential execution

Time

XP1

÷

P5

+2

+3 +1

P2 P4

P3

Dependence graph:Data dependence (solid lines)Resource dependence (dashed lines)


Asymptotic Notations for Algorithm AnalysisAsymptotic Notations for Algorithm Analysis Asymptotic analysis of computing time of an algorithm f(n)

ignores constant execution factors and concentrates ondetermining the order of magnitude of algorithm performance.

♦ Upper bound:Used in worst case analysis of algorithm performance.

f(n) = O(g(n))

iff there exist two positive constants c and n0 such that

| f(n) | ≤ ≤ c | g(n) | for all n > n0

⇒ i.e. g(n) an upper bound on f(n)

O(1) < O(log n) < O(n) < O(n log n) < O (n2) < O(n3) < O(2n)


Asymptotic Notations for Algorithm AnalysisAsymptotic Notations for Algorithm Analysis♦ Lower bound:

Used in the analysis of the lower limit of algorithm performance

f(n) = Ω(g(n))

if there exist positive constants c, n0 such that

| f(n) | ≥ c | g(n) | for all n > n0

⇒ i.e. g(n) is a lower bound on f(n)

♦ Tight bound: Used in finding a tight limit on algorithm performance

f(n) = Θ (g(n))

if there exist constant positive integers c1, c2, and n0 such that

c1 | g(n) | ≤ | f(n) | ≤ c2 | g(n) | for all n > n0

⇒ i.e. g(n) is both an upper and lower bound on f(n)


The Growth Rate of Common Computing FunctionsThe Growth Rate of Common Computing Functions

log n n n log n n2 n3 2n

0 1 0 1 1 2 1 2 2 4 8 4 2 4 8 16 64 16 3 8 24 64 512 256 4 16 64 256 4096 65536 5 32 160 1024 32768 4294967296


Theoretical Models of Parallel ComputersTheoretical Models of Parallel Computers• Parallel Random-Access Machine (PRAM):

– n processor, global shared memory model.

– Models idealized parallel computers with zero synchronizationor memory access overhead.

– Utilized parallel algorithm development and scalability andcomplexity analysis.

• PRAM variants: More realistic models than pure PRAM

– EREW-PRAM: Simultaneous memory reads or writes to/fromthe same memory location are not allowed.

– CREW-PRAM: Simultaneous memory writes to the samelocation is not allowed.

– ERCW-PRAM: Simultaneous reads from the same memorylocation are not allowed.

– CRCW-PRAM: Concurrent reads or writes to/from the samememory location are allowed.


Example: sum algorithm on P processor PRAMExample: sum algorithm on P processor PRAM

•Input: Array A of size n = 2k

in shared memory

•Initialized local variables: •the order n,

•number of processors p = 2q ≤ n,

• the processor number s

•Output: The sum of the elements

of A stored in shared memory

begin

1. for j = 1 to l ( = n/p) do

Set B(l(s - 1) + j): = A(l(s-1) + j)

2. for h = 1 to log n do

2.1 if (k- h - q ≥ ≥ 0) then

for j = 2k-h-q(s-1) + 1 to 2k-h-qS do

Set B(j): = B(2j -1) + B(2s)

2.2 else if (s ≤ ≤ 2k-h) then

Set B(s): = B(2s -1 ) + B(2s)

3. if (s = 1) then set S: = B(1)

endRunning time analysis:• Step 1: takes O(n/p) each processor executes n/p operations•The hth of step 2 takes O(n / (2hp)) since each processor has to perform (n / (2hp)) ¬ operations• Step three takes O(1)•Total Running time:

p hh

n

T n On

p

n

pO

n

pn( ) ( log )

log

= +

= +

=∑

21


Example: Sum Algorithm on P Processor PRAMExample: Sum Algorithm on P Processor PRAM

Operation represented by a nodeis executed by the processorindicated below the node.

B(6)=A(6)

P3

B(5)=A(5)

P3

P3

B(3)

B(8)=A(8)

P4

B(7)=A(7)

P4

P4

B(4)

B(2)=A(2)

P1

B(1)=A(1)

P1

P1

B(1)

B(4)=A(4)

P2

B(3)=A(3)

P2

P2

B(2)

B(2)

P2

B(1)

P1

B(1)

P1

S= B(1)

P1

For n = 8 p = 4Processor allocation for computing the sum of 8 elements on 4 processor PRAM

5

4

3

2

1

TimeUnit


The Power of The PRAM ModelThe Power of The PRAM Model• Well-developed techniques and algorithms to handle

many computational problems exist for the PRAM model

• Removes algorithmic details regarding synchronizationand communication, concentrating on the structuralproperties of the problem.

• Captures several important parameters of parallelcomputations. Operations performed in unit time, aswell as processor allocation.

• The PRAM design paradigms are robust and manynetwork algorithms can be directly derived from PRAMalgorithms.

• It is possible to incorporate synchronization andcommunication into the shared-memory PRAM model.


Performance of Parallel AlgorithmsPerformance of Parallel Algorithms• Performance of a parallel algorithm is typically measured

in terms of worst-case analysis.

• For problem Q with a PRAM algorithm that runs in timeT(n) using P(n) processors, for an instance size of n:– The time-processor product C(n) = T(n) . P(n) represents the

cost of the parallel algorithm.

– For P < P(n), each of the of the T(n) parallel steps issimulated in O(P(n)/p) substeps. Total simulation takesO(T(n)P(n)/p)

– The following four measures of performance areasymptotically equivalent:

• P(n) processors and T(n) time

• C(n) = P(n)T(n) cost and T(n) time

• O(T(n)P(n)/p) time for any number of processors p < P(n)

• O(C(n)/p + T(n)) time for any number of processors.


Network Model of Message-Passing MulticomputersNetwork Model of Message-Passing Multicomputers

• A network of processors can viewed as a graph G (N,E)– Each node i ∈∈ N represents a processor

– Each edge (i,j) ∈∈ E represents a two-way communicationlink between processors i and j.

– Each processor is assumed to have its own local memory.

– No shared memory is available.

– Operation is synchronous or asynchronous(messagepassing).

– Typical message-passing communication constructs:• send(X,i) a copy of X is sent to processor Pi, execution

continues.

• receive(Y, j) execution suspended until the data fromprocessor Pj is received and stored in Y then executionresumes.


Network Model of MulticomputersNetwork Model of Multicomputers• Routing is concerned with delivering each message from

source to destination over the network.

• Additional important network topology parameters:– The network diameter is the maximum distance between

any pair of nodes.

– The maximum degree of any node in G

• Example:– Linear array: P processors P1, …, Pp are connected in

a linear array where:• Processor Pi is connected to Pi-1 and Pi+1 if they exist.

• Diameter is p-1; maximum degree is 2

– A ring is a linear array of processors where processors P1and Pp are directly connected.


A Four-Dimensional HypercubeA Four-Dimensional Hypercube• Two processors are connected if their binary indices

differ in one bit position.


Example: Asynchronous Matrix Vector Product on a Ring• Input:

– n x n matrix A ; vector x of order n

– The processor number i. The number of processors

– The ith submatrix B = A( 1:n, (i-1)r +1 ; ir) of size n x r where r = n/p

– The ith subvector w = x(i - 1)r + 1 : ir) of size r

• Output:

– Processor Pi computes the vector y = A1x1 + …. Aixi and passes the resultto the right

– Upon completion P1 will hold the product Ax

Begin

1. Compute the matrix vector product z = Bw

2. If i = 1 then set y: = 0

else receive(y,left)

3. Set y: = y +z

4. send(y, right)

5. if i =1 then receive(y,left)

End

Tcomp = k(n2/p)Tcomm = p(l+ mn)T = Tcomp + Tcomm

= k(n2/p) + p(l+ mn)


Creating a Parallel ProgramCreating a Parallel Program• Assumption: Sequential algorithm to solve problem is given

– Or a different algorithm with more inherent parallelism is devised.

– Most programming problems have several parallel solutions. The bestsolution may differ from that suggested by existing sequentialalgorithms.

One must:– Identify work that can be done in parallel– Partition work and perhaps data among processes– Manage data access, communication and synchronization

– Note: work includes computation, data access and I/O

Main goal: Speedup (plus low programming effort and resource needs)

Speedup (p) =

For a fixed problem:Speedup (p) =

Performance(p)Performance(1)

Time(1)

Time(p)


Some Important ConceptsSome Important ConceptsTask:

– Arbitrary piece of undecomposed work in parallel computation

– Executed sequentially on a single processor; concurrency is onlyacross tasks

– E.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace

– Fine-grained versus coarse-grained tasks

Process (thread):– Abstract entity that performs the tasks assigned to processes

– Processes communicate and synchronize to perform their tasks

Processor:– Physical engine on which process executes– Processes virtualize machine to programmer

• first write program in terms of processes, then map to processors


Levels of Parallelism in Program ExecutionLevels of Parallelism in Program Execution

Jobs or programs (Multiprogramming)Level 5

Subprograms, job steps or related parts of a program

Level 4

Procedures, subroutines, or co-routines

Level 3

Non-recursive loops or unfolded iterations

Level 2

Instructions or statements

Level 1

Increasing communicationsdemand and mapping/scheduling overhead

Higherdegree ofParallelism

MediumGrain

CoarseGrain

FineGrain


Hardware and Software ParallelismHardware and Software Parallelism• Hardware parallelism:

– Defined by machine architecture, hardware multiplicity(number of processors available) and connectivity.

– Often a function of cost/performance tradeoffs.

– Characterized in a single processor by the number ofinstructions k issued in a single cycle (k-issue processor).

– A multiprocessor system with n k-issue processor canhandle a maximum limit of nk threads.

• Software parallelism:– Defined by the control and data dependence of programs.

– Revealed in program profiling or program flow graph.

– A function of algorithm, programming style and compileroptimization.


Computational Parallelism and Grain SizeComputational Parallelism and Grain Size• Grain size (granularity) is a measure of the amount of

computation involved in a task in parallel computation :– Instruction Level:

• At instruction or statement level.

• 20 instructions grain size or less.

• For scientific applications, parallelism at this level rangefrom 500 to 3000 concurrent statements

• Manual parallelism detection is difficult but assisted byparallelizing compilers.

– Loop level:• Iterative loop operations.

• Typically, 500 instructions or less per iteration.

• Optimized on vector parallel computers.

• Independent successive loop operations can be vectorized orrun in SIMD mode.


Computational Parallelism and Grain SizeComputational Parallelism and Grain Size– Procedure level:

• Medium-size grain; task, procedure, subroutine levels.

• Less than 2000 instructions.

• More difficult detection of parallel than finer-grain levels.

• Less communication requirements than fine-grainparallelism.

• Relies heavily on effective operating system support.

– Subprogram level:• Job and subprogram level.

• Thousands of instructions per grain.

• Often scheduled on message-passing multicomputers.

– Job (program) level, or Multiprogrammimg:• Independent programs executed on a parallel computer.

• Grain size in tens of thousands of instructions.


Example Motivating Problems:Example Motivating Problems:Simulating Ocean CurrentsSimulating Ocean Currents

– Model as two-dimensional grids

– Discretize in space and time

• finer spatial and temporal resolution => greater accuracy

– Many different computations per time step

• set up and solve equations

– Concurrency across and within grid computations

(a) Cross sections (b) Spatial discretization of a cross section


Example Motivating Problems:Simulating Galaxy EvolutionSimulating Galaxy Evolution

m1m2

r2

• Many time-steps, plenty of concurrency across stars within one

Star on which for cesare being computed

Star too close toapproximate

Small gr oup far enough away toapproximate to center of mass

Large gr oup farenough away toapproximate

–Simulate the interactions of many stars evolving over time

–Computing forces is expensive

–O(n2) brute force approach

–Hierarchical Methods take advantage of force law: G


Example Motivating Problems:Example Motivating Problems:Rendering Scenes by Ray TracingRendering Scenes by Ray Tracing

– Shoot rays into scene through pixels in image plane

– Follow their paths• They bounce around as they strike objects

• They generate new rays: ray tree per input ray

– Result is color and opacity for that pixel

– Parallelism across rays

• All above case studies have abundant concurrency


Limited Concurrency: Amdahl’s LawLimited Concurrency: Amdahl’s Law–Most fundamental limitation on parallel speedup.– If fraction s of seqeuential execution is inherently serial,

speedup <= 1/s–Example: 2-phase calculation

• sweep over n-by-n grid and do some independent computation• sweep again and add each value to global sum

–Time for first phase = n2/p–Second phase serialized at global variable, so time = n2

–Speedup <= or at most 2

–Possible Trick: divide second phase into two• Accumulate into private sum during sweep• Add per-process private sum into global sum

–Parallel time is n2/p + n2/p + p, and speedup at best

2n2

n2

p + n2

2n2

2n2 + p2


Amdahl’s Law Example:Amdahl’s Law Example:A Pictorial DepictionA Pictorial Depiction

1

p

1

p

1

n2/p

n2

p

wor

k do

ne c

oncu

rren

tly

n2

n2

Timen2/p n2/p

(c)

(b)

(a)


Parallel Performance MetricsParallel Performance MetricsDegree of Parallelism (DOP)Degree of Parallelism (DOP)

• For a given time period, DOP reflects the number of processors ina specific parallel computer actually executing a particular parallelprogram.

• Average Parallelism:– given maximum parallelism = m– n homogeneous processors

– computing capacity of a single processor ∆– Total amount of work W (instructions, computations):

or as a discrete summation W i ii

m

t==

∑∆ .1

W DOP t dtt

t

= ∫∆ ( )1

2

A DOP t dtt t t

t

=− ∫1

2 1 1

2

( ) A i ii

m

ii

m

t t=

= =∑ ∑.

1 1

ii

m

t t t=∑ = −

12 1Where ti is the total time that DOP = i and

The average parallelism A:

In discrete form


Example: Concurrency Profile ofExample: Concurrency Profile ofA Divide-and-Conquer AlgorithmA Divide-and-Conquer Algorithm

• Execution observed from t1 = 2 to t2 = 27• Peak parallelism m = 8• A = (1x5 + 2x3 + 3x4 + 4x6 + 5x2 + 6x2 + 8x3) / (5 + 3+4+6+2+2+3) = 93/25 = 3.72

Degree of Parallelism (DOP)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27

11

10

9

8

7

6

5

4

3

2

1

Timet1 t2

A i ii

m

ii

m

t t=

= =

∑ ∑.1 1


Parallel Performance ExampleParallel Performance Example• The execution time T for three parallel programs is given in terms of processor

count P and problem size N

• In each case, we assume that the total computation work performed byan optimal sequential algorithm scales as N+N2 .

1 For first parallel algorithm: T = N + N2/P This algorithm partitions the computationally demanding O(N2) component of

the algorithm but replicates the O(N) component on every processor. Thereare no other sources of overhead.

2 For the second parallel algorithm: T = (N+N2 )/P + 100 This algorithm optimally divides all the computation among all processors but

introduces an additional cost of 100.

3 For the third parallel algorithm: T = (N+N2 )/P + 0.6P2

This algorithm also partitions all the computation optimally but introducesan additional cost of 0.6P2.

• All three algorithms achieve a speedup of about 10.8 when P = 12 and N=100 . However,they behave differently in other situations as shown next.

• With N=100 , all three algorithms perform poorly for larger P , although Algorithm (3)does noticeably worse than the other two.

• When N=1000 , Algorithm (2) is much better than Algorithm (1) for larger P .


ParallelParallelPerformancePerformance

ExampleExample(continued)(continued)

All algorithms achieve:Speedup = 10.8 when P = 12 and N=100

N=1000 , Algorithm (2) performs much better than Algorithm (1)for larger P .

Algorithm 1: T = N + N2/PAlgorithm 2: T = (N+N2 )/P + 100Algorithm 3: T = (N+N2 )/P + 0.6P2


Steps in Creating a Parallel ProgramSteps in Creating a Parallel Program

• 4 steps:

Decomposition, Assignment, Orchestration, Mapping– Done by programmer or system software (compiler,

runtime, ...)– Issues are the same, so assume programmer does it all

explicitly

P0

Tasks Processes Processors

P1

P2 P3

p0 p1

p2 p3

p0 p1

p2 p3

Partitioning

Sequentialcomputation

Parallelprogram

Assignment

Decomposition

Mapping

Orchestration


DecompositionDecomposition• Break up computation into concurrent tasks to be divided among

processes:– Tasks may become available dynamically.– No. of available tasks may vary with time.– Together with assignment, also called partitioning.

i.e. identify concurrency and decide level at which to exploit it.

• Grain-size problem:– To determine the number and size of grains or tasks in a parallel

program.– Problem and machine-dependent.– Solutions involve tradeoffs between parallelism, communication and

scheduling/synchronization overhead.

• Grain packing:– To combine multiple fine-grain nodes into a coarse grain node (task)

to reduce communication delays and overall scheduling overhead.

Goal: Enough tasks to keep processes busy, but not too many– No. of tasks available at a time is upper bound on achievable speedup


AssignmentAssignment• Specifying mechanisms to divide work up among processes:

– Together with decomposition, also called partitioning.– Balance workload, reduce communication and management cost

• Partitioning problem:

– To partition a program into parallel branches, modules to givethe shortest possible execution on a specific parallel architecture.

• Structured approaches usually work well:– Code inspection (parallel loops) or understanding of application.– Well-known heuristics.– Static versus dynamic assignment.

• As programmers, we worry about partitioning first:– Usually independent of architecture or programming model.– But cost and complexity of using primitives may affect decisions.


OrchestrationOrchestration– Naming data.

– Structuring communication.

– Synchronization.

– Organizing data structures and scheduling tasks temporally.

• Goals– Reduce cost of communication and synch. as seen by processors

– Reserve locality of data reference (incl. data structure organization)

– Schedule tasks to satisfy dependences early

– Reduce overhead of parallelism management

• Closest to architecture (and programming model &language).– Choices depend a lot on comm. abstraction, efficiency of primitives.

– Architects should provide appropriate primitives efficiently.


MappingMapping• Each task is assigned to a processor in a manner that attempts to

satisfy the competing goals of maximizing processor utilization andminimizing communication costs.

• Mapping can be specified statically or determined at runtime byload-balancing algorithms (dynamic scheduling).

• Two aspects of mapping:– Which processes will run on the same processor, if necessary– Which process runs on which particular processor

• mapping to a network topology

• One extreme: space-sharing– Machine divided into subsets, only one app at a time in a subset– Processes can be pinned to processors, or left to OS.

• Another extreme: complete resource management control to OS– OS uses the performance techniques we will discuss later.

• Real world is between the two.– User specifies desires in some aspects, system may ignore


Program Partitioning ExampleProgram Partitioning Example

Example 2.4 page 64Fig 2.6 page 65Fig 2.7 page 66In Advanced ComputerArchitecture, Hwang


Static Multiprocessor SchedulingStatic Multiprocessor SchedulingDynamic multiprocessor scheduling is an NP-hard problem.

Node Duplication: to eliminate idle time and communication delays, somenodes may be duplicated in more than one processor.

Fig. 2.8 page 67

Example: 2.5 page 68In Advanced ComputerArchitecture, Hwang


Table 2.1 Steps in the Parallelization Process and Their Goals

StepArchitecture-Dependent? Major Performance Goals

Decomposition Mostly no Expose enough concurrency but not too much

Assignment Mostly no Balance workloadReduce communication volume

Orchestration Yes Reduce noninherent communication via data locality

Reduce communication and synchronization cost as seen by the processor

Reduce serialization at shared resourcesSchedule tasks to satisfy dependences early

Mapping Yes Put related processes on the same processor if necessary

Exploit locality in network topology


Successive RefinementSuccessive RefinementPartitioning is often independent of architecture, and may bedone first:

– View machine as a collection of communicating processors• Balancing the workload.

• Reducing the amount of inherent communication

• Reducing extra work.

– Above three issues are conflicting.

Then deal with interactions with architecture:– View machine as an extended memory hierarchy

• Extra communication due to architectural interactions.

• Cost of communication depends on how it is structured

– This may inspire changes in partitioning.


Partitioning for PerformancePartitioning for Performance• Balancing the workload and reducing wait time at synch

points

• Reducing inherent communication.

• Reducing extra work.

These algorithmic issues have extreme trade-offs:– Minimize communication => run on 1 processor.

=> extreme load imbalance.

– Maximize load balance => random assignment of tiny tasks.

=> no control over communication.

– Good partition may imply extra work to compute or manage it

• The goal is to compromise between the above extremes– Fortunately, often not difficult in practice.


Load Balancing and Synch WaitLoad Balancing and Synch WaitTime ReductionTime Reduction

Limit on speedup:

– Work includes data access and other costs.

– Not just equal work, but must be busy at same time.

Four parts to load balancing and reducing synch wait time:

1. Identify enough concurrency.

2. Decide how to manage it.

3. Determine the granularity at which to exploit it

4. Reduce serialization and cost of synchronization

Sequential WorkMax Work on any Processor

Speedupproblem(p) ≤


Managing ConcurrencyManaging ConcurrencyStatic versus Dynamic techniques

Static:– Algorithmic assignment based on input; won’t change

– Low runtime overhead

– Computation must be predictable

– Preferable when applicable (except inmultiprogrammed/heterogeneous environment)

Dynamic:– Adapt at runtime to balance load

– Can increase communication and reduce locality

– Can increase task management overheads


Dynamic Load BalancingDynamic Load Balancing• To achieve best performance of a parallel computing system running

a parallel problem, it’s essential to maximize processor utilization bydistributing the computation load evenly or balancing the loadamong the available processors.

• Optimal static load balancing, optimal mapping or scheduling, is anintractable NP-complete problem, except for specific problems onspecific networks.

• Hence heuristics are usually used to select processors for processes.

• Even the best static mapping may offer the best execution time due tochanging conditions at runtime and the process may need to donedynamically.

• The methods used for balancing the computational load dynamicallyamong processors can be broadly classified as:

1. Centralized dynamic load balancing.

2. Decentralized dynamic load balancing.


Processor Load BalanceProcessor Load Balance& Performance& Performance


Dynamic Tasking with Task QueuesDynamic Tasking with Task QueuesCentralized versus distributed queues.

Task stealing with distributed queues.– Can compromise communication and locality, and increase

synchronization.– Whom to steal from, how many tasks to steal, ...– Termination detection– Maximum imbalance related to size of task

QQ 0 Q2Q1 Q3

All remove tasks

P0 inserts P1 inserts P2 inserts P3 inserts

P0 removes P1 removes P2 removes P3 removes

(b) Distributed task queues (one per pr ocess)

Others maysteal

All processesinsert tasks

(a) Centralized task queue


Implications of Load BalancingImplications of Load BalancingExtends speedup limit expression to:

Speedupproblem(p) ≤

Generally, responsibility of software

Architecture can support task stealing and synch efficiently– Fine-grained communication, low-overhead access to queues

• Efficient support allows smaller tasks, better load balancing

– Naming logically shared data in the presence of task stealing• Need to access data of stolen tasks, esp. multiply-stolen tasks

=> Hardware shared address space advantageous

– Efficient support for point-to-point communication

Sequential Work

Max (Work + Synch Wait Time)


Reducing Inherent CommunicationReducing Inherent Communication

Measure: communication to computation ratio

Focus here is on inherent communication– Determined by assignment of tasks to processes

– Actual communication can be greater

• Assign tasks that access same data to same process

• Optimal solution to reduce communication andachive an optimal load balance is NP-hard in thegeneral case

• Simple heuristic solutions work well in practice:– Due to specific structure of applications.


Implications of Communication-to-Implications of Communication-to-Computation RatioComputation Ratio

• Architects must examine application needs

• If denominator is execution time, ratio gives average BW needs

• If operation count, gives extremes in impact of latency andbandwidth– Latency: assume no latency hiding– Bandwidth: assume all latency hidden– Reality is somewhere in between

• Actual impact of communication depends on structure andcost as well:

– Need to keep communication balanced across processors aswell.

Sequential Work

Max (Work + Synch Wait Time + Comm Cost)Speedup <


Reducing Extra Work (Overheads)Reducing Extra Work (Overheads)• Common sources of extra work:

– Computing a good partition e.g. partitioning in Barnes-Hut or sparse matrix

– Using redundant computation to avoid communication

– Task, data and process management overhead• Applications, languages, runtime systems, OS

– Imposing structure on communication• Coalescing messages, allowing effective naming

• Architectural Implications:– Reduce need by making communication and orchestration

efficient

Sequential Work

Max (Work + Synch Wait Time + Comm Cost + Extra Work)Speedup <


Extended Memory-Hierarchy View ofExtended Memory-Hierarchy View ofMultiprocessorsMultiprocessors

• Levels in extended hierarchy:– Registers, caches, local memory, remote memory

(topology)

– Glued together by communication architecture

– Levels communicate at a certain granularity of datatransfer

• Need to exploit spatial and temporal locality in hierarchy– Otherwise extra communication may also be caused

– Especially important since communication is expensive


Extended HierarchyExtended Hierarchy• Idealized view: local cache hierarchy + single main memory

• But reality is more complex:

– Centralized Memory: caches of other processors

– Distributed Memory: some local, some remote; + networktopology

– Management of levels:• Caches managed by hardware• Main memory depends on programming model

– SAS: data movement between local and remote transparent

– Message passing: explicit

– Improve performance through architecture or programlocality

– Tradeoff with parallelism; need good node performance andparallelism


ArtifactualArtifactual Communication in Extended Hierarchy Communication in Extended Hierarchy

Accesses not satisfied in local portion cause communication– Inherent communication, implicit or explicit, causes

transfers• Determined by program

– Artifactual communication:

• Determined by program implementation and arch. interactions• Poor allocation of data across distributed memories• Unnecessary data in a transfer• Unnecessary transfers due to system granularities• Redundant communication of data• finite replication capacity (in cache or main memory)

– Inherent communication assumes unlimited capacity, smalltransfers, perfect knowledge of what is needed.

– More on artifactual communication later; first considerreplication-induced further


Structuring CommunicationStructuring CommunicationGiven amount of comm (inherent or artifactual), goal is to reduce cost

• Cost of communication as seen by process:

C = f * ( o + l + + tc - overlap)

• f = frequency of messages• o = overhead per message (at both ends)

• l = network delay per message• nc = total data sent• m = number of messages

• B = bandwidth along path (determined by network, NI, assist)• tc = cost induced by contention per message• overlap = amount of latency hidden by overlap with comp. or comm.

– Portion in parentheses is cost of a message (as seen by processor)

– That portion, ignoring overlap, is latency of a message

– Goal: reduce terms in latency and increase overlap

nc/mB


Reducing OverheadReducing Overhead• Can reduce no. of messages m or overhead per message o

• o is usually determined by hardware or system software– Program should try to reduce m by coalescing messages

– More control when communication is explicit

• Coalescing data into larger messages:– Easy for regular, coarse-grained communication

– Can be difficult for irregular, naturally fine-grainedcommunication

• May require changes to algorithm and extra work– coalescing data and determining what and to whom to send

• Will discuss more in implications for programming modelslater


Reducing Network DelayReducing Network Delay• Network delay component = f*h*th

• h = number of hops traversed in network

• th = link+switch latency per hop

• Reducing f: Communicate less, or make messages larger

• Reducing h:

– Map communication patterns to network topology e.g. nearest-neighbor on mesh and ring; all-to-all

– How important is this?

• Used to be a major focus of parallel algorithms

• Depends on no. of processors, how th, compares with othercomponents

• Less important on modern machines– Overheads, processor count, multiprogramming


Overlapping CommunicationOverlapping Communication• Cannot afford to stall for high latencies

• Overlap with computation or communication to hidelatency

• Requires extra concurrency (slackness), higherbandwidth

• Techniques:

– Prefetching

– Block data transfer

– Proceeding past communication

– Multithreading


Summary of TradeoffsSummary of Tradeoffs• Different goals often have conflicting demands

– Load Balance• Fine-grain tasks

• Random or dynamic assignment

– Communication• Usually coarse grain tasks

• Decompose to obtain locality: not random/dynamic

– Extra Work• Coarse grain tasks

• Simple assignment

– Communication Cost:• Big transfers: amortize overhead and latency

• Small transfers: reduce contention


Relationship Between PerspectivesRelationship Between Perspectives

Synch wait

Data-remote

Data-localOrchestration

Busy-overheadExtra work

Performance issueParallelization step(s) Processor time component

Decomposition/assignment/orchestration

Decomposition/assignment

Decomposition/assignment

Orchestration/mapping

Load imbalance and synchronization

Inherent communication volume

Artifactual communication and data locality

Communication structure


SummarySummarySpeedupprob(p) =

– Goal is to reduce denominator components

– Both programmer and system have role to play

– Architecture cannot do much about load imbalance or toomuch communication

– But it can:

• reduce incentive for creating ill-behaved programs (efficientnaming, communication and synchronization)

• reduce artifactual communication

• provide efficient naming for flexible assignment

• allow effective overlapping of communication

Busy(1) + Data(1)

Busyuseful(p)+Datalocal(p)+Synch(p)+Dateremote(p)+Busyoverhead(p)


Generic Distributed MemoryGeneric Distributed MemoryOrganizationOrganization

• Network bandwidth?• Bandwidth demand?

– Independent processes?– Communicating processes?

• Latency? O(log2P) increase?• Cost scalability of system?

° ° °

Scalable network

CA

P

$

Switch

M

Switch Switch

Multi-stageinterconnectionnetwork (MIN)?Custom-designed?

Node:O(10) Bus-based SMP

Custom-designed CPU?Node/System integration level?How far? Cray-on-a-Chip? SMP-on-a-Chip?

OS Supported?Network protocols?

Communication AssistExtend of functionality?

MessagetransactionDMA?

Global virtual Shared address space?

parallel computer architecture - rochester institute of...

Documents