CSE 160/Berman
Models of Parallel Computation
W+A: Appendix D
"LogP: Towards a Realistic Model of Parallel Computation", PPOPP, May 1993
Alpern, B., L. Carter, and J. Ferrante, "Modeling Parallel Computers as Memory Hierarchies," Programming Models for Massively Parallel Computers, Giloi, W. K., S. Jahnichen, and B. D. Shriver, eds., IEEE Press, 1993.
Computation Models
• A model provides an underlying abstraction useful for analyzing costs and designing algorithms.
• Serial computational models use the RAM or the Turing machine (TM) as the underlying model for algorithm design.
RAM [Random Access Machine]
• an unalterable program consisting of optionally labeled instructions
• memory composed of a sequence of words, each capable of containing an arbitrary integer
• an accumulator, referenced implicitly by most instructions
• a read-only input tape
• a write-only output tape
RAM Assumptions
• We assume
  – all instructions take the same time to execute
  – word length is unbounded
  – the RAM has arbitrary amounts of memory
  – arbitrary memory locations can be accessed in the same amount of time
• RAM provides an ideal model of a serial computer for analyzing the efficiency of serial algorithms.
PRAM [Parallel Random Access Machine]
• PRAM provides an ideal model of a parallel computer for analyzing the efficiency of parallel algorithms.
• A PRAM is composed of
  – P unmodifiable programs, each composed of optionally labeled instructions
  – a single shared memory composed of a sequence of words, each capable of containing an arbitrary integer
  – P accumulators, one associated with each program
  – a read-only input tape
  – a write-only output tape
More PRAM
• The PRAM is a synchronous, MIMD, shared-memory parallel computer.
• Different protocols can be used for reading and writing shared memory:
  – EREW (exclusive read, exclusive write)
  – CREW (concurrent read, exclusive write)
  – CRCW (concurrent read, concurrent write) -- requires an additional protocol for arbitrating write conflicts
• A PRAM can emulate a message-passing machine by logically dividing shared memory into private memories for the P processors.
Broadcasting on a PRAM
• "Broadcast" can be done on a CREW PRAM in O(1) time:
  – the broadcaster writes the value to shared memory
  – all processors read it from shared memory
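The two steps above can be simulated in a few lines of Python (a sketch only; the function name and the list-based representation of shared memory are ours, not part of the PRAM model):

```python
# Simulate a CREW PRAM broadcast: one exclusive write, then P concurrent reads.
def crew_broadcast(value, P):
    shared = [None]                           # one word of shared memory
    shared[0] = value                         # step 1: broadcaster writes (exclusive write)
    local = [shared[0] for _ in range(P)]     # step 2: all P processors read concurrently
    return local                              # every processor now holds the value

print(crew_broadcast(42, 4))  # → [42, 42, 42, 42]
```

The point of the model: both steps take one synchronous time unit each, independent of P, which is why the broadcast is O(1) on a CREW PRAM.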
LogP machine model
• A model of a distributed-memory multicomputer
• Developed by Culler, Karp, Patterson, et al.
• The authors tried to model prevailing parallel architectures (circa 1993).
• The machine model represents the prevalent MPP organization:
  – a machine constructed from at most a few thousand nodes
  – each node contains a powerful processor
  – each node contains substantial memory
  – the interconnection structure has limited bandwidth
  – the interconnection structure has significant latency
LogP parameters
• L: upper bound on latency incurred by sending a message from a source to a destination
• o: overhead, defined as the time the processor is engaged in sending or receiving a message, during which time it cannot do anything else
• g: gap, defined as the minimum time between consecutive message transmissions or receptions
• P: number of processor/memory modules
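As a small illustration of the parameters, the end-to-end cost of a single small message can be written out directly (a sketch; the helper name and the sample values are ours):

```python
# Under LogP, one small message costs the sender o cycles, spends L cycles
# in the network, and costs the receiver o cycles on arrival.
def point_to_point(L, o):
    return o + L + o  # send overhead + latency + receive overhead

print(point_to_point(L=6, o=2))  # → 10
```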
LogP Assumptions
• The network has finite capacity.
  – At most ceiling(L/g) messages can be in transit from any one processor to any other at one time.
• Communication is asynchronous.
  – The latency and order of messages are unpredictable.
• All messages are small.
• Context-switching overhead is 0 (not modeled).
• Multithreading (virtual processors) may be employed, but only up to a limit of L/g virtual processors.
LogP notes
• All parameters are measured in processor cycles.
• Local operations take one cycle.
• Messages are assumed to be small.
• LogP was particularly well suited to modeling the CM-5. It is not clear whether the same correlation holds for other machines.
LogP Analysis of PRAM Broadcasting Algorithm
• Algorithm:
  – Broadcaster sends the value to shared memory (we'll assume the value is in P0's memory)
  – P processors read from shared memory (the other processors receive messages from P0)
• Time for P0 to send P messages = o + g(P-1)
• Maximum time for the other processors to receive their messages = o + (P-2)g + o + L + o
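The two formulas above can be checked numerically (a sketch; the function names and the sample parameter values L=6, g=4, o=2 are ours):

```python
# The slide's LogP cost formulas for the flat (PRAM-style) broadcast.
def flat_send_time(P, o, g):
    # time the broadcaster P0 is busy injecting its messages
    return o + g * (P - 1)

def flat_last_receive_time(P, L, o, g):
    # time until the last processor has received its message
    return o + (P - 2) * g + o + L + o

print(flat_send_time(P=8, o=2, g=4))               # → 30
print(flat_last_receive_time(P=7, L=6, o=2, g=4))  # → 32, i.e. 5g + 3o + L
```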
Efficient Broadcasting in LogP Model
• The gap includes the overhead time, so overhead < gap.
[Figure: time-space diagram of the optimized broadcast on processors P0-P7, showing the send/receive overheads (o), gaps (g), and latencies (L) along each message path]
Mapping induced by LogP Broadcasting algorithm on 8 processors
[Figure: the broadcast tree induced on processors P0-P7, each labeled with its message arrival time (0, 10, 14, 18, 20, 22, 24, 24), shown alongside the time-space diagram of the schedule]
Analysis of LogP Broadcasting Algorithm to 7 Processors
• Time for the first processor (P5) to receive its message from P0 is L + 2o.
• Time for the last processor to receive its message is max{3g+L+2o, 2g+L+2o, g+2L+4o, 4o+2L, g+4o+2L} = max{3g+L+2o, g+2L+4o}.
• Compare to the LogP analysis of the PRAM broadcast, which is o + (P-2)g + o + L + o = 5g + 3o + L.
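The comparison can be evaluated with, say, L=6, g=4, o=2 (a sketch; the function names and parameter values are ours):

```python
# Critical-path bound for the LogP tree broadcast (from the slide's max)
def tree_broadcast_time(L, o, g):
    return max(3 * g + L + 2 * o, g + 2 * L + 4 * o)

# Cost of the flat PRAM-style broadcast for the same slide scenario
def flat_broadcast_time(L, o, g):
    return 5 * g + 3 * o + L

print(tree_broadcast_time(6, 2, 4))  # → max(22, 24) = 24
print(flat_broadcast_time(6, 2, 4))  # → 32
```

With these values the tree broadcast finishes at time 24, matching the final arrival times in the mapping figure, versus 32 for the flat broadcast.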
Scalable Performance
• LogP Broadcast utilizes tree structure to optimize broadcast time
• Tree depends on values of L,o,g,P
• Strategy is much more scalable (and ultimately more efficient) than PRAM Broadcast
Moral
• Analysis can be no better than the underlying model. The more accurate the model, the more accurate the analysis.
• (This is why we use TM to determine undecidability but RAM to determine complexity.)
Other Models used for Analysis
• BSP (Bulk Synchronous Parallel)
  – a slight precursor to, and competitor of, LogP
• PMH (Parallel Memory Hierarchy)
  – focuses on memory costs
BSP [Bulk Synchronous Parallel]
• BSP was proposed by Valiant.
• The BSP model consists of
  – P processors, each with local memory
  – a communication network for point-to-point message passing between processors
  – a mechanism for synchronizing all or some of the processors at defined intervals
BSP Programs
• BSP programs are composed of supersteps.
• In each superstep, processors execute L computational steps using locally stored data, and send and receive messages.
• Processors synchronize at the end of each superstep (at which time all messages have been received).
• BSP programs can be implemented through mechanisms like the Oxford BSP library (C routines for implementing BSP programs) and BSP-L.
[Figure: alternating supersteps separated by synchronization barriers]
BSP Parameters
• P: number of processors (each with memory)
• L: synchronization periodicity
• g: communication cost
• s: processor speed (measured in number of time steps/second)
• A processor sends at most h messages and receives at most h messages in a single superstep (such a communication pattern is called an h-relation).
BSP Notes
• A complete program is a set of supersteps.
• Communication startup is not modeled; g is for continuous-traffic conditions.
• Message size is one data word.
• More than one process or thread can be executed by a processor.
• It is generally assumed that computation and communication are not overlapped.
• Time for a superstep = (max number of local operations performed by any processor) + g × (max number of messages sent or received by any processor) + L
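The superstep cost formula can be written out directly (a sketch; the function name and the sample values are ours):

```python
# BSP cost of one superstep:
#   w: max local operations on any processor
#   h: max messages sent or received by any processor (the h-relation)
#   g: per-message communication cost, L: synchronization cost
def superstep_time(w, h, g, L):
    return w + g * h + L

print(superstep_time(w=100, h=4, g=5, L=20))  # → 140
```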
BSP Analysis of PRAM Broadcast
• Algorithm:
  – Broadcaster sends the value to shared memory (we'll assume the value is in P0's memory)
  – P processors read from shared memory (the other processors receive messages from P0)
• In the BSP model, processors may send or receive at most h messages in a single superstep, so broadcasting to more than h processors requires a tree structure.
  – If there were more than Lh processors, then a tree broadcast would require more than one superstep.
• How much time does it take for a P-processor broadcast?
BSP Analysis of PRAM Broadcast
• How much time does it take for a P-processor broadcast?
[Figure: an h-ary broadcast tree]
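One way to estimate the answer: with fan-out h, the number of processors that know the value grows by a factor of (h+1) each superstep, so roughly log base (h+1) of P supersteps are needed. A hedged Python sketch, assuming one local operation per superstep and the superstep cost formula from the BSP notes (the function name and sample values are ours):

```python
# Estimate the BSP cost of a tree broadcast to P processors with fan-out h.
def bsp_tree_broadcast_time(P, h, g, L, w=1):
    informed, rounds = 1, 0
    while informed < P:
        informed *= h + 1   # every informed processor sends to h new ones
        rounds += 1
    return rounds * (w + g * h + L)  # each superstep costs w + g*h + L

print(bsp_tree_broadcast_time(P=8, h=1, g=5, L=20))  # 3 supersteps → 78
print(bsp_tree_broadcast_time(P=8, h=7, g=5, L=20))  # 1 superstep  → 56
```

Note the trade-off the model exposes: a larger h means fewer supersteps (fewer L terms) but a more expensive h-relation in each one.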
PMH [Parallel Memory Hierarchy] Model
• PMH seeks to represent memory. The goal is to model algorithms so that good decisions can be made about where to allocate data during execution.
• The model represents the costs of interprocessor communication and of memory-hierarchy traffic (e.g., between main memory and disk, or between registers and cache).
• Proposed by Alpern, Carter, and Ferrante
PMH Model
• The computer is modeled as a tree of memory modules with the processors at the leaves.
• All data movement takes the form of block transfers between children and their parents.
• In the tree of modules:
  – all modules hold data
  – leaf modules also perform computation
  – the data in a module is partitioned into blocks
  – each module has 4 parameters
Un-parameterized PMH Models for a Cluster of Workstations
• Bandwidth from processor to disk > bandwidth from processor to network
• Bandwidth between 2 processors > bandwidth to disk
[Figure: two PMH trees for a cluster of workstations, one with disks attached below each node's main memory and one with a shared disk system on the network; both trees run from the network through main memories and caches down to ALU/registers]
PMH Module Parameters
• Blocksize s_m tells how many bytes there are per block of module m.
• Blockcount n_m tells how many blocks fit in m.
• Childcount c_m tells how many children m has.
• Transfer time t_m tells how many cycles it takes to transfer a block between m and its parent.
• The size of a "node" and the length of an "edge" in the PMH graph should correspond to the blocksize, blockcount, and transfer time.
• Generally, all modules at a given level of the tree have the same parameters.
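The four parameters can be collected into a simple record (illustrative only; the class and field names are ours, and the sample cache values are made up):

```python
from dataclasses import dataclass

# One PMH memory module, carrying the four per-module parameters.
@dataclass
class Module:
    blocksize: int      # s_m: bytes per block
    blockcount: int     # n_m: blocks that fit in the module
    childcount: int     # c_m: number of children in the tree
    transfer_time: int  # t_m: cycles to move one block to/from the parent

# e.g. a cache level: 64-byte blocks, 512 of them, 10-cycle block transfers
cache = Module(blocksize=64, blockcount=512, childcount=1, transfer_time=10)
print(cache.blocksize * cache.blockcount)  # module capacity in bytes → 32768
```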
Summary
• The goal of parallel computation models is to provide a realistic representation of the costs of programming.
• A model gives algorithm designers and programmers a measure of algorithm complexity that helps them decide what is "good" (i.e., performance-efficient).
• Next up: Mapping and Scheduling