CSE 160/Berman
Models of Parallel Computation
W+A: Appendix D
"LogP: Towards a Realistic Model of Parallel Computation", PPOPP, May 1993
Alpern, B., L. Carter, and J. Ferrante, "Modeling Parallel Computers as Memory Hierarchies," Programming Models for Massively Parallel Computers, Giloi, W. K., S. Jahnichen, and B. D. Shriver, eds., IEEE Press, 1993.
Computation Models
• A model provides an underlying abstraction useful for analyzing costs and designing algorithms.
• Serial computational models use the RAM or the Turing machine (TM) as the underlying model for algorithm design.
RAM [Random Access Machine]
• an unalterable program consisting of optionally labeled instructions
• memory composed of a sequence of words, each capable of containing an arbitrary integer
• an accumulator, referenced implicitly by most instructions
• a read-only input tape
• a write-only output tape
RAM Assumptions
• We assume
  – all instructions take the same time to execute
  – word length is unbounded
  – the RAM has arbitrary amounts of memory
  – arbitrary memory locations can be accessed in the same amount of time
• RAM provides an ideal model of a serial computer for analyzing the efficiency of serial algorithms.
PRAM [Parallel Random Access Machine]
• PRAM provides an ideal model of a parallel computer for analyzing the efficiency of parallel algorithms.
• A PRAM is composed of
  – P unmodifiable programs, each composed of optionally labeled instructions
  – a single shared memory composed of a sequence of words, each capable of containing an arbitrary integer
  – P accumulators, one associated with each program
  – a read-only input tape
  – a write-only output tape
More PRAM
• The PRAM is a synchronous, MIMD, shared-memory parallel computer.
• Different protocols can be used for reading and writing shared memory:
  – EREW (exclusive read, exclusive write)
  – CREW (concurrent read, exclusive write)
  – CRCW (concurrent read, concurrent write) -- requires an additional protocol for arbitrating write conflicts
• A PRAM can emulate a message-passing machine by logically dividing shared memory into private memories for the P processors.
Broadcasting on a PRAM
• "Broadcast" can be done on a CREW PRAM in O(1) time:
  – the broadcaster writes the value to shared memory
  – all processors read it from shared memory
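The two steps above can be simulated in a few lines of Python (a sketch only; the function name and the list-based representation of shared memory are ours, not part of the PRAM model):

```python
# Simulate a CREW PRAM broadcast: one exclusive write, then P concurrent reads.
def crew_broadcast(value, P):
    shared = [None]                           # one word of shared memory
    shared[0] = value                         # step 1: broadcaster writes (exclusive write)
    local = [shared[0] for _ in range(P)]     # step 2: all P processors read concurrently
    return local                              # every processor now holds the value

print(crew_broadcast(42, 4))  # → [42, 42, 42, 42]
```

The point of the model: both steps take one synchronous time unit each, independent of P, which is why the broadcast is O(1) on a CREW PRAM.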
LogP machine model
• A model of a distributed-memory multicomputer
• Developed by Culler, Karp, Patterson, et al.
• The authors tried to model prevailing parallel architectures (circa 1993).
• The machine model represents the prevalent MPP organization:
  – a machine constructed from at most a few thousand nodes
  – each node contains a powerful processor
  – each node contains substantial memory
  – the interconnection structure has limited bandwidth
  – the interconnection structure has significant latency
LogP parameters
• L: upper bound on latency incurred by sending a message from a source to a destination
• o: overhead, defined as the time the processor is engaged in sending or receiving a message, during which time it cannot do anything else
• g: gap, defined as the minimum time between consecutive message transmissions or receptions
• P: number of processor/memory modules
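As a small illustration of the parameters, the end-to-end cost of a single small message can be written out directly (a sketch; the helper name and the sample values are ours):

```python
# Under LogP, one small message costs the sender o cycles, spends L cycles
# in the network, and costs the receiver o cycles on arrival.
def point_to_point(L, o):
    return o + L + o  # send overhead + latency + receive overhead

print(point_to_point(L=6, o=2))  # → 10
```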
LogP Assumptions
• The network has finite capacity.
  – At most ceiling(L/g) messages can be in transit from any one processor to any other at one time.
• Communication is asynchronous.
  – The latency and order of messages are unpredictable.
• All messages are small.
• Context-switching overhead is 0 (not modeled).
• Multithreading (virtual processors) may be employed, but only up to a limit of L/g virtual processors.
LogP notes
• All parameters are measured in processor cycles.
• Local operations take one cycle.
• Messages are assumed to be small.
• LogP was particularly well suited to modeling the CM-5. It is not clear whether the same correlation holds for other machines.
LogP Analysis of PRAM Broadcasting Algorithm
• Algorithm:
  – Broadcaster sends the value to shared memory (we'll assume the value is in P0's memory)
  – P processors read from shared memory (the other processors receive messages from P0)
• Time for P0 to send P messages = o + g(P-1)
• Maximum time for the other processors to receive their messages = o + (P-2)g + o + L + o
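The two formulas above can be checked numerically (a sketch; the function names and the sample parameter values L=6, g=4, o=2 are ours):

```python
# The slide's LogP cost formulas for the flat (PRAM-style) broadcast.
def flat_send_time(P, o, g):
    # time the broadcaster P0 is busy injecting its messages
    return o + g * (P - 1)

def flat_last_receive_time(P, L, o, g):
    # time until the last processor has received its message
    return o + (P - 2) * g + o + L + o

print(flat_send_time(P=8, o=2, g=4))               # → 30
print(flat_last_receive_time(P=7, L=6, o=2, g=4))  # → 32, i.e. 5g + 3o + L
```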
Efficient Broadcasting in LogP Model
• The gap includes the overhead time, so overhead < gap.
[Figure: time-space diagram of the optimized broadcast on processors P0-P7, showing the send/receive overheads (o), gaps (g), and latencies (L) along each message path]
Mapping induced by LogP Broadcasting algorithm on 8 processors
[Figure: the broadcast tree induced on processors P0-P7, each labeled with its message arrival time (0, 10, 14, 18, 20, 22, 24, 24), shown alongside the time-space diagram of the schedule]
Analysis of LogP Broadcasting Algorithm to 7 Processors
• Time for the first processor (P5) to receive its message from P0 is L + 2o.
• Time for the last processor to receive its message is max{3g+L+2o, 2g+L+2o, g+2L+4o, 4o+2L, g+4o+2L} = max{3g+L+2o, g+2L+4o}.
• Compare to the LogP analysis of the PRAM broadcast, which is o + (P-2)g + o + L + o = 5g + 3o + L.
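The comparison can be evaluated with, say, L=6, g=4, o=2 (a sketch; the function names and parameter values are ours):

```python
# Critical-path bound for the LogP tree broadcast (from the slide's max)
def tree_broadcast_time(L, o, g):
    return max(3 * g + L + 2 * o, g + 2 * L + 4 * o)

# Cost of the flat PRAM-style broadcast for the same slide scenario
def flat_broadcast_time(L, o, g):
    return 5 * g + 3 * o + L

print(tree_broadcast_time(6, 2, 4))  # → max(22, 24) = 24
print(flat_broadcast_time(6, 2, 4))  # → 32
```

With these values the tree broadcast finishes at time 24, matching the final arrival times in the mapping figure, versus 32 for the flat broadcast.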
Scalable Performance
• LogP Broadcast utilizes tree structure to optimize broadcast time
• Tree depends on values of L,o,g,P
• Strategy is much more scalable (and ultimately more efficient) than PRAM Broadcast
Moral
• Analysis can be no better than the underlying model. The more accurate the model, the more accurate the analysis.
• (This is why we use TM to determine undecidability but RAM to determine complexity.)
Other Models used for Analysis
• BSP (Bulk Synchronous Parallel)
  – a slight precursor to, and competitor of, LogP
• PMH (Parallel Memory Hierarchy)
  – focuses on memory costs
BSP [Bulk Synchronous Parallel]
• BSP was proposed by Valiant.
• The BSP model consists of
  – P processors, each with local memory
  – a communication network for point-to-point message passing between processors
  – a mechanism for synchronizing all or some of the processors at defined intervals
BSP Programs
• BSP programs are composed of supersteps.
• In each superstep, processors execute L computational steps using locally stored data, and send and receive messages.
• Processors synchronize at the end of each superstep (at which time all messages have been received).
• BSP programs can be implemented through mechanisms like the Oxford BSP library (C routines for implementing BSP programs) and BSP-L.
[Figure: alternating supersteps separated by synchronization barriers]
BSP Parameters
• P: number of processors (each with memory)
• L: synchronization periodicity
• g: communication cost
• s: processor speed (measured in number of time steps/second)
• A processor sends at most h messages and receives at most h messages in a single superstep (such a communication pattern is called an h-relation).
BSP Notes
• A complete program is a set of supersteps.
• Communication startup is not modeled; g is for continuous-traffic conditions.
• Message size is one data word.
• More than one process or thread can be executed by a processor.
• It is generally assumed that computation and communication are not overlapped.
• Time for a superstep = (max number of local operations performed by any processor) + g × (max number of messages sent or received by any processor) + L
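The superstep cost formula can be written out directly (a sketch; the function name and the sample values are ours):

```python
# BSP cost of one superstep:
#   w: max local operations on any processor
#   h: max messages sent or received by any processor (the h-relation)
#   g: per-message communication cost, L: synchronization cost
def superstep_time(w, h, g, L):
    return w + g * h + L

print(superstep_time(w=100, h=4, g=5, L=20))  # → 140
```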
BSP Analysis of PRAM Broadcast
• Algorithm:
  – Broadcaster sends the value to shared memory (we'll assume the value is in P0's memory)
  – P processors read from shared memory (the other processors receive messages from P0)
• In the BSP model, processors may send or receive at most h messages in a single superstep, so broadcasting to more than h processors requires a tree structure.
  – If there were more than Lh processors, then a tree broadcast would require more than one superstep.
• How much time does it take for a P-processor broadcast?
BSP Analysis of PRAM Broadcast
• How much time does it take for a P-processor broadcast?
[Figure: an h-ary broadcast tree]
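One way to estimate the answer: with fan-out h, the number of processors that know the value grows by a factor of (h+1) each superstep, so roughly log base (h+1) of P supersteps are needed. A hedged Python sketch, assuming one local operation per superstep and the superstep cost formula from the BSP notes (the function name and sample values are ours):

```python
# Estimate the BSP cost of a tree broadcast to P processors with fan-out h.
def bsp_tree_broadcast_time(P, h, g, L, w=1):
    informed, rounds = 1, 0
    while informed < P:
        informed *= h + 1   # every informed processor sends to h new ones
        rounds += 1
    return rounds * (w + g * h + L)  # each superstep costs w + g*h + L

print(bsp_tree_broadcast_time(P=8, h=1, g=5, L=20))  # 3 supersteps → 78
print(bsp_tree_broadcast_time(P=8, h=7, g=5, L=20))  # 1 superstep  → 56
```

Note the trade-off the model exposes: a larger h means fewer supersteps (fewer L terms) but a more expensive h-relation in each one.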
PMH [Parallel Memory Hierarchy] Model
• PMH seeks to represent memory. The goal is to model algorithms so that good decisions can be made about where to allocate data during execution.
• The model represents the costs of interprocessor communication and of memory-hierarchy traffic (e.g., between main memory and disk, or between registers and cache).
• Proposed by Alpern, Carter, and Ferrante
PMH Model
• The computer is modeled as a tree of memory modules with the processors at the leaves.
• All data movement takes the form of block transfers between children and their parents.
• In the tree of modules:
  – all modules hold data
  – leaf modules also perform computation
  – the data in a module is partitioned into blocks
  – each module has 4 parameters
Un-parameterized PMH Models for a Cluster of Workstations
• Bandwidth from processor to disk > bandwidth from processor to network
• Bandwidth between 2 processors > bandwidth to disk
[Figure: two PMH trees for a cluster of workstations, one with disks attached below each node's main memory and one with a shared disk system on the network; both trees run from the network through main memories and caches down to ALU/registers]
PMH Module Parameters
• Blocksize s_m tells how many bytes there are per block of module m.
• Blockcount n_m tells how many blocks fit in m.
• Childcount c_m tells how many children m has.
• Transfer time t_m tells how many cycles it takes to transfer a block between m and its parent.
• The size of a "node" and the length of an "edge" in the PMH graph should correspond to the blocksize, blockcount, and transfer time.
• Generally, all modules at a given level of the tree have the same parameters.
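The four parameters can be collected into a simple record (illustrative only; the class and field names are ours, and the sample cache values are made up):

```python
from dataclasses import dataclass

# One PMH memory module, carrying the four per-module parameters.
@dataclass
class Module:
    blocksize: int      # s_m: bytes per block
    blockcount: int     # n_m: blocks that fit in the module
    childcount: int     # c_m: number of children in the tree
    transfer_time: int  # t_m: cycles to move one block to/from the parent

# e.g. a cache level: 64-byte blocks, 512 of them, 10-cycle block transfers
cache = Module(blocksize=64, blockcount=512, childcount=1, transfer_time=10)
print(cache.blocksize * cache.blockcount)  # module capacity in bytes → 32768
```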
Summary
• The goal of parallel computation models is to provide a realistic representation of the costs of programming.
• A model gives algorithm designers and programmers a measure of algorithm complexity that helps them decide what is "good" (i.e., performance-efficient).
• Next up: Mapping and Scheduling