parallel computing - slides 3
TRANSCRIPT
3
Parallel Computing Platforms
• Architecture of parallel computing systems
  • Organization of processors
  • Organization of memory
  • Communication between processors/memory
  • Disk I/O, etc.
• Processors fetch instructions and data from memory for processing.
• So we classify parallel computing systems based on how processors and memory systems are interconnected:
  • Shared Address Space
  • Message Passing Systems
4
Parallel Computing Platforms
• Don't take the distinction between multiprocessor and multicomputer systems too seriously.
• Shared (Single) Address Space
  • All processors have access to the entire memory within the system (locations 0 to M-1).
  • Multiprocessor system (tightly coupled)
  • Classified by the time a processor takes to access a shared memory location:
    • Uniform Memory Access (UMA): access time is the same for all processors
      • Memory is central; Symmetric Multiprocessing (SMP)
    • Non-Uniform Memory Access (NUMA): access time differs across processors
      • Local and global memory
5
Parallel Computing Platforms
• Shared (Single) Address Space
  • UMA/NUMA
  • How are the processors and memory interconnected?
  • How is shared data accessed in a mutually exclusive way?
  • How is data consistency maintained?
    • Coherence protocols
  • Scalability considerations
  • UMA (shared-memory systems)
    • Processors can have caches for efficient access
  • NUMA (distributed shared-memory systems)
    • Processors have caches and local memory
    • Access to another processor's (global) memory is by hardware mechanisms.
6
Parallel Computing Platforms
• Message Passing Architecture/System
  • Each unit is a full-fledged system in itself:
    • Processor, memory, network interface card (NIC), disk, etc.
  • Loosely coupled system (multicomputer)
  • Distributed memory system
  • Memory access:
    • A processor can access its own memory
    • It cannot access memory in another processor
    • It passes messages, using send()/receive() (e.g. over TCP), to request data from other systems
  • Any data consistency problems?
  • Scalability considerations.
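The send()/receive() exchange described above can be sketched with OS processes and an explicit channel. This is a minimal illustration using Python's multiprocessing pipes; the variable name and value are made up, and a real system would use sockets or MPI.

```python
from multiprocessing import Process, Pipe

def worker(conn):
    # Each node owns its memory; data moves only via explicit messages.
    local_data = {"x": 42}           # private memory of this node
    request = conn.recv()            # receive() a request for a variable
    conn.send(local_data[request])   # send() the value back
    conn.close()

if __name__ == "__main__":
    parent, child = Pipe()
    p = Process(target=worker, args=(child,))
    p.start()
    parent.send("x")                 # request variable "x" from the other node
    print(parent.recv())             # value arrives as a message, not a load
    p.join()
```

Note that the requesting process never dereferences the other node's memory; all sharing is by copying values through the channel.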
7
Interconnection Networks
• One of the most important factors distinguishing parallel systems.
• The interconnection network provides mechanisms for data transfer between processors and memory.
• Made of switches and links connecting nodes/switches to other nodes/switches.
• Two types, based on how nodes and switches are integrated: Static (Direct) and Dynamic (Indirect).
8
Interconnection Networks
• Static (Direct)
  • Point-to-point communication links between processing nodes; the topology is fixed (static)
  • Limited by the dimensionality of the network
  • Scalability
  • 1-D, 2-D (mesh), hypercube
• Dynamic (Indirect)
  • Nodes are connected to switches
  • Switches are also connected among themselves
  • Cascading of switches
  • Better scalability
  • MIN (Multistage Interconnection Network)
11
Interconnection Networks
• Switches
  • Connect input ports to output ports.
  • Degree of the switch: number of ports on the switch.
  • Internal buffering: memory to store packets if the output port is busy.
  • Routing: nodes may not be connected directly; the switch directs packets from one node to another.
  • Multicast: to deliver the same output to multiple nodes.
  • Complexity of a switch (a VLSI issue): how many ports can be provided.
13
I N/W: Network Topologies
• Bus-Based, Crossbar, Multistage, Tree-Based
• Bus-Based
  • All processors access a common bus for exchanging data.
  • Cost (bus interfaces) increases linearly with the number of nodes.
  • The distance between any two nodes is O(1) in a bus. The bus also provides a convenient broadcast medium.
  • However, the bandwidth of the shared bus is a major bottleneck.
14
I N/W: Network Topologies
• Cache/local memory improves the performance of bus-based machines – Example 2.12
15
I N/W: Network Topologies
• Cache/local memory improves the performance of bus-based machines – Example 2.12 (TB)
  • Number of processors = p
  • Number of data items per processor = k
  • Time per data access = t
  • So, execution time without cache: ta = p*k*t (all p*k accesses serialize on the bus)
  • With a cache, assume 50% (of k) are accesses to local data (access time = t, same as global memory).
  • So, execution time with cache: tb = 0.5*k*t + 0.5*k*p*t
  • Improvement = ta/tb = (p*k*t)/(0.5*k*t + 0.5*k*p*t)
  • For large p, 0.5*k*t << 0.5*k*p*t, so the improvement approaches (p*k*t)/(0.5*k*p*t) = 2,
    i.e., execution time is cut by roughly 50%.
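The arithmetic above can be checked directly. Only the formulas come from the slide; the parameter values below are illustrative.

```python
# Speedup from caching 50% of accesses locally (sketch of Example 2.12).
# Assumptions: p processors, k data items each, t seconds per access;
# remote accesses serialize on the shared bus, local ones do not.
def exec_time_no_cache(p, k, t):
    return p * k * t                  # all p*k accesses serialized on the bus

def exec_time_with_cache(p, k, t):
    local = 0.5 * k * t               # local half proceeds in parallel
    remote = 0.5 * k * p * t          # remote half still serializes
    return local + remote

p, k, t = 64, 1000, 1e-8              # illustrative values
ta = exec_time_no_cache(p, k, t)
tb = exec_time_with_cache(p, k, t)
print(ta / tb)                        # approaches 2 as p grows
```

The ratio ta/tb = p/(0.5 + 0.5p), which tends to 2 from below: halving the bus traffic at best halves the execution time.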
16
I N/W: Network Topologies
• Example: 32-bit RISC processor, 150 MHz, 1 CPI (avg), 15% loads, 10% stores,
  0.95 hit rate for reads/writes to cache (write-through cache), bus bandwidth = 2 GB/s.
  a) How many processors can be supported on the bus?
  b) If caches are not there, how many processors can be supported?
• Answer:
  a) Compute the bandwidth needed by 1 processor (read-miss B/W + write B/W; the
     write-through cache sends every store to memory):
     No. of transactions = (0.15*0.05 + 0.10) * 150 MHz = 16.125 M/s
     Mem B/W (1P) = x = 4 bytes * 16.125 M/s = 64.5 MB/s
     Processors = Bus B/W / Mem B/W (1P) = 2 GB/s / 64.5 MB/s ≈ 31 (about 30 processors)
  b) No caches, so all loads and stores (0.15 + 0.10 = 0.25 of instructions) go to memory:
     y = 4 * 0.25 * 150 MHz = 150 MB/s
     Processors = 2 GB/s / y ≈ 13 processors
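The two parts can be recomputed from the slide's assumptions. With these figures part (b) comes out to roughly 13 processors, not 16 as sometimes quoted.

```python
# Bus-bandwidth back-of-the-envelope, recomputed from the slide's numbers:
# 150 MHz, 1 CPI, 15% loads, 10% stores, 95% cache hit rate,
# write-through cache (every store reaches memory), 4-byte words.
CLOCK = 150e6    # instructions/s (1 CPI at 150 MHz)
BUS_BW = 2e9     # bytes/s

# a) With caches: only read misses (15% * 5%) and all writes use the bus.
trans_per_s = (0.15 * 0.05 + 0.10) * CLOCK
mem_bw_1p = 4 * trans_per_s              # 64.5 MB/s per processor
print(int(BUS_BW // mem_bw_1p))          # about 31 processors

# b) No caches: every load and store (25% of instructions) uses the bus.
mem_bw_1p_nc = 4 * 0.25 * CLOCK          # 150 MB/s per processor
print(int(BUS_BW // mem_bw_1p_nc))       # about 13 processors
```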
17
Network Topologies: Crossbar
• p processors × m memory banks: non-blocking.
• Cost ∝ p², so it is difficult to scale.
19
Network Topologies: Multistage
• Crossbars have excellent performance scalability but poor cost scalability.
• Buses have excellent cost scalability but poor performance scalability.
• Multistage interconnects strike a compromise between these extremes.
20
Network Topologies: Multistage
• One of the most commonly used multistage interconnects is the Omega network.
• This network consists of log p stages, where p is the number of inputs/outputs.
• At each stage, input i is connected to output j by the perfect-shuffle pattern:
  j = 2i for 0 ≤ i ≤ p/2 − 1, and j = 2i + 1 − p for p/2 ≤ i ≤ p − 1
  (i.e., j is a left rotation of the log p-bit binary representation of i).
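The perfect-shuffle connection can be sketched in a few lines. This follows the standard left-rotation description of the Omega network; the function name is illustrative.

```python
# Perfect-shuffle stage connection of an Omega network with p inputs.
# Input i connects to output j by left-rotating i's log2(p)-bit pattern.
def shuffle(i, p):
    if i < p // 2:
        return 2 * i           # low half: shift a 0 into the bottom bit
    return 2 * i + 1 - p       # high half: MSB wraps around to the LSB

p = 8
print([shuffle(i, p) for i in range(p)])
# [0, 2, 4, 6, 1, 3, 5, 7]
```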
21
Network Topologies: Multistage
• Building blocks of the Omega network: a) pass-through, b) cross-over.
• Complete network: complexity = number of switches = (p/2) log p.
22
Network Topologies: Multistage
• Routing
  • Let s be the binary representation of the source and d that of the destination processor.
  • The data traverses the link to the first switching node. If the most significant bits of s and d are the same, the switch routes the data in pass-through mode; otherwise, it switches to crossover.
  • This process is repeated for each of the log p switching stages.
  • The Omega network is therefore a blocking network (unlike the crossbar).
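The routing rule above can be sketched as follows. This is a simplified model that only computes each stage's switch setting from the bits of s and d, not the physical path through the network.

```python
# Bit-by-bit Omega routing: at each of the log2(p) stages, compare the
# current bit of source s and destination d (most significant bit first);
# equal bits -> pass-through, different bits -> crossover.
def omega_route(s, d, p):
    stages = p.bit_length() - 1            # log2(p) stages for p a power of 2
    settings = []
    for k in range(stages - 1, -1, -1):    # walk bits from MSB down to LSB
        s_bit = (s >> k) & 1
        d_bit = (d >> k) & 1
        settings.append("pass" if s_bit == d_bit else "cross")
    return settings

# Route from node 010 to node 111 in an 8-input Omega network:
print(omega_route(0b010, 0b111, 8))
# ['cross', 'pass', 'cross']
```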
24
Network Topologies: Completely Connected & Star Connected
• (a) A completely connected network of eight nodes; (b) a star-connected network of nine nodes.
• In a star, every node is connected only to a common node at the center.
• The distance between any pair of nodes is O(1). However, the central node becomes a bottleneck.
• In this sense, star-connected networks are the static counterparts of buses.
Network Topologies: Star Connected Network
Network Topologies: Linear Arrays, Meshes, and k-d Meshes
In a linear array, each node has two neighbors, one to its left and one to its right. If the nodes at either end are connected, we refer to it as a 1-D torus or a ring.
A generalization to 2 dimensions has nodes with 4 neighbors, to the north, south, east, and west.
A further generalization to d dimensions has nodes with 2d neighbors (two per dimension).
A special case of a d-dimensional mesh is a hypercube. Here, d = log p, where p is the total number of nodes.
Network Topologies: Linear Arrays
Linear arrays: (a) with no wraparound links; (b) with wraparound link.
Network Topologies: Two- and Three Dimensional Meshes
Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and
(c) a 3-D mesh with no wraparound.
Network Topologies: Hypercubes and their Construction
Construction of hypercubes from hypercubes of lower dimension.
Network Topologies: Properties of Hypercubes
• The distance between any two nodes is at most log p.
• Each node has log p neighbors.
• The distance between two nodes is the number of bit positions at which their labels differ (the Hamming distance).
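The bit-position rule can be computed directly; the node labels below are illustrative.

```python
# Hypercube distance = Hamming distance between the binary node labels.
def hypercube_distance(a, b):
    return bin(a ^ b).count("1")   # XOR exposes the differing bit positions

# In a 3-D hypercube (p = 8), node 000 to node 111 takes 3 hops,
# and no pair of nodes is farther apart than log2(8) = 3.
print(hypercube_distance(0b000, 0b111))   # 3
print(hypercube_distance(0b101, 0b100))   # 1
```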
Network Topologies: Tree-Based Networks
Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.
Network Topologies: Tree Properties
• The distance between any two nodes is no more than 2 log p.
• Links higher up the tree potentially carry more traffic than those at the lower levels.
• For this reason, a variant called a fat tree fattens the links as we go up the tree.
• Trees can be laid out in 2D with no wire crossings. This is an attractive property of trees.
Evaluating Static Interconnection Networks
• Diameter: the distance between the farthest two nodes in the network. The diameter of a linear array is p − 1, that of a mesh is 2(√p − 1), that of a tree and a hypercube is log p, and that of a completely connected network is O(1).
• Bisection Width: the minimum number of wires you must cut to divide the network into two equal parts. The bisection width of a linear array and a tree is 1, that of a mesh is √p, that of a hypercube is p/2, and that of a completely connected network is p²/4.
• Cost: the number of links or switches (whichever is asymptotically higher) is a meaningful measure of the cost. However, a number of other factors, such as the ability to lay out the network, the length of wires, etc., also factor into the cost.
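The diameter and bisection-width formulas above can be collected into a small sketch, restricted to the four topologies named; p is assumed a perfect square (mesh) or a power of two (hypercube) where the formula needs it.

```python
import math

# Diameter and bisection width as functions of the node count p,
# per the formulas stated on the slide.
def diameter(network, p):
    return {"linear": p - 1,
            "mesh": 2 * (math.isqrt(p) - 1),     # 2(sqrt(p) - 1)
            "hypercube": int(math.log2(p)),      # log p
            "complete": 1}[network]

def bisection_width(network, p):
    return {"linear": 1,
            "mesh": math.isqrt(p),               # sqrt(p)
            "hypercube": p // 2,                 # p/2
            "complete": p * p // 4}[network]     # p^2/4

for net in ("linear", "mesh", "hypercube", "complete"):
    print(net, diameter(net, 16), bisection_width(net, 16))
```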
Evaluating Static Interconnection Networks

Network                      Diameter          Bisection Width   Arc Connectivity   Cost (No. of links)
Completely-connected         1                 p²/4              p − 1              p(p − 1)/2
Star                         2                 1                 1                  p − 1
Complete binary tree         2 log((p+1)/2)    1                 1                  p − 1
Linear array                 p − 1             1                 1                  p − 1
2-D mesh, no wraparound      2(√p − 1)         √p                2                  2(p − √p)
2-D wraparound mesh          2⌊√p/2⌋           2√p               4                  2p
Hypercube                    log p             p/2               log p              (p log p)/2
Wraparound k-ary d-cube      d⌊k/2⌋            2k^(d−1)          2d                 dp
Evaluating Dynamic Interconnection Networks

Network          Diameter   Bisection Width   Arc Connectivity   Cost (No. of links)
Crossbar         1          p                 1                  p²
Omega Network    log p      p/2               2                  (p/2) log p
Dynamic Tree     2 log p    1                 2                  p − 1
43
Cache Coherence in Multiprocessors
• Multiprocessor Systems
  • Shared-memory systems
  • The interconnection network provides a way for any processor to access any shared memory location.
• What are the implications of multiple copies of data?
  • Multiple copies of data exist.
  • Multiple processors read the same shared memory location (variable) and modify it as demanded by the program.
• Teacher example:
  • Different teachers evaluate different questions/answers.
  • Their individual marks need to be accumulated into one shared variable, Total_Marks.
46
Cache Coherence in Multiprocessors
• Data Consistency
  • Consistency of data is a requirement: a contract between the program and the system.
  • Programs demand a consistent view of data.
  • But it may not happen (as in the example above), because different processors working with shared data may have stale (inconsistent) data.
  • Systems should provide a consistent view of data.
  • How? Cache coherence.
  • Cache coherence is a mechanism, implemented at the hardware level, by which consistency of data can be ensured
    • i.e., programs see a consistent view of data.
48
3-State Invalidate-Based CC Protocol
• 3 states: Shared, Dirty, and Invalid
• Processor actions: solid lines
• CC actions: dashed lines
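The three states and their transitions can be sketched as a table-driven state machine. This is a simplified single-cache-line model; the event names are made up for illustration, and real protocols also handle write-backs and bus acknowledgements.

```python
# Sketch of a 3-state invalidate protocol (Shared / Dirty / Invalid)
# for one cache line, seen from one cache. Events are local reads/writes
# and remote reads/writes observed (snooped) on the interconnect.
TRANSITIONS = {
    # (state, event) -> next state
    ("invalid", "local_read"):   "shared",   # fetch a clean copy
    ("invalid", "local_write"):  "dirty",    # fetch and take ownership
    ("shared",  "local_write"):  "dirty",    # invalidate other copies
    ("shared",  "remote_write"): "invalid",  # our copy is now stale
    ("dirty",   "remote_read"):  "shared",   # write back, then share
    ("dirty",   "remote_write"): "invalid",  # yield ownership
}

def step(state, event):
    return TRANSITIONS.get((state, event), state)  # otherwise unchanged

s = "invalid"
for e in ["local_read", "remote_write", "local_write", "remote_read"]:
    s = step(s, e)
    print(e, "->", s)
```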
52
Implementations of CC (Snoopy CC)
• Snoopy CC systems
  • Suitable for broadcast networks (bus or ring)
  • All operations are done locally on the cached data
  • States: Dirty, Shared
  • Multiple processors reading and updating
    • Invalidating shared copies generates traffic
    • The shared bus can become a bottleneck
53
Cache Coherence in Multiprocessors
• Trade-offs (update vs. invalidate)
  • If a value is read only once by a processor, update may generate overheads – more memory-bus traffic.
  • Invalidate – only one memory transaction to invalidate the other copies.
  • Usually the invalidate protocol is preferred.
• False sharing
  • Cache allocation is in blocks (lines).
  • Different variables (not shared at the program level) could be allocated to the same cache line; the system does not know this.
  • Cache lines are fetched from the other processors even if no shared variable was updated.
  • Ping-pong effect on the cache line.
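False sharing can be illustrated with the line-index arithmetic. The line size and addresses below are illustrative, not from the slide; in real code the fix is padding or alignment, not moving addresses by hand.

```python
# Two logically independent variables can land on the same cache line.
LINE_SIZE = 64   # bytes per cache line (a typical value)

def cache_line(addr):
    return addr // LINE_SIZE   # which line an address maps to

# Two per-thread counters placed 8 bytes apart share a line, so writes
# by different processors ping-pong the line between their caches:
a_addr, b_addr = 0x1000, 0x1008
print(cache_line(a_addr) == cache_line(b_addr))   # True -> false sharing

# Padding each variable out to a full line keeps them apart:
a_addr, b_addr = 0x1000, 0x1000 + LINE_SIZE
print(cache_line(a_addr) == cache_line(b_addr))   # False -> no ping-pong
```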
58
Implementations of CC (Directory-Based)
• Directory-based CC systems
  • Propagate cache coherence operations only to the relevant processors.
  • Memory is augmented with bitmaps.
    • Bitmaps: state and presence bits
    • They record which processor holds which cache line.
  • Centralized directory
  • Distributed directory
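The presence-bit idea can be sketched as follows. The class and method names are made up for illustration; the point is that a write consults the bitmap and sends invalidations only to the processors whose bits are set, instead of broadcasting.

```python
# Directory-based coherence sketch: each memory block carries a presence
# bitmap (one bit per processor) plus a state bit.
class DirectoryEntry:
    def __init__(self, nprocs):
        self.presence = [False] * nprocs   # which caches hold the block
        self.dirty = False

    def read(self, proc):
        self.presence[proc] = True         # record the new sharer

    def write(self, proc):
        # Invalidations go only to processors whose presence bit is set.
        invalidated = [p for p, bit in enumerate(self.presence)
                       if bit and p != proc]
        self.presence = [p == proc for p in range(len(self.presence))]
        self.dirty = True                  # writer now holds the only copy
        return invalidated

entry = DirectoryEntry(4)
entry.read(0); entry.read(2)
print(entry.write(1))   # invalidates [0, 2], not the whole machine
```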
75
PRAM Machine and Algorithms
• A natural extension of the Random Access Machine (RAM) serial architecture is the Parallel Random Access Machine, or PRAM.
• PRAMs consist of p processors and a global memory of unbounded size that is uniformly accessible to all processors.
• Processors share a common clock but may execute different instructions in each cycle.
76
PRAM Machine and Algorithms
• Depending on how simultaneous memory accesses are handled, PRAMs can be divided into four subclasses:
  ◦ Exclusive-read, exclusive-write (EREW) PRAM
  ◦ Concurrent-read, exclusive-write (CREW) PRAM
  ◦ Exclusive-read, concurrent-write (ERCW) PRAM
  ◦ Concurrent-read, concurrent-write (CRCW) PRAM
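A CRCW PRAM additionally needs a rule for resolving simultaneous writes to one cell. The policy names below follow the usual PRAM literature; the function itself is an illustrative sketch, not part of the slide.

```python
# Resolving concurrent writes to a single global memory cell in a CRCW
# PRAM. Several processors write in the same cycle; the subclass's
# policy decides which value survives.
def crcw_write(writes, policy):
    # writes: list of (processor_id, value) pairs hitting one cell
    if policy == "priority":        # lowest-numbered processor wins
        return min(writes)[1]
    if policy == "arbitrary":       # any one write succeeds
        return writes[0][1]
    if policy == "sum":             # combining: sum of all written values
        return sum(v for _, v in writes)
    raise ValueError(policy)

writes = [(3, 30), (1, 10), (2, 20)]
print(crcw_write(writes, "priority"))   # 10
print(crcw_write(writes, "sum"))        # 60
```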