parallel computing - slides 3


1

CS F422

Parallel Computing

BITS Pilani, Dubai Campus

2

Taxonomy - Overview
• Flynn’s & Johnson Classification

3

Parallel Computing Platforms
• Architecture of parallel computing systems
  • Organization of processors
  • Organization of memory
  • Communication between processors/memory
  • Disk I/O, etc.
• Processors fetch instructions and data from memory for processing.
• So we classify parallel computing systems based on how processors and memory systems are interconnected:
  • Shared Address Space
  • Message Passing Systems

4

Parallel Computing Platforms
• Don’t take the distinction between multiprocessor and multicomputer systems too seriously.
• Shared (Single) Address Space
  • All processors have access to the entire memory within the system (locations 0 to M-1)
  • Multiprocessor System (tightly coupled)
  • Classified by the time taken for a processor to access a shared memory location:
    • Uniform Memory Access (UMA): access time is the same for all processors
      • Memory is central; Symmetric Multiprocessing (SMP)
    • Non-Uniform Memory Access (NUMA): access time differs between processors
      • Local and global memory

5

Parallel Computing Platforms
• Shared (Single) Address Space
  • UMA/NUMA
  • How are the processors and memory interconnected?
  • How is shared data accessed in a mutually exclusive way?
  • How is data consistency maintained?
    • Coherence protocols
  • Scalability considerations
  • UMA (Shared-Memory Systems)
    • Processors can have caches for efficient access
  • NUMA (Distributed Shared Memory Systems)
    • Processors have caches and local memory
    • Access to another processor’s (global) memory is by hardware mechanisms.

6

Parallel Computing Platforms
• Message Passing Architecture/System
  • Each unit is a full-fledged system in itself: processor, memory, network interface card (NIC), disk, etc.
  • Loosely coupled system (multicomputer)
  • Distributed memory system
  • Memory access:
    • A processor can access its own memory
    • It cannot access memory in another processor
  • Messages are passed using send()/receive() (e.g. over TCP) to request data from another system (see the sketch below)
  • Any data consistency problems?
  • Scalability considerations
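The send()/receive() exchange can be sketched with MPI (one common message-passing library; the slide only names generic send/receive over e.g. TCP, so using MPI here is an illustrative assumption). Rank 1 cannot read rank 0's memory directly, so the value must be sent explicitly:

```c
/* Minimal message-passing sketch (illustrative, using MPI).
 * Each process owns its local memory; data held by another process
 * must be requested/sent explicitly.
 * Compile: mpicc msg.c -o msg    Run: mpirun -np 2 ./msg
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                       /* data in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* rank 1 must receive the data; it cannot load it directly */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```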

7

Interconnection Networks
• One of the most important factors distinguishing parallel systems.
• The interconnection network provides mechanisms for data transfer between processors and memory.
• Made of switches and links connecting nodes/switches to other nodes/switches.
• Two types: Static (Direct) and Dynamic (Indirect), based on how nodes and switches are integrated.

8

Interconnection Networks
• Static (Direct)
  • Point-to-point communication links between processing nodes; the topology is fixed (static)
  • Limited by the dimensionality of the network
  • Scalability
  • 1D, 2D (Mesh), Hypercube
• Dynamic (Indirect)
  • Nodes are connected to switches
  • Switches are also connected amongst themselves
  • Cascading of switches
  • Better scalability
  • MIN (Multistage Interconnection Network)

9

Interconnection Networks

10

Interconnection Networks

11

Interconnection Networks
• Switches
  • Connect input ports to output ports
  • Degree of the switch: number of ports on the switch
  • Internal buffering: memory to store packets if the output port is busy
  • Routing: nodes may not be connected directly, so switches direct packets from one node to another
  • Multicast: delivers the same output to multiple nodes
  • Complexity of a switch (a VLSI issue): how many ports can be provided

12

Interconnection Networks

13

I N/W: Network Topologies
• Bus Based, Crossbar, Multistage, Tree based
• Bus-Based
  • All processors access a common bus for exchanging data.
  • Cost (bus interfaces) increases linearly with the number of nodes.
  • The distance between any two nodes is O(1) in a bus. The bus also provides a convenient broadcast medium.
  • However, the bandwidth of the shared bus is a major bottleneck.

14

I N/W: Network Topologies

• Cache/local memory improves the performance of bus-based machines (Example 2.12)

15

I N/W: Network Topologies
• Cache/local memory improves the performance of bus-based machines (Example 2.12, TB)
• Number of processors = p
• Number of data items per processor = k
• Time per data access = t
• So, minimum execution time without cache: ta = p*k*t (all accesses serialized on the bus)
• With a cache, assume 50% of the k accesses per processor are to local (cached) data, with access time t (same as global memory).
• So, minimum execution time with cache: tb = 0.5*k*t + 0.5*k*p*t
• Improvement (speedup) = ta/tb = (p*k*t)/(0.5*k*t + 0.5*k*p*t)
• For large p, 0.5*k*t << 0.5*k*p*t, so the speedup approaches 2, i.e. execution time is roughly halved (a 50% reduction).
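A quick numeric check of the ratio derived above; a minimal sketch with arbitrary illustrative values for k and t:

```c
/* Speedup check for Example 2.12: ta = p*k*t, tb = 0.5*k*t + 0.5*k*p*t */
#include <stdio.h>

int main(void) {
    double k = 1000.0, t = 1.0;                    /* arbitrary illustrative values */
    int ps[] = {2, 4, 16, 64, 1024};
    for (int i = 0; i < 5; i++) {
        double p  = ps[i];
        double ta = p * k * t;                         /* no cache: all accesses on the bus */
        double tb = 0.5 * k * t + 0.5 * k * p * t;     /* 50% local (cached) accesses       */
        printf("p = %6.0f  speedup ta/tb = %.3f\n", p, ta / tb);
    }
    return 0;   /* the printed speedup approaches 2 as p grows */
}
```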

16

I N/W: Network Topologies
• 32-bit RISC processor, 150 MHz, 1 CPI (average), 15% loads, 10% stores
• 0.95 hit rate for reads/writes to the cache (write-through cache)
• Bus bandwidth = 2 GB/s
a) How many processors can be supported on the bus?
b) If there are no caches, how many processors can be supported?
Ans:
a) Compute the bandwidth needed by 1 processor (read bandwidth + write bandwidth).
   Bus transactions/s = 0.15 * 0.05 * 150 MHz + 0.10 * 150 MHz
   Memory bandwidth (1 proc) = x = 4 bytes * (transactions/s)
   Processors = bus bandwidth / x = 2 GB/s / x ≈ 30 processors
b) No caches (all reads/writes go to memory), i.e. 0.15 + 0.10 = 0.25 data accesses per instruction
   y = 4 bytes * 0.25 * 150 MHz
   Processors = 2 GB/s / y ≈ 13 processors
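A sketch that reproduces the arithmetic above; it assumes 4-byte transfers, data accesses only (no instruction fetches), write-through stores always reaching memory, and decimal units (1 GB/s = 10^9 B/s):

```c
/* Bus-bandwidth sizing sketch for the example above. */
#include <stdio.h>

int main(void) {
    double freq   = 150e6;       /* instructions per second (1 CPI) */
    double bus_bw = 2e9;         /* bus bandwidth in bytes per second */

    /* (a) with caches: only load misses (5% of loads) and all stores use the bus */
    double trans_a = (0.15 * 0.05 + 0.10) * freq;
    double bw_a    = 4.0 * trans_a;               /* bytes/s needed per processor */

    /* (b) no caches: every load and store uses the bus */
    double trans_b = (0.15 + 0.10) * freq;
    double bw_b    = 4.0 * trans_b;

    printf("(a) per-proc bandwidth = %.1f MB/s, processors = %.1f\n",
           bw_a / 1e6, bus_bw / bw_a);
    printf("(b) per-proc bandwidth = %.1f MB/s, processors = %.1f\n",
           bw_b / 1e6, bus_bw / bw_b);
    return 0;
}
```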

17

Network Topologies: Crossbar

• p processors × m memory banks: non-blocking
• Cost ∝ p², difficult to scale

18

Network Topologies: Crossbar

19

Network Topologies: Multistage
• Crossbars have excellent performance scalability but poor cost scalability.
• Buses have excellent cost scalability, but poor performance scalability.
• Multistage interconnects strike a compromise between these extremes.

20

Network Topologies: Multistage
• One of the most commonly used multistage interconnects is the Omega network.
• This network consists of log p stages, where p is the number of inputs/outputs.
• At each stage, input i is connected to output j by the perfect shuffle (a left rotation of the binary representation of i):
  j = 2i for 0 <= i <= p/2 - 1, and j = 2i + 1 - p for p/2 <= i <= p - 1.
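A minimal sketch of that stage connection (the perfect-shuffle mapping); the value of p and the printed table are just for illustration:

```c
/* Perfect-shuffle connection used between Omega-network stages:
 * output j is the left rotation of input i's log2(p)-bit label,
 * i.e. j = 2i          for 0   <= i <= p/2 - 1
 *      j = 2i + 1 - p  for p/2 <= i <= p - 1
 */
#include <stdio.h>

static int shuffle(int i, int p) {
    return (i < p / 2) ? 2 * i : 2 * i + 1 - p;
}

int main(void) {
    int p = 8;                               /* p must be a power of two */
    for (int i = 0; i < p; i++)
        printf("input %d -> output %d\n", i, shuffle(i, p));
    return 0;
}
```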

21

Network Topologies: Multistage
• Building block of the Omega network: (a) pass-through, (b) cross-over
• Complete network complexity: number of switches = (p/2) log p

22

Network Topologies: Multistage
• Routing
  • Let s be the binary representation of the source and d be that of the destination processor.
  • The data traverses the link to the first switching node. If the most significant bits of s and d are the same, the data is routed in pass-through mode by the switch; otherwise it switches to cross-over.
  • This process is repeated for each of the log p switching stages (a routing sketch follows below).
  • This is not a non-blocking network (as compared to the crossbar).
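A minimal sketch of the destination-tag routing rule just described; the source/destination values and the textual pass-through/cross-over output are chosen for illustration:

```c
/* Routing in an Omega network (sketch).
 * At each stage, compare the corresponding bit of the source s and the
 * destination d, starting from the most significant bit:
 * equal -> pass-through, different -> cross-over.
 */
#include <stdio.h>

static void route(unsigned s, unsigned d, int logp) {
    for (int k = logp - 1; k >= 0; k--) {              /* MSB first */
        unsigned sbit = (s >> k) & 1u, dbit = (d >> k) & 1u;
        printf("stage %d: %s\n", logp - 1 - k,
               (sbit == dbit) ? "pass-through" : "cross-over");
    }
}

int main(void) {
    route(2u /* source 010 */, 7u /* destination 111 */, 3);  /* p = 8 */
    return 0;
}
```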

23

Network Topologies: Multistage

24

Network Topologies: Completely Connected & Star Connected

(a) A completely-connected network of eight nodes; (b) a star connected network of nine nodes.

Every node is connected only to a common node at the center.

Distance between any pair of nodes is O(1). However, the central node becomes a bottleneck.

In this sense, star connected networks are static counterparts of buses.

Network Topologies: Star Connected Network

Network Topologies: Linear Arrays, Meshes, and k-d Meshes

In a linear array, each node has two neighbors, one to its left and one to its right. If the nodes at either end are connected, we refer to it as a 1-D torus or a ring.

A generalization to 2 dimensions has nodes with 4 neighbors, to the north, south, east, and west.

A further generalization to d dimensions has nodes with 2d neighbors.

A special case of a d-dimensional mesh is a hypercube. Here, d = log p, where p is the total number of nodes.

Network Topologies: Linear Arrays

Linear arrays: (a) with no wraparound links; (b) with wraparound link.

Network Topologies: Linear Arrays (Rings)

Network Topologies: Linear Arrays (Rings)

Network Topologies: Two- and Three Dimensional Meshes

Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and

(c) a 3-D mesh with no wraparound.

Network Topologies: 2D Mesh

Network Topologies: 2D Torus

Network Topologies: 3D Torus (Cray)

Network Topologies: 3D Torus (Cray)

Network Topologies: Hypercubes and their Construction

Construction of hypercubes from hypercubes of lower dimension.

Network Topologies: Properties of Hypercubes

The distance between any two nodes is at most log p.

Each node has log p neighbors.

The distance between two nodes is given by the number of bit positions at which the two nodes differ.
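This distance is just the Hamming distance between the node labels; a small sketch (the node labels below are chosen for illustration):

```c
/* Hypercube distance = Hamming distance between node labels:
 * XOR the two labels and count the 1 bits.
 */
#include <stdio.h>

static int hypercube_distance(unsigned a, unsigned b) {
    unsigned x = a ^ b;            /* bits where the labels differ */
    int count = 0;
    while (x) {
        count += x & 1u;
        x >>= 1;
    }
    return count;
}

int main(void) {
    /* e.g. in a 3-D hypercube (p = 8), nodes 000 and 101 are 2 hops apart */
    printf("distance(0, 5) = %d\n", hypercube_distance(0u, 5u));
    return 0;
}
```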

Network Topologies: Tree-Based Networks

Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.

Network Topologies: Tree Properties

The distance between any two nodes is no more than 2 log p.

Links higher up the tree potentially carry more traffic than those at the lower levels.

For this reason, a variant called a fat tree fattens the links as we go up the tree.

Trees can be laid out in 2D with no wire crossings. This is an attractive property of trees.

Network Topologies: Fat Trees

A fat tree network of 16 processing nodes.

Evaluating Static Interconnection Networks

Diameter: the distance between the farthest two nodes in the network. The diameter of a linear array is p - 1, that of a mesh is 2(√p - 1), that of a tree and of a hypercube is log p, and that of a completely connected network is O(1).

Bisection Width: the minimum number of wires that must be cut to divide the network into two equal parts. The bisection width of a linear array and of a tree is 1, that of a mesh is √p, that of a hypercube is p/2, and that of a completely connected network is p²/4.

Cost: the number of links or switches (whichever is asymptotically higher) is a meaningful measure of cost. However, a number of other factors, such as the ability to lay out the network, the length of wires, etc., also factor into the cost.

Evaluating Static Interconnection Networks

Network: Diameter; Bisection Width; Arc Connectivity; Cost (No. of links)

Completely-connected: 1; p²/4; p - 1; p(p - 1)/2

Star: 2; 1; 1; p - 1

Complete binary tree: 2 log((p + 1)/2); 1; 1; p - 1

Linear array: p - 1; 1; 1; p - 1

2-D mesh, no wraparound: 2(√p - 1); √p; 2; 2(p - √p)

2-D wraparound mesh: 2⌊√p/2⌋; 2√p; 4; 2p

Hypercube: log p; p/2; log p; (p log p)/2

Wraparound k-ary d-cube: d⌊k/2⌋; 2k^(d-1); 2d; dp

Evaluating Dynamic Interconnection Networks

Network: Diameter; Bisection Width; Arc Connectivity; Cost (No. of links)

Crossbar: 1; p; 1; p²

Omega Network: log p; p/2; 2; (p/2) log p

Dynamic Tree: 2 log p; 1; 2; p - 1

43

Cache Coherence in Multiprocessors
• Multiprocessor Systems
  • Shared memory systems
  • The interconnection network provides a way for any processor to access any shared memory location
• What are the implications w.r.t. multiple copies of data?
  • Multiple copies of data exist
  • Multiple processors read the same shared memory location (variable) and modify it as demanded by the program
• Teacher example (a sketch follows below):
  • Different teachers evaluate different questions/answers
  • These individual marks need to be accumulated into one shared variable, Total_Marks.
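A hedged pthreads sketch of the Total_Marks accumulation (the number of threads and the marks are made up for illustration); without the mutex the read-modify-write on the shared variable is a data race:

```c
/* Shared-variable accumulation sketch (hypothetical "Total_Marks" example).
 * Each thread ("teacher") adds its marks into one shared total.
 * Without the mutex, total_marks += marks is a racy read-modify-write.
 * Compile: gcc -pthread marks.c -o marks
 */
#include <pthread.h>
#include <stdio.h>

#define NTEACHERS 4

static int total_marks = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *evaluate(void *arg) {
    int marks = *(int *)arg;            /* marks awarded by this teacher */
    pthread_mutex_lock(&lock);          /* mutual exclusion on the shared variable */
    total_marks += marks;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t tid[NTEACHERS];
    int marks[NTEACHERS] = {8, 7, 9, 6};

    for (int i = 0; i < NTEACHERS; i++)
        pthread_create(&tid[i], NULL, evaluate, &marks[i]);
    for (int i = 0; i < NTEACHERS; i++)
        pthread_join(tid[i], NULL);

    printf("Total_Marks = %d\n", total_marks);   /* always 30 with the mutex */
    return 0;
}
```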

44

Cache Coherence in Multiprocessors

45

Cache Coherence in Multiprocessors

46

Cache Coherence in Multiprocessors
• Data Consistency
  • Consistency of data is a requirement: a contract between the program and the system.
  • Programs demand a consistent view of data.
  • But it may not happen (as in the example above), because different processors working with shared data may have stale (inconsistent) data.
  • Systems should provide a consistent view of data.
  • How? Cache coherence.
  • Cache coherence is a mechanism implemented at the hardware level by which consistency of data can be ensured,
  • i.e. programs see a consistent view of data.

47

Cache Coherence in Multiprocessors

48

3-state Invalidate Based CC-Protocol
• 3 states: Shared, Dirty and Invalid
• Processor actions: solid lines
• CC actions: dashed lines
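A hedged sketch of the three-state transitions; the event names and the exact transitions follow the common textbook MSI-style write-back invalidate protocol and may differ in detail from the diagram on the slide:

```c
/* Per-cache-line state transitions for a 3-state (MSI-style)
 * invalidate protocol: a simplified textbook sketch.
 */
#include <stdio.h>

typedef enum { INVALID, SHARED, DIRTY } line_state;

typedef enum {
    PR_READ,    /* processor read                       */
    PR_WRITE,   /* processor write                      */
    BUS_READ,   /* another cache reads the line (snooped)  */
    BUS_WRITE   /* another cache writes the line (snooped) */
} event;

static line_state next_state(line_state s, event e) {
    switch (s) {
    case INVALID:
        if (e == PR_READ)  return SHARED;    /* fetch line, others may share  */
        if (e == PR_WRITE) return DIRTY;     /* fetch and invalidate others   */
        return INVALID;
    case SHARED:
        if (e == PR_WRITE)  return DIRTY;    /* upgrade: invalidate others    */
        if (e == BUS_WRITE) return INVALID;  /* another writer invalidates us */
        return SHARED;
    case DIRTY:
        if (e == BUS_READ)  return SHARED;   /* flush data, keep a shared copy */
        if (e == BUS_WRITE) return INVALID;  /* flush data, give up the line   */
        return DIRTY;
    }
    return INVALID;
}

int main(void) {
    line_state s = INVALID;
    s = next_state(s, PR_READ);    /* INVALID -> SHARED */
    s = next_state(s, PR_WRITE);   /* SHARED  -> DIRTY  */
    s = next_state(s, BUS_READ);   /* DIRTY   -> SHARED */
    printf("final state = %d\n", s);
    return 0;
}
```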

49

3-state Invalidate Based CC-Protocol

50

3-state MSI write-back invalidation protocol

51

3-state MSI Transition Diagram

52

Implementations of CC (Snoopy CC)
• Snoopy CC Systems
  • Suitable for broadcast networks (bus or ring)
  • All operations are done locally on the cached data
  • States: Dirty, Shared
  • Multiple processors reading and updating
  • Invalidating shared copies generates traffic
  • The shared bus can become a bottleneck

53

Cache Coherence in Multiprocessors
• Trade-offs (update vs invalidate)
  • If a value is read only once by a processor, update may generate overheads (more memory/bus traffic)
  • Invalidate: only one memory transaction to invalidate the remote copy
  • Usually the invalidate protocol is preferred.
• False sharing
  • Cache allocation is in blocks (lines)
  • Different variables (not shared at the program level) can be allocated to the same cache line; the system does not know this.
  • Cache lines are invalidated and re-fetched from other processors even though no genuinely shared variable is being updated
  • Ping-pong effect of the cache line

54

Implementations of CC (Snoopy CC)

55

Invalidate vs update

56

Invalidate vs update

57

False Sharing (Demo).
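The demo itself is not in the transcript; a minimal pthreads sketch of false sharing (the 64-byte line size and iteration count are assumptions). Timing the two runs, e.g. with `time`, typically shows the padded version running noticeably faster:

```c
/* False-sharing sketch (hypothetical demo, pthreads).
 * Two threads update different counters. Unpadded, the counters sit in
 * the same cache line and the line ping-pongs between caches; padding
 * each counter to its own line removes the (false) sharing.
 * Compile: gcc -O2 -pthread false_sharing.c -o false_sharing
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define ITERS 100000000L
#define LINE  64                       /* assumed cache-line size in bytes */

struct padded { volatile long value; char pad[LINE - sizeof(long)]; };

static volatile long unpadded[2];      /* both counters share one cache line */
static struct padded padded_ctr[2];    /* one cache line per counter         */

static void *bump_unpadded(void *arg) {
    int idx = (int)(intptr_t)arg;
    for (long i = 0; i < ITERS; i++) unpadded[idx]++;
    return NULL;
}

static void *bump_padded(void *arg) {
    int idx = (int)(intptr_t)arg;
    for (long i = 0; i < ITERS; i++) padded_ctr[idx].value++;
    return NULL;
}

static void run(void *(*fn)(void *), const char *label) {
    pthread_t t[2];
    for (intptr_t i = 0; i < 2; i++) pthread_create(&t[i], NULL, fn, (void *)i);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    printf("%s done\n", label);
}

int main(void) {
    run(bump_unpadded, "unpadded (false sharing)");
    run(bump_padded,   "padded (no false sharing)");
    return 0;
}
```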

58

Implementations of CC (Dir based)
• Directory Based CC systems
  • Propagate cache coherence operations only to the relevant processors
  • Memory is augmented with bitmaps
    • Bitmaps: state and presence bits
    • Record which processors’ caches contain which cache line (a directory-entry sketch follows below)
  • Centralized Directory
  • Distributed Directory
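A hedged sketch of a directory entry with a state field and presence bits (the field and function names are illustrative, not from the slides):

```c
/* Directory-based coherence: one entry per memory block (sketch).
 * The presence bitmap records which processors' caches hold the block;
 * the state says whether some cache holds it dirty.
 */
#include <stdint.h>
#include <stdio.h>

#define NPROC 32                       /* up to 32 processors in this sketch */

typedef enum { UNCACHED, SHARED_CLEAN, EXCLUSIVE_DIRTY } dir_state;

typedef struct {
    uint32_t  presence;                /* bit i set => processor i caches the block */
    dir_state state;
} dir_entry;

/* Record that processor p has read the block (read miss handled). */
static void dir_read_miss(dir_entry *e, int p) {
    e->presence |= (1u << p);
    if (e->state == UNCACHED) e->state = SHARED_CLEAN;
}

/* Processor p writes: all other sharers must be invalidated, block becomes dirty. */
static void dir_write_miss(dir_entry *e, int p) {
    /* in a real system, invalidations are sent to every processor whose
       presence bit is set (other than p) before the write proceeds */
    e->presence = (1u << p);
    e->state = EXCLUSIVE_DIRTY;
}

int main(void) {
    dir_entry e = { 0, UNCACHED };
    dir_read_miss(&e, 3);
    dir_read_miss(&e, 7);
    dir_write_miss(&e, 7);
    printf("presence = 0x%08x, state = %d\n", e.presence, e.state);
    return 0;
}
```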

59

Implementations of CC (Dir based)

60

Directory based CC protocols

61

Distributed Directory

62

Eg. Read Miss to clean line (1)

63

Eg. Read Miss to clean line (2)

64

ccNUMA-Alternatives

65

SGI Altix (2010)

66

SGI Altix (2010)

67

Cache Coherence – Evolution (Intel)

68

Cache Coherence – Evolution (Intel)

69

Knights Corner Xeon Phi CC

70

Knights Corner Xeon Phi Ring Comm

71

Knights Landing Xeon Phi

72

PRAM Machine and Algos

73

PRAM Machine and Algos

74

PRAM Machine and Algos

75

PRAM Machine and Algos

A natural extension of the Random Access Machine (RAM) serial architecture is the Parallel Random Access Machine, or PRAM.

PRAMs consist of p processors and a global memory of unbounded size that is uniformly accessible to all processors.

Processors share a common clock but may execute different instructions in each cycle.

76

PRAM Machine and Algos

Depending on how simultaneous memory accesses are handled, PRAMs can be divided into four subclasses:
◦ Exclusive-read, exclusive-write (EREW) PRAM
◦ Concurrent-read, exclusive-write (CREW) PRAM
◦ Exclusive-read, concurrent-write (ERCW) PRAM
◦ Concurrent-read, concurrent-write (CRCW) PRAM
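As an illustration of what concurrent writes buy, here is a sequential simulation of the classic O(1)-step CRCW (common-write) maximum-finding algorithm with n² virtual processors (an illustrative sketch, not from the slides); each loop below stands in for one parallel step:

```c
/* Sequential simulation of the O(1)-time CRCW-PRAM maximum algorithm
 * using n^2 (virtual) processors. Concurrent writes in step 2 all write
 * the same value (common-write model), so the result is well defined.
 */
#include <stdio.h>

#define N 8

int main(void) {
    int a[N] = {3, 9, 1, 7, 4, 9, 2, 6};
    int is_max[N];

    /* step 1: every processor i sets is_max[i] = 1 (n concurrent writes) */
    for (int i = 0; i < N; i++) is_max[i] = 1;

    /* step 2: processor (i, j) writes is_max[i] = 0 if a[i] < a[j]
       (n^2 processors, concurrent writes of the common value 0) */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (a[i] < a[j]) is_max[i] = 0;

    /* step 3: any processor whose flag is still 1 reports a[i] as the max */
    for (int i = 0; i < N; i++)
        if (is_max[i]) { printf("max = %d\n", a[i]); break; }

    return 0;
}
```

On a real CRCW PRAM each of the three loops is a single parallel step; the sequential simulation here just makes the steps explicit.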

77

PRAM Machine and Algos

78

PRAM Machine and Algos

EREW is the weakest model.
A CREW/CRCW PRAM can execute any EREW algorithm in the same amount of time.
The CRCW-P (priority) PRAM is the strongest.
A CRCW-Common algorithm has the same complexity on a CRCW-A (arbitrary) or CRCW-P machine.
A CRCW-A algorithm has the same complexity on a CRCW-P machine.
A CRCW algorithm can take longer on an EREW machine.