Parallel Computing Platforms


Page 1: Parallel Computing Platforms

Parallel Computing Platforms

• Motivation: High Performance Computing

• Dichotomy of Parallel Computing Platforms

• Communication Model of Parallel Platforms

• Physical Organization of Parallel Platforms

• Communication Costs in Parallel Machines

Page 2: Parallel Computing Platforms

High Performance Computing

• The computing power required to solve computationally intensive and/or data-intensive problems in science, engineering and other emerging disciplines effectively

• Provided using
– Parallel computers and
– Parallel programming techniques

• Is this compute power not available otherwise?

Page 3: Parallel Computing Platforms

Elements of a Parallel Computer

• Hardware
– Multiple Processors
– Multiple Memories
– Interconnection Network

• System Software
– Parallel Operating System
– Programming Constructs to Express/Orchestrate Concurrency

• Application Software
– Parallel Algorithms

• Goal: utilize the Hardware, System, and Application Software to either
– Achieve speedup: Tp = Ts/p, or
– Solve problems requiring a large amount of memory.
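
For example (illustrative numbers, not from the slides): with Ts = 100 s and p = 8 processors, the ideal parallel time is Tp = 100/8 = 12.5 s, i.e. a speedup of 8; in practice, communication and idling keep Tp above this bound.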

Page 4: Parallel Computing Platforms

Dichotomy of Parallel Computing Platforms

• Logical Organization
– The user's view of the machine, as presented by its system software

• Physical Organization
– The actual hardware architecture

• Physical Architecture is to a large extent independent of the Logical Architecture

Page 5: Parallel Computing Platforms

Logical Organization

• An explicitly parallel program must specify concurrency and interaction between concurrent tasks

• That is, there are two critical components of parallel computing, logically:
1. Control structure: how to express parallel tasks
2. Communication model: mechanism for specifying interaction

• Parallelism can be expressed at various levels of granularity - from instruction level to processes.

Page 6: Parallel Computing Platforms

Control Structure of Parallel Platforms

• Processing units in parallel computers either
– operate under the centralized control of a single control unit, or
– work independently.

• If there is a single control unit that dispatches the same instruction to various processors (that work on different data), the model is referred to as single instruction stream, multiple data stream (SIMD).

• If each processor has its own control unit, each processor can execute different instructions on different data items. This model is called multiple instruction stream, multiple data stream (MIMD).

Page 7: Parallel Computing Platforms

SIMD and MIMD Processors

A typical SIMD architecture (a) and a typical MIMD architecture (b).

Page 8: Parallel Computing Platforms

SIMD Processors

[Figure: a single instruction stream drives processors A, B and C; each processor has its own data input stream and data output stream.]

Page 9: Parallel Computing Platforms

MIMD Processors

[Figure: processors A, B and C each have their own instruction stream as well as their own data input stream and data output stream.]

Page 10: Parallel Computing Platforms

SIMD & MIMD Processors (cont’d)

• SIMD relies on the regular structure of computations (such as those in image processing). SIMD computers:
– Require less hardware than MIMD computers (single control unit)
– Require less memory
– Are specialized: not suited to all applications

• In contrast to SIMD processors, MIMD processors can execute different programs on different processors.
– Single program, multiple data streams (SPMD) executes the same program on different processors (see the sketch below).
– SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support.
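
To make the SPMD style concrete, here is a minimal sketch (illustrative MPI code in C, not taken from the course material): every process runs the same program and specializes its behaviour by testing its rank.

/* Minimal SPMD sketch: all processes execute this same program;
 * the rank determines what each one does. Run with e.g. mpirun -np 4. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes in all? */

    if (rank == 0)
        printf("Process 0: coordinating %d processes\n", size);
    else
        printf("Process %d: working on my share of the data\n", rank);

    MPI_Finalize();
    return 0;
}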

Page 11: Parallel Computing Platforms

Logical Organization: Communication Model

• There are two primary forms of data exchange between parallel tasks:
– Accessing a shared data space, and
– Exchanging messages.

• Platforms that provide a shared data space are called shared-address-space machines or multiprocessors.

• Platforms that support messaging are also called message passing platforms or multicomputers.

Page 12: Parallel Computing Platforms

Shared-Address-Space Platforms

• Part (or all) of the memory is accessible to all processors.

• Processors interact by modifying data objects stored in this shared-address-space.

• If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise it is a non-uniform memory access (NUMA) machine.

Page 13: Parallel Computing Platforms

NUMA and UMA Shared-Address-Space Platforms

Typical shared-address-space architectures: (a) uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.

Page 14: Parallel Computing Platforms

NUMA and UMA Shared-Address-Space Platforms

• The distinction between NUMA and UMA platforms is important from the point of view of algorithm design.
– NUMA machines require locality from the underlying algorithms for performance.

• Programming these platforms is easier since reads and writes are implicitly visible to other processors.
• However, read and write accesses to shared data must be coordinated.
• Caches in such machines require coordinated access to multiple copies.
– This leads to the cache coherence problem.

Page 15: Parallel Computing Platforms

Shared-Address-Space vs. Shared-Memory Machines

• Shared-address-space is a programming abstraction

• Shared memory is a physical machine attribute

• It is possible to provide a shared address space using physically distributed memory
– Distributed shared memory machines

• Shared-address-space machines are commonly programmed using Pthreads and OpenMP (see the sketch below)
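
As an illustration of shared-address-space programming (a minimal OpenMP sketch in C, not from the course material; assume it is compiled with -fopenmp): the loop iterations are divided among threads, and the reduction clause coordinates the threads' updates to the shared variable sum.

/* Minimal OpenMP sketch: threads share the address space; the reduction
 * clause coordinates concurrent updates to the shared variable sum. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += 1.0 / (i + 1);   /* each thread accumulates a partial sum */

    printf("sum = %f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}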

Page 16: Parallel Computing Platforms

Message-Passing Platforms

• These platforms comprise a set of processors, each with its own (exclusive) memory

• Instances of such a view come naturally from clustered workstations and non-shared-address-space multicomputers.

• These platforms are programmed using (variants of) send and receive primitives.

• Libraries such as MPI and PVM provide such primitives.
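
A minimal send/receive sketch (illustrative MPI code in C, not from the slides; run with at least two processes): process 0 sends an array of four integers to process 1.

/* Minimal message-passing sketch using send/receive primitives. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, data[4] = {1, 2, 3, 4};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* dest = 1, tag = 0   */
    } else if (rank == 1) {
        MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                        /* source = 0, tag = 0 */
        printf("Process 1 received %d %d %d %d\n",
               data[0], data[1], data[2], data[3]);
    }

    MPI_Finalize();
    return 0;
}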

Page 17: Parallel Computing Platforms

Physical Organization: Interconnection Networks (ICNs)

• Provide processor-to-processor and processor-to-memory connections

• Networks are classified as:
– Static
– Dynamic

• Static
– Consist of a number of point-to-point links (direct networks)
– Historically used to link processors to processors (distributed-memory systems)

• Dynamic
– The network consists of switching elements to which the various processors attach (indirect networks)
– Historically used to link processors to memory (shared-memory systems)

Page 18: Parallel Computing Platforms

Static and Dynamic Interconnection Networks

Classification of interconnection networks: (a) a static network; and (b) a dynamic network.

Page 19: Parallel Computing Platforms

Network Topologies

Interconnection Networks
• Static
– 1-D
– 2-D
– Hypercube (HC)
• Dynamic
– Bus-based: single or multiple buses
– Switch-based: single-stage (SS), multistage (MS), or crossbar

Page 20: Parallel Computing Platforms

Network Topologies: Static ICNs

• Static (fixed) interconnection networks are characterized by having fixed paths, unidirectional or bi-directional, between processors.

• Completely connected networks (CCNs): number of links O(N²), delay complexity O(1).

• Limited connection networks (LCNs)
– Linear arrays
– Ring (loop) networks
– Two-dimensional arrays
– Tree networks
– Cube networks

Page 21: Parallel Computing Platforms

Network Topologies: Dynamic ICNs

• A variety of network topologies have been proposed and implemented:
– Bus-based
– Crossbar
– Multistage
– etc.

• These topologies trade off performance for cost.
• Commercial machines often implement hybrids of multiple topologies for reasons of packaging, cost, and available components.

Page 22: Parallel Computing Platforms

Network Topologies: Buses

• Shared medium.
• Ideal for information broadcast
– Distance between any two nodes is a constant

• Bandwidth of the shared bus is a major bottleneck.

• Local memories can improve performance

• Scalable in terms of cost, unscalable in terms of performance.

Page 23: Parallel Computing Platforms

Network Topologies: Crossbars

A completely non-blocking crossbar network connecting p processors to b memory banks.

• Uses a p×b grid of switches to connect p inputs to b outputs in a non-blocking manner.

• The cost of a crossbar of p processors grows as Ω(p²).

• Scalable in terms of performance, unscalable in terms of cost.
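
For example (illustrative numbers): connecting p = 64 processors to b = 64 memory banks requires 64 × 64 = 4096 switching elements, which is why crossbar cost does not scale to large p.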

Page 24: Parallel Computing Platforms

Network Topologies: Multistage Networks

The schematic of a typical multistage interconnection network.

• Strike a compromise between the cost and performance scalability of the Bus and Crossbar networks.

Page 25: Parallel Computing Platforms

Network Topologies: Multistage Omega Network

• One of the most commonly used multistage interconnects is the Omega network.

• This network consists of log p stages, where p is the number of inputs/outputs.

• At each stage, input i is connected to output j if:
– j = 2i, for 0 ≤ i ≤ p/2 − 1
– j = 2i + 1 − p, for p/2 ≤ i ≤ p − 1
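
A small sketch (illustrative C code; the helper name shuffle is my own) that evaluates this mapping for p = 8 and reproduces the perfect shuffle shown on the next slide:

/* Perfect-shuffle mapping used at each stage of an Omega network:
 * j = 2i for i < p/2, and j = 2i + 1 - p otherwise (p a power of 2). */
#include <stdio.h>

unsigned shuffle(unsigned i, unsigned p) {
    return (i < p / 2) ? 2 * i : 2 * i + 1 - p;
}

int main(void) {
    unsigned p = 8;
    for (unsigned i = 0; i < p; i++)
        printf("input %u -> output %u\n", i, shuffle(i, p));
    return 0;
}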

Page 26: Parallel Computing Platforms

Network Topologies: Multistage Omega Network

Each stage of the Omega network implements a perfect shuffle as follows:

A perfect shuffle interconnection for eight inputs and outputs.

Page 27: Parallel Computing Platforms

Network Topologies: Completely Connected and Star Networks

(a) A completely-connected network of eight nodes; (b) a star connected network of nine nodes.

• A completely connected network is the static counterpart of a crossbar network.
– Performance scales very well, but the hardware complexity is not realizable for large values of p.

• A star network is the static counterpart of a bus network.
– The central processor is the bottleneck.

Page 28: Parallel Computing Platforms

Network Topologies: Linear Arrays, Meshes, and k-d Meshes

• In a linear array, each node has two neighbors, one to its left and one to its right.

• If the nodes at either end are connected, we refer to it as a 1-D torus or a ring.

• A generalization to 2 dimensions has nodes with 4 neighbors, to the north, south, east, and west.

• A further generalization to d dimensions has nodes with 2d neighbors.

• A special case of a d-dimensional mesh is a hypercube. Here, d = log p, where p is the total number of nodes.

Page 29: Parallel Computing Platforms

Network Topologies: Linear Arrays and Meshes

Two- and three-dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound links (2-D torus); and (c) 3-D mesh with no wraparound.

Page 30: Parallel Computing Platforms

Network Topologies: Hypercubes and their Construction

Construction of hypercubes from hypercubes of lower dimension.

Page 31: Parallel Computing Platforms

Network Topologies: Properties of Hypercubes

• The distance between any two nodes is at most log p.

• Each node has log p neighbors.

• The distance between two nodes is given by the number of bit positions at which the two nodes differ.
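
A small sketch (illustrative C code; the helper name hypercube_distance is my own) that computes this distance as the Hamming distance between the two node labels:

/* Distance between hypercube nodes = number of differing bit positions. */
#include <stdio.h>

int hypercube_distance(unsigned a, unsigned b) {
    unsigned diff = a ^ b;   /* bits set where the labels differ */
    int dist = 0;
    while (diff) {
        dist += diff & 1u;
        diff >>= 1;
    }
    return dist;
}

int main(void) {
    /* In a 3-D hypercube (p = 8), nodes 011 and 101 differ in two bits. */
    printf("distance(3, 5) = %d\n", hypercube_distance(3u, 5u));
    return 0;
}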

Page 32: Parallel Computing Platforms

Network Topologies: Tree-Based Networks

Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.

Page 33: Parallel Computing Platforms

Evaluation Metrics for ICNs

• The following evaluation metrics are used to characterize the cost and performance of static ICNs:

• Diameter
– The maximum distance between any two nodes
– The smaller, the better

• Connectivity
– The minimum number of arcs that must be removed to break the network into two disconnected networks
– The larger, the better

• Bisection width
– The minimum number of arcs that must be removed to partition the network into two equal halves
– The larger, the better

• Cost
– The number of links in the network
– The smaller, the better
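
As a quick worked example (illustrative, not on the slide): a ring of p nodes has diameter ⌊p/2⌋, bisection width 2, arc connectivity 2, and cost p links.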

Page 34: Parallel Computing Platforms

Evaluating Static Interconnection Networks

Network                     Diameter           Bisection Width   Arc Connectivity   Cost (No. of links)
Completely-connected        1                  p²/4              p − 1              p(p − 1)/2
Star                        2                  1                 1                  p − 1
Complete binary tree        2 log((p + 1)/2)   1                 1                  p − 1
Linear array                p − 1              1                 1                  p − 1
2-D mesh, no wraparound     2(√p − 1)          √p                2                  2(p − √p)
2-D wraparound mesh         2⌊√p/2⌋            2√p               4                  2p
Hypercube                   log p              p/2               log p              (p log p)/2
Wraparound k-ary d-cube     d⌊k/2⌋             2k^(d−1)          2d                 dp

(log is to base 2.)

Page 35: Parallel Computing Platforms

Evaluating Dynamic Interconnection Networks

Network          Diameter   Bisection Width   Arc Connectivity   Cost (No. of links)
Crossbar         1          p                 1                  p²
Omega Network    log p      p/2               2                  p/2
Dynamic Tree     2 log p    1                 2                  p − 1

Page 36: Parallel Computing Platforms

Communication Costs in Parallel Machines

• Along with idling and contention, communication is a major overhead in parallel programs.

• Communication cost depends on many factors, including:
– Network topology
– Data handling
– Routing, etc.

Page 37: Parallel Computing Platforms

Message Passing Costs in Parallel Computers

• The communication cost of a data-transfer operation depends on:
– Start-up time (ts)
• Time to add headers/trailers, perform error correction, execute the routing algorithm, and establish the connection between source and destination
– Per-hop time (th)
• Time to travel between two directly connected nodes (node latency)
– Per-word transfer time (tw)
• 1/channel-width

Page 38: Parallel Computing Platforms

Store-and-Forward Routing

• A message traversing multiple hops is completely received at an intermediate hop before being forwarded to the next hop.

• The total communication cost for a message of size m words to traverse l communication links is
tcomm = ts + (m tw + th) l

• In most platforms, th is small and the above expression can be approximated by
tcomm = ts + m l tw
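
As a quick illustration (made-up parameter values): with ts = 50 µs, tw = 1 µs per word, th negligible, m = 1000 words and l = 4 hops, store-and-forward costs roughly 50 + 1000 × 4 × 1 = 4050 µs.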

Page 39: Parallel Computing Platforms

Routing Techniques

Passing a message from node P0 to P3 (a) through a store-and-forward communication network; (b) and (c) extending the concept to cut-through routing. The shaded regions represent the time that the message is in transit. The start-up time associated with this message transfer is assumed to be zero.

Page 40: Parallel Computing Platforms

Packet Routing

• Store-and-forward makes poor use of communication resources.

• Packet routing breaks messages into packets and pipelines them through the network.

• Since packets may take different paths, each packet must carry routing information, error checking, sequencing, and other related header information.

• The total communication time for packet routing is approximated by
tcomm = ts + th l + tw m

• The factor tw accounts for overheads in packet headers.

Page 41: Parallel Computing Platforms

Cut-Through Routing

• Takes the concept of packet routing to an extreme by further dividing messages into basic units called flits.

• Since flits are typically small, the header information must be minimized.

• This is done by forcing all flits to take the same path, in sequence.

• A tracer message first programs all intermediate routers. All flits then take the same route.

• Error checks are performed on the entire message, as opposed to flits.

• No sequence numbers are needed.

Page 42: Parallel Computing Platforms

Cut-Through Routing

• The total communication time for cut-through routing is approximated by
tcomm = ts + l th + tw m

• This is identical to the expression for packet routing; however, tw is typically much smaller.

Page 43: Parallel Computing Platforms

Simplified Cost Model for Communicating Messages

• The cost of communicating a message between two nodes l hops away using cut-through routing is given by
tcomm = ts + l th + tw m

• In this expression, th is typically smaller than ts and tw. For this reason, the second term on the RHS can be ignored, particularly when m is large.

• Furthermore, it is often not possible to control routing and placement of tasks.

• For these reasons, we can approximate the cost of message transfer by
tcomm = ts + tw m
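
A small sketch (illustrative C code with made-up parameter values, not measurements) evaluating the store-and-forward, cut-through, and simplified cost expressions side by side:

/* Compare the message-passing cost models from the preceding slides. */
#include <stdio.h>

int main(void) {
    double ts = 50.0;    /* start-up time (e.g. microseconds) */
    double th = 0.5;     /* per-hop time                      */
    double tw = 1.0;     /* per-word transfer time            */
    double m  = 1024.0;  /* message size in words             */
    double l  = 4.0;     /* number of hops                    */

    double sf   = ts + (m * tw + th) * l;   /* store-and-forward            */
    double ct   = ts + l * th + tw * m;     /* cut-through / packet routing */
    double simp = ts + tw * m;              /* simplified model             */

    printf("store-and-forward: %.1f\n", sf);
    printf("cut-through:       %.1f\n", ct);
    printf("simplified:        %.1f\n", simp);
    return 0;
}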

Page 44: Parallel Computing Platforms

Notes on the Simplified Cost Model

• The given cost model allows the design of algorithms in an architecture-independent manner

• However, the following assumptions are made:
– Communication between any pair of nodes takes equal time
– The underlying network is uncongested
– The underlying network is completely connected
– Cut-through routing is used