Introduction to Parallel Processing Shantanu Dutt University of Illinois at Chicago


Page 1: Intro to Parallel Processing

Introduction to Parallel Processing

Shantanu Dutt, University of Illinois at Chicago

Page 2: Intro to Parallel Processing

2

Acknowledgements

Ashish Agrawal, IIT Kanpur, "Fundamentals of Parallel Processing" (slides), with some modifications and augmentations by Shantanu Dutt

John Urbanic, "Parallel Computing: Overview" (slides), with some modifications and augmentations by Shantanu Dutt

John Mellor-Crummey, "COMP 422 Parallel Computing: An Introduction", Department of Computer Science, Rice University (slides), with some modifications and augmentations by Shantanu Dutt

Page 3: Intro to Parallel Processing

3

Outline

The need for explicit multi-core/processor parallel processing: Moore's Law and its limits
Different uni-processor performance enhancement techniques and their limits
Applications for parallel processing
  Overview of different applications
  An example parallel algorithm
Classification of parallel computations
Classification of parallel architectures
  Including an example of an SPMD parallel algorithm
Summary

Some text from: Fund. of Parallel Processing, A. Agrawal, IIT Kanpur

Page 4: Intro to Parallel Processing

4

Outline

The need for explicit multi-core/processor parallel processing: Moore's Law and its limits
Different uni-processor performance enhancement techniques and their limits
Applications for parallel processing
  Overview of different applications
  An example parallel algorithm
Classification of parallel computations
Classification of parallel architectures
  Including an example of an SPMD parallel algorithm
Summary

Some text from: Fund. of Parallel Processing, A. Agrawal, IIT Kanpur

Page 5: Intro to Parallel Processing

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur 5

Moore’s Law & Need for Parallel Processing

Chip performance doubles every 18-24 months.

Power consumption is proportional to clock frequency.

Limits of serial computing: heating issues, limits to transmission speeds, leakage currents, limits to miniaturization.

Multi-core processors are already commonplace.

Most high-performance servers are already parallel.
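The slide's note that power scales with frequency can be made precise with the standard dynamic-power relation (a textbook formula added here for context, not taken from the slide):

  P_dyn ≈ C · V^2 · f

where C is the switched capacitance, V the supply voltage, and f the clock frequency. Since sustaining a higher f has historically also required a higher (or at least non-decreasing) V, pushing clock rates up drives power and heat up super-linearly, which is one of the heating limits listed above.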

Page 6: Intro to Parallel Processing

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur 6

Quest for Performance

Pipelining, superscalar architecture, out-of-order execution, caches, instruction set design advancements.

Parallelism: multi-core processors, clusters, grids. This is the future.

Page 7: Intro to Parallel Processing

Top text from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur 7

Pipelining

Illustration of a pipeline using fetch, load, execute, and store stages.
At the start of execution: wind-up (filling the pipeline). At the end of execution: wind-down (draining it).
Pipeline stalls, caused by data dependencies (RAW, WAR), resource conflicts, or incorrect branch prediction, hurt performance and speedup.
Pipeline depth: the number of pipeline stages, i.e., how many instructions can be in execution simultaneously. The Intel Pentium 4 had about 35 stages.

Page 8: Intro to Parallel Processing

• T_pipe(n), the pipelined time to process n instructions, is

    T_pipe(n) = fill_time + n · max{t_i}  ≈  n · max{t_i}   for large n,

  since fill_time is a constant with respect to n; here t_i is the execution time of the i-th stage.

• The pipelined throughput is therefore 1 / max{t_i}. (A small numeric sketch follows below.)

8

Pipelining
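As a quick numeric check of these formulas, here is a minimal C sketch (illustrative only; the stage latencies and instruction count are made-up values, not from the slides):

#include <stdio.h>

int main(void) {
    /* Hypothetical per-stage latencies (in ns) for fetch, load, execute, store. */
    double t[] = {1.0, 1.2, 2.0, 1.1};
    int k = sizeof t / sizeof t[0];          /* number of pipeline stages        */
    long n = 1000000;                        /* number of instructions (made up) */

    double sum = 0.0, tmax = 0.0;
    for (int i = 0; i < k; i++) {
        sum += t[i];
        if (t[i] > tmax) tmax = t[i];
    }

    /* In a clocked pipeline every stage takes max{t_i}; the fill time
     * (k-1)*max{t_i} is a constant with respect to n, as on the slide. */
    double fill_time = (k - 1) * tmax;
    double t_serial  = n * sum;              /* one instruction at a time        */
    double t_pipe    = fill_time + n * tmax; /* T_pipe(n)                        */

    printf("serial %.0f ns, pipelined %.0f ns, speedup %.2f, throughput %.2f instr/ns\n",
           t_serial, t_pipe, t_serial / t_pipe, 1.0 / tmax);
    return 0;
}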

Page 9: Intro to Parallel Processing

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur 9

Cache

Desire for fast, cheap, and non-volatile memory.
Memory speed grows at about 7% per annum while processor speed grows at about 50% p.a.
Cache: fast, small memory. L1 and L2 caches.
Retrieval from main memory takes several hundred clock cycles; retrieval from the L1 cache takes on the order of one clock cycle, and from the L2 cache on the order of 10 clock cycles.
Cache 'hit' and 'miss'.
Prefetching is used to avoid cache misses at the start of program execution.
Cache lines are used to avoid latency in case of a cache miss (a whole line is fetched at once).
Order of search: L1 cache -> L2 cache -> RAM -> disk.
Cache coherency: correctness of shared data; important for distributed parallel computing.
Limit to cache improvement: improving cache performance will at most bring memory efficiency up to match processor efficiency.
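To make the effect of cache lines and locality concrete, here is a small, purely illustrative C sketch (not from the slides): traversing a matrix row by row touches consecutive addresses and reuses each fetched cache line, while column-by-column traversal of a large matrix tends to miss on nearly every access.

#include <stdio.h>
#include <time.h>

#define N 2048

static double a[N][N];   /* stored row-major in C */

int main(void) {
    double sum = 0.0;
    clock_t t0, t1;

    t0 = clock();
    for (int i = 0; i < N; i++)          /* row-major: consecutive addresses,  */
        for (int j = 0; j < N; j++)      /* roughly one miss per cache line    */
            sum += a[i][j];
    t1 = clock();
    printf("row-major:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (int j = 0; j < N; j++)          /* column-major: large stride,        */
        for (int i = 0; i < N; i++)      /* typically a miss on every access   */
            sum += a[i][j];
    t1 = clock();
    printf("column-major: %.3f s  (sum=%g)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, sum);
    return 0;
}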

Page 10: Intro to Parallel Processing

10

(Figure annotations; the figure itself is not in the transcript:
  exs. of limited data parallelism;
  exs. of limited & low-level functional parallelism;
  single-instruction, multiple-data;
  instruction-level parallelism: degree generally low and dependent on how the sequential code has been written, so not very effective.)

Page 11: Intro to Parallel Processing

11

Page 12: Intro to Parallel Processing

12

Page 13: Intro to Parallel Processing

13

Page 14: Intro to Parallel Processing

14

Page 15: Intro to Parallel Processing

15

(simultaneous multi-threading)

(multi-threading)

Page 16: Intro to Parallel Processing

16

Page 17: Intro to Parallel Processing

Thus …: Two Fundamental Issues in Future High Performance

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur 17

Microprocessor performance improvement via various implicit and explicit parallelism schemes and technology improvements is reaching (has reached?) a point of diminishing returns

Thus the need for the development of explicit parallel algorithms that are based on a fundamental understanding of the parallelism inherent in a problem, and that exploit that parallelism with minimum interaction/communication between the parallel parts

Page 18: Intro to Parallel Processing

18

Outline

The need for explicit multi-core/processor parallel processing: Moore's Law and its limits
Different uni-processor performance enhancement techniques and their limits
Applications for parallel processing
  Overview of different applications
  An example parallel algorithm
Classification of parallel computations
Classification of parallel architectures
  Including an example of an SPMD parallel algorithm
Summary

Some text from: Fund. of Parallel Processing, A. Agrawal, IIT Kanpur

Page 19: Intro to Parallel Processing

19

Page 20: Intro to Parallel Processing

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur 20

Page 21: Intro to Parallel Processing

21

Page 22: Intro to Parallel Processing

22

Page 23: Intro to Parallel Processing

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur 23

Applications of Parallel Processing

Page 24: Intro to Parallel Processing

24

Page 25: Intro to Parallel Processing

25

Page 26: Intro to Parallel Processing

26

Page 27: Intro to Parallel Processing

27

Page 28: Intro to Parallel Processing

28

Page 29: Intro to Parallel Processing

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur 29

An example parallel algorithm for a finite-element computation

Easy parallel situation: each data part is independent. No communication is required between the execution units solving two different parts.

Next level: simple, structured, and sparse communication is needed.

Example: the heat equation. The initial temperature is zero on the boundaries and high in the middle; the boundary temperature is held at zero. The calculation for an element depends upon its neighbor elements.

data1  data2  ...  data N

Page 30: Intro to Parallel Processing

Code from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur 30

find out if I am MASTER or WORKER

if I am MASTER
  initialize array
  send each WORKER starting info and subarray
  do until all WORKERS converge
    gather from all WORKERS convergence data
    broadcast to all WORKERS convergence signal
  end do
  receive results from each WORKER

else if I am WORKER
  receive from MASTER starting info and subarray
  do until solution converged {
    update time
    send (non-blocking?) neighbors my border info
    receive (non-blocking?) neighbors' border info
    update interior of my portion of solution array (see computation given in the serial code)
    wait for non-blocking communication (if any) to complete
    update border of my portion of solution array
    determine if my solution has converged
    if so { send MASTER convergence signal
            recv. from MASTER convergence signal }
  end do }
  send MASTER results
endif

Serial code:

do y = 2, N-1
  do x = 2, M-1
    u2(x,y) = u1(x,y) + cx*[u1(x+1,y) + u1(x-1,y)] + cy*[u1(x,y+1) + u1(x,y-1)]   /* cx, cy are constants */
  enddo
enddo
u1 = u2

(Figure labels: Master (can be one of the workers); Workers; Problem Grid)
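The pseudocode above maps naturally onto MPI. The following is a minimal sketch of the WORKER side only, assuming a 1-D strip decomposition, MPI as the message-passing layer, and made-up array sizes, tags, and tolerance; it is an illustration, not the course's actual code.

#include <mpi.h>
#include <math.h>
#include <string.h>

#define M    128          /* columns per row, including boundary columns (assumed) */
#define ROWS 64           /* interior rows owned by this worker (assumed)          */
#define TOL  1e-4

/* up/down are the neighbor ranks (MPI_PROC_NULL if there is no neighbor). */
void worker(MPI_Comm comm, int up, int down, double cx, double cy)
{
    /* rows 0 and ROWS+1 are ghost rows that hold the neighbors' border data */
    static double u1[ROWS + 2][M], u2[ROWS + 2][M];
    int rank, converged = 0;
    MPI_Comm_rank(comm, &rank);

    /* (initialization of u1 from the MASTER's subarray is omitted here) */

    while (!converged) {
        MPI_Request req[4];

        /* send my border rows to my neighbors, receive theirs (non-blocking) */
        MPI_Isend(u1[1],        M, MPI_DOUBLE, up,   0, comm, &req[0]);
        MPI_Isend(u1[ROWS],     M, MPI_DOUBLE, down, 0, comm, &req[1]);
        MPI_Irecv(u1[0],        M, MPI_DOUBLE, up,   0, comm, &req[2]);
        MPI_Irecv(u1[ROWS + 1], M, MPI_DOUBLE, down, 0, comm, &req[3]);

        /* update interior rows, which do not need the ghost rows */
        double diff = 0.0;
        for (int y = 2; y <= ROWS - 1; y++)
            for (int x = 1; x <= M - 2; x++) {
                u2[y][x] = u1[y][x] + cx * (u1[y][x + 1] + u1[y][x - 1])
                                    + cy * (u1[y + 1][x] + u1[y - 1][x]);
                diff = fmax(diff, fabs(u2[y][x] - u1[y][x]));
            }

        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);   /* ghost rows are now valid */

        /* update my two border rows (1 and ROWS), which do need the ghost rows */
        int border[2] = {1, ROWS};
        for (int b = 0; b < 2; b++) {
            int y = border[b];
            for (int x = 1; x <= M - 2; x++) {
                u2[y][x] = u1[y][x] + cx * (u1[y][x + 1] + u1[y][x - 1])
                                    + cy * (u1[y + 1][x] + u1[y - 1][x]);
                diff = fmax(diff, fabs(u2[y][x] - u1[y][x]));
            }
        }
        memcpy(u1[1], u2[1], ROWS * M * sizeof(double));  /* keep ghost rows as received */

        /* MASTER (assumed rank 0, itself one of the workers) gathers convergence
         * data and broadcasts the convergence signal, as in the pseudocode above */
        double global_diff = 0.0;
        MPI_Reduce(&diff, &global_diff, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
        if (rank == 0) converged = (global_diff < TOL);
        MPI_Bcast(&converged, 1, MPI_INT, 0, comm);
    }
    /* the results would then be sent back to the MASTER (not shown) */
}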

Page 31: Intro to Parallel Processing

31

Outline

The need for explicit multi-core/processor parallel processing: Moore's Law and its limits
Different uni-processor performance enhancement techniques and their limits
Applications for parallel processing
  Overview of different applications
  An example parallel algorithm
Classification of parallel computations
Classification of parallel architectures
  Including an example of an SPMD parallel algorithm
Summary and future advances

Some text from: Fund. of Parallel Processing, A. Agrawal, IIT Kanpur

Page 32: Intro to Parallel Processing

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur 32

Parallelism - a simplistic understanding: multiple tasks at once; distribute work onto multiple execution units.

A classification of parallelism:
  Data parallelism
  Functional or control parallelism

Data parallelism: divide the data set and solve each sector "similarly" on a separate execution unit.

Functional parallelism: divide the 'problem' into different tasks and execute the tasks on different units. What would functional parallelism look like for the example on the right? (A code sketch contrasting the two kinds of parallelism follows below.)

(Figure labels: Sequential; Data Parallelism)
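As a concrete, purely illustrative contrast between the two kinds of parallelism, the following C/OpenMP sketch works on an array: the data-parallel routine splits the same operation across the array, while the functional-parallel routine runs two different tasks concurrently. The array and function names are assumptions, not from the slides.

#include <omp.h>
#include <stdio.h>

#define N 1000000
static double a[N], b[N];

/* Data parallelism: the same operation is applied to different parts of the
 * data set by different threads (each thread gets a chunk of the i-range). */
void scale_data_parallel(double s)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] = s * a[i];
}

/* Functional (control) parallelism: two different tasks run concurrently on
 * different execution units. */
void stats_functional_parallel(double *sum, double *maxval)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        {   /* task 1: accumulate the sum */
            double s = 0.0;
            for (int i = 0; i < N; i++) s += a[i];
            *sum = s;
        }
        #pragma omp section
        {   /* task 2: find the maximum */
            double m = a[0];
            for (int i = 1; i < N; i++) if (a[i] > m) m = a[i];
            *maxval = m;
        }
    }
}

int main(void)
{
    for (int i = 0; i < N; i++) a[i] = (double)(i % 100);
    double sum, maxval;
    scale_data_parallel(2.0);
    stats_functional_parallel(&sum, &maxval);
    printf("sum = %g, max = %g, b[1] = %g\n", sum, maxval, b[1]);
    return 0;
}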

Page 33: Intro to Parallel Processing

16/12/2008 Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur 33

Data Parallelism

Functional Parallelism

Page 34: Intro to Parallel Processing

Flynn's Classification

Flynn's classical taxonomy is based on the number of instruction/task streams and data streams:

Single Instruction, Single Data streams (SISD): your single-core uni-processor PC.

Single Instruction, Multiple Data streams (SIMD): special-purpose, low-granularity multi-processor machines with a single control unit relaying the same instruction to all processors (with different data) every clock cycle (e.g., NVIDIA graphics co-processors with 1000s of simple cores).

Multiple Instruction, Single Data streams (MISD): pipelining is a major example.

Multiple Instruction, Multiple Data streams (MIMD): the most prevalent model. SPMD (Single Program, Multiple Data) is a very useful subset. Note that SPMD is very different from SIMD. Why? (See the sketch below.)

Data vs. control parallelism is another classification, independent of Flynn's.

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur 34
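To make the SIMD/SPMD distinction concrete, here is a minimal SPMD sketch in C with MPI (an illustration, not from the slides): every process runs the same program, but each follows its own control flow at its own pace, unlike SIMD, where one control unit drives all processing elements in lockstep.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Same program on every process, but control flow diverges by rank:
     * this is SPMD, not SIMD (there is no common instruction stream). */
    if (rank == 0) {
        printf("rank 0 of %d acting as coordinator\n", size);
    } else if (rank % 2 == 0) {
        /* even ranks could run one kind of computation ... */
        printf("rank %d: even-rank task\n", rank);
    } else {
        /* ... odd ranks another, each at its own pace */
        printf("rank %d: odd-rank task\n", rank);
    }

    MPI_Finalize();
    return 0;
}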

Page 35: Intro to Parallel Processing

Flynn’s Classification (cont’d).

35

Page 36: Intro to Parallel Processing

Flynn’s Classification (cont’d).

36

Page 37: Intro to Parallel Processing

Flynn’s Classification (cont’d).

37

Page 38: Intro to Parallel Processing

Flynn’s Classification (cont’d).

38

Page 39: Intro to Parallel Processing

Flynn’s Classification (cont’d).

39

Page 40: Intro to Parallel Processing

Flynn’s Classification (cont’d).

40

• Data parallelism: SIMD and SPMD fall into this category.
• Functional parallelism: MISD falls into this category.
• MIMD can incorporate both data and functional parallelism (the latter either at the instruction level, with different instructions being executed across the processors at any time, or in the high-level function space).

Page 41: Intro to Parallel Processing

41

Outline

The need for explicit multi-core/processor parallel processing: Moore's Law and its limits
Different uni-processor performance enhancement techniques and their limits
Applications for parallel processing
  Overview of different applications
  An example parallel algorithm
Classification of parallel computations
Classification of parallel architectures
  Including an example of an SPMD parallel algorithm
Summary

Some text from: Fund. of Parallel Processing, A. Agrawal, IIT Kanpur

Page 42: Intro to Parallel Processing

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur 42

Parallel Arch. Classification

Multi-processor architectures:

Distributed memory: the most prevalent architecture model for # processors > 8
  Indirect interconnection n/ws
  Direct interconnection n/ws

Shared memory
  Uniform Memory Access (UMA)
  Non-Uniform Memory Access (NUMA): distributed shared memory


Page 43: Intro to Parallel Processing

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur 43

Distributed Memory—Message Passing Architectures

Each processor P (with its own local cache C) is connected to exclusive local memory, i.e. no other CPU has direct access to it.

Each node comprises at least one network interface (NI) that mediates the connection to a communication network.

On each CPU runs a serial process that can communicate with other processes on other CPUs by means of the network.

Non-blocking vs Blocking communication

Direct vs Indirect Communication/Interconnection network

Example: A 2x4 mesh n/w (direct connection n/w)
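As an illustration of the blocking vs. non-blocking distinction mentioned above, here is a short sketch using MPI (one common message-passing interface; the function and buffer names are illustrative, not from the slides):

#include <mpi.h>

/* Blocking exchange: MPI_Sendrecv (or a matched MPI_Send/MPI_Recv pair) does
 * not return until it is safe to reuse or read the buffers; pairing blocking
 * calls carelessly can deadlock. */
void exchange_blocking(int partner, double *out, double *in, int n, MPI_Comm comm)
{
    MPI_Sendrecv(out, n, MPI_DOUBLE, partner, 0,
                 in,  n, MPI_DOUBLE, partner, 0,
                 comm, MPI_STATUS_IGNORE);
}

/* Non-blocking exchange: MPI_Isend/MPI_Irecv return immediately, letting the
 * caller overlap computation with communication before MPI_Waitall. */
void exchange_nonblocking(int partner, double *out, double *in, int n, MPI_Comm comm)
{
    MPI_Request req[2];
    MPI_Irecv(in,  n, MPI_DOUBLE, partner, 0, comm, &req[0]);
    MPI_Isend(out, n, MPI_DOUBLE, partner, 0, comm, &req[1]);
    /* ... do useful work here that does not touch 'out' or 'in' ... */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}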

Page 44: Intro to Parallel Processing

The ARGO Beowulf Cluster at UIC (http://accc.uic.edu/service/argo-cluster)

44

• Has 56 compute nodes/computers and a master node.

• "Master" here has a different meaning than the "master" node in a parallel algorithm (e.g., the one we saw for the finite-element heat-distribution problem). Here it is a system front-end where you log in and perform various tasks before submitting your parallel code to run on several compute nodes. The algorithmic "master" would actually be one of the compute nodes, and generally distributes data to the other compute nodes, monitors the progress of the computation, determines the end of the computation, etc., and may also perform a part of the computation.

• Compute nodes are divided among 14 zones, each zone containing 4 nodes connected as a ring network. Zones are connected to each other by a higher-level n/w.

• Each node (compute or master) has 2 processors. The processors are single-core on some nodes and dual-core on others; see http://accc.uic.edu/service/arg/nodes


Page 45: Intro to Parallel Processing

System Computational Actions in a Message-Passing Program

45

(a) Two basic parallel processes X and Y, and their data dependency:

  Proc. X: a := b+c;        Proc. Y: b := x*y;

(b) Their mapping to a message-passing multicomputer, with X and Y on different processors/cores P(X) and P(Y), and the data item "b" passed as a message over a link (direct or indirect) between the two processors:

  Proc. X:                          Proc. Y:
    recv(P2, b);  /* blocking */      b := x*y;
    a := b+c;                         send(P1, b);  /* non-blocking */

Message-passing mapping

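In MPI terms, the mapping in (b) might look like the following minimal sketch (illustrative only; the ranks, tags, and the choice of MPI itself are assumptions, not from the slide):

#include <mpi.h>

/* Process Y (rank 1, assumed): produces b and sends it to X. */
void proc_Y(double x, double y, MPI_Comm comm)
{
    MPI_Request req;
    double b = x * y;
    /* non-blocking send, as in the slide: Y can continue without waiting */
    MPI_Isend(&b, 1, MPI_DOUBLE, /*dest=*/0, /*tag=*/0, comm, &req);
    /* ... other work ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* before reusing b */
}

/* Process X (rank 0, assumed): needs b before it can compute a. */
double proc_X(double c, MPI_Comm comm)
{
    double a, b;
    /* blocking receive, as in the slide: X cannot proceed without b */
    MPI_Recv(&b, 1, MPI_DOUBLE, /*source=*/1, /*tag=*/0, comm, MPI_STATUS_IGNORE);
    a = b + c;
    return a;
}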

Page 46: Intro to Parallel Processing

(Figure labels: Dual-Core; Quad-Core; L1 cache; L2 cache)

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur 46

Distributed Shared Memory Arch.: UMA

Flat memory model.
Memory bandwidth and latency are the same for all processors and all memory locations.
Simplest example: a dual-core processor.
Most commonly represented today by Symmetric Multiprocessor (SMP) machines.
Cache-coherent UMA: consistent cached values of the same data item in different processor/core caches.


Page 47: Intro to Parallel Processing

System Computational Actions in a Shared-Memory Program

47

(a) Two basic parallel processes X and Y, and their data dependency:

  Proc. X: a := b+c;        Proc. Y: b := x*y;

(b) Their mapping to a shared-memory multiprocessor, with processors P(X) and P(Y) connected to a shared memory:

  Proc. X: a := b+c;        Proc. Y: b := z*w;

Shared-memory mapping

Possible actions by the O.S. on the writer's (Y's) side:
(i) Since "b" is a shared data item (e.g., designated by the compiler or programmer), check "b"'s location to see if it can be written to (all previous reads done: read_cntr for "b" = 0).
(ii) If so, write "b" to its location and mark its status bit as written by "Y". Initialize read_cntr for "b" to a pre-determined value.

Possible actions by the O.S. on the reader's (X's) side:
(i) Since "b" is a shared data item (e.g., designated by the compiler or programmer), check "b"'s location to see if it has been written to by "Y" or by any process (if we don't care about the writing process).
(ii) If so { read "b" and decrement read_cntr for "b" } else go to (i) and busy-wait (check periodically).
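The read-counter scheme described above resembles full/empty-bit style synchronization. Below is a minimal user-level analogue in C, using C11 atomics and threads; this is purely illustrative (real shared-memory systems rely on hardware cache coherence plus locks, semaphores, or condition variables rather than this exact scheme, and some toolchains may require pthreads instead of <threads.h>).

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

#define NUM_READERS 1                        /* pre-determined value for read_cntr */

double b;                                    /* the shared data item               */
atomic_int b_written = 0;                    /* "status bit": written by Y?        */
atomic_int read_cntr = 0;                    /* outstanding reads of b             */

int proc_Y(void *arg)                        /* writer: b := z*w                   */
{
    (void)arg;
    while (atomic_load(&read_cntr) != 0)     /* all previous reads done?           */
        thrd_yield();
    b = 3.0 * 4.0;                           /* write b                            */
    atomic_store(&read_cntr, NUM_READERS);   /* initialize read counter            */
    atomic_store(&b_written, 1);             /* mark status bit                    */
    return 0;
}

int proc_X(void *arg)                        /* reader: a := b+c                   */
{
    double c = *(double *)arg, a;
    while (!atomic_load(&b_written))         /* busy-wait until written            */
        thrd_yield();
    a = b + c;                               /* read b                             */
    atomic_fetch_sub(&read_cntr, 1);         /* decrement read counter             */
    printf("a = %g\n", a);
    return 0;
}

int main(void)
{
    double c = 1.0;
    thrd_t tx, ty;
    thrd_create(&tx, proc_X, &c);
    thrd_create(&ty, proc_Y, NULL);
    thrd_join(tx, NULL);
    thrd_join(ty, NULL);
    return 0;
}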


Page 48: Intro to Parallel Processing

Most text from Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur 48

Distributed Shared Memory Arch.: NUMA

Memory is physically distributed but logically shared. The physical layout is similar to the distributed-memory, message-passing case.
The aggregated memory of the whole system appears as one single address space.
Due to the distributed nature, memory access performance varies depending on which CPU accesses which part of memory ("local" vs. "remote" access).
Two locality domains linked through a high-speed connection called HyperTransport (in general via a link, as in message-passing architectures, only here these links are used by the O.S. to transmit read/write non-local data to/from the processor/non-local memory).
Advantage: scalability (compared to UMA).
Disadvantages: (a) locality problems and connection congestion; (b) not a natural parallel programming/algorithm model (it is easier to partition data among processors than to think of all of it occupying a large monolithic address space that each processor can access).

(Figure annotation: all-to-all (complete graph) connection via a combination of direct and indirect connections.)


Page 49: Intro to Parallel Processing

49

Page 50: Intro to Parallel Processing

50

Page 51: Intro to Parallel Processing

An example of an SPMD message-passing parallel program

51

Page 52: Intro to Parallel Processing

SPMD message-passing parallel program (contd.)

52

(Slide fragment: "node xor D, 1"; the rest of the slide is not recoverable from the transcript.)
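The only legible fragment of this slide is "node xor D". One common SPMD message-passing pattern that is consistent with an XOR-based partner computation is hypercube-style pairing, where in dimension d each node exchanges data with the node whose ID differs in bit d. The sketch below illustrates that general pattern in C with MPI; it is an assumption-laden illustration, not a reconstruction of the slide's algorithm.

#include <mpi.h>
#include <stdio.h>

/* Hypercube-style all-reduce sketch: in each dimension d, node "me" exchanges
 * its partial sum with partner = me XOR (1 << d).
 * Assumes the number of processes is a power of two. */
int main(int argc, char **argv)
{
    int me, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double sum = (double)me;             /* each node's local value */

    for (int d = 1; d < p; d <<= 1) {    /* d = 1, 2, 4, ...: one bit per dimension */
        int partner = me ^ d;            /* "node xor D" */
        double other;
        MPI_Sendrecv(&sum, 1, MPI_DOUBLE, partner, 0,
                     &other, 1, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sum += other;                    /* combine partial results */
    }

    printf("node %d: global sum = %g\n", me, sum);
    MPI_Finalize();
    return 0;
}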

Page 53: Intro to Parallel Processing

53

How to interconnect the multiple cores/processors is a major consideration in a parallel architecture

Page 54: Intro to Parallel Processing

54

(Figure/table residue: performance columns in Tflops and power in kW; the underlying data is not recoverable from the transcript.)


Page 55: Intro to Parallel Processing

Most text from: Fund. of Parallel Processing, A. Agrawal, IIT Kanpur 55

Summary

Serial computers / microprocessors will probably not get much faster; parallelization is unavoidable.
Pipelining, caches, and other optimization strategies for serial computers are reaching a plateau.
Application examples.
Data and functional parallelism.
Flynn's taxonomy: SIMD, MISD, MIMD/SPMD.
Parallel architectures intro:
  Distributed memory
  Shared memory
    Uniform Memory Access
    Non-Uniform Memory Access
Parallel program/algorithm examples.

Page 56: Intro to Parallel Processing

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur 56

Additional References

Computer Organization and Design – Patterson and Hennessy
Modern Operating Systems – Tanenbaum
Concepts of High Performance Computing – Georg Hager and Gerhard Wellein
Cramming More Components onto Integrated Circuits – Gordon Moore, 1965
Introduction to Parallel Computing – https://computing.llnl.gov/tutorials/parallel_comp
The Landscape of Parallel Computing Research: A View from Berkeley – 2006