

1

Introduction to Parallel Computing


2

Multiprocessor Architectures

• Message-Passing Architectures
– Separate address space for each processor.
– Processors communicate via message passing.

• Shared-Memory Architectures
– Single address space shared by all processors.
– Processors communicate by memory reads/writes.
– SMP or NUMA.
– Cache coherence is an important issue.

• Lots of middle ground and hybrids.
• No clear consensus on terminology.


3

Message-Passing Architecture

[Diagram: several nodes, each consisting of a processor with its own cache and local memory, connected by an interconnection network.]


4

Shared-Memory Architecture

[Diagram: processors 1,…,N, each with its own cache, connected through an interconnection network to shared memory modules 1,…,M.]


5

Shared-Memory Architecture: SMP and NUMA

• SMP = Symmetric Multiprocessor
– All memory is equally close to all processors.
– The typical interconnection network is a shared bus.
– Easier to program, but doesn't scale to many processors.

• NUMA = Non-Uniform Memory Access
– Each memory is closer to some processors than to others.
– a.k.a. "Distributed Shared Memory".
– The interconnection is typically a grid or hypercube.
– Harder to program, but scales to more processors.


6

Shared-Memory Architecture: Cache Coherence

• Effective caching reduces memory contention.
• Processors must see a single consistent memory.
• There are many different consistency models.
• Weak consistency is sufficient.
• Snoopy cache coherence for bus-based SMPs.
• Distributed directories for NUMA.
• Many implementation issues: multiple cache levels, I-D separation (separate instruction and data caches), cache line size, update policy, etc.
• Usually you don't need to know all the details.


7

Example: Quad-Processor Pentium Pro

• SMP, bus interconnection.
• 4 x 200 MHz Intel Pentium Pro processors.
• 8 + 8 KB L1 cache per processor.
• 512 KB L2 cache per processor.
• Snoopy cache coherence.
• Vendors: Compaq, HP, IBM, NetPower.
• Operating systems: Windows NT, Solaris, Linux, etc.


8

Diplopodus

• Beowulf-based cluster of Linux/Intel workstations.
• 24 PCs connected by a 100 Mbit switch.
• Node configuration:
– 2 x 500 MHz Pentium III
– 512 MB RAM
– 12-16 GB disk


9

The first program

• Purpose: illustrate the notation.

• Given:
– Length of vectors M.
– Data xm, ym, m = 0,1,…,M-1, of real numbers, and two real scalars α and β.

• Compute:
– z = αx + βy, i.e., z[m] = αx[m] + βy[m] for m = 0,1,…,M-1.


10

Program Vector_Sum_1

declare
  m: integer;
  x, y, z: array[0,1,…,M-1] of real;
initially
  <; m: 0 ≤ m < M :: x[m] = xm, y[m] = ym>
assign
  <|| m: 0 ≤ m < M :: z[m] = αx[m] + βy[m]>
end

Here || is a concurrency operator. It means that if two operations O1 and O2 are separated by ||, i.e. O1 || O2, then the two operations can be performed concurrently, independently of each other.

In addition,

<|| m: 0 ≤ m < M :: Om>

is short for O0 || O1 || … || OM-1, meaning that all M operations can be performed concurrently.
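A minimal Python sketch of the same computation (the values of M, alpha, and beta are arbitrary choices for illustration). The explicit loop is the sequential reading; the whole-array NumPy expression mirrors the concurrent assignment <|| m: 0 ≤ m < M :: …>, which is legal because all M assignments are independent.

import numpy as np

M = 8
alpha, beta = 2.0, -1.0
x = np.arange(M, dtype=float)   # plays the role of the data xm
y = np.ones(M)                  # plays the role of the data ym

# Sequential version: one element at a time, in index order.
z_seq = np.empty(M)
for m in range(M):
    z_seq[m] = alpha * x[m] + beta * y[m]

# "Concurrent" version: all M assignments are independent, so they can be
# written as one whole-array operation (which the library may parallelise).
z_par = alpha * x + beta * y

assert np.allclose(z_seq, z_par)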


11

Sequential assignment

initially
  a = 1, b = 2
assign
  a := b; b := a

results in a = b = 2.

Concurrent assignment

initially
  a = 1, b = 2
assign
  a := b || b := a

results in a = 2, b = 1.
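The same distinction can be illustrated in Python (an added sketch): two sequential statements reuse the already-updated value, while a simultaneous tuple assignment evaluates both right-hand sides first, which matches the concurrent semantics.

# Sequential assignment: the second statement already sees the new a.
a, b = 1, 2
a = b
b = a
print(a, b)   # prints: 2 2

# Simultaneous assignment: both right-hand sides are evaluated first,
# mimicking the concurrent a := b || b := a.
a, b = 1, 2
a, b = b, a
print(a, b)   # prints: 2 1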


12

A model of a parallel computer

• P processors (nodes); p = 0,1,…,P-1.
• All processors are identical.
• All processors compute sequentially.
• All nodes can communicate with any other node.
• The communication is handled by mechanisms for sending and receiving data at each processor.


13

Data distribution

• Suppose we want to distribute a vector x with M elements x0,…,xM-1 over a collection of P identical computers.

• On each computer, define the index set

Jp = {0,1,…,Ip-1},

where Ip is the number of indices stored at processor p.

• Assume I0 + I1 + … + IP-1 = M, so that

x = (x0,…,xI0-1, … , xM-1),

with the first I0 entries stored on processor 0, and so on, up to the last IP-1 entries stored on processor P-1.


14

• A proper data distribution defines a one-to-one mapping μ from a global index m to a local index i on a processor p.

• For a global index m, μ(m) gives a unique local index i on a unique processor p.

• Similarly, an index i on processor p is mapped to a unique global index m = μ⁻¹(p,i).

• Globally: x = x0,…,xM-1.

• Locally: x0,…,xI0-1 on processor 0; x0,…,xI1-1 on processor 1; …; x0,…,xIP-1-1 on processor P-1.
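For concreteness, a block distribution and its index mapping can be written out in a few lines of Python (an added sketch; the names block_sizes, mu, and mu_inv and the choice of block distribution are illustrative assumptions, not prescribed by the notes).

def block_sizes(M, P):
    # I_p for a block distribution: the first M % P processors get one extra element.
    return [M // P + (1 if p < M % P else 0) for p in range(P)]

def mu(m, sizes):
    # Global index m -> (processor p, local index i).
    p = 0
    while m >= sizes[p]:
        m -= sizes[p]
        p += 1
    return p, m

def mu_inv(p, i, sizes):
    # (processor p, local index i) -> global index m.
    return sum(sizes[:p]) + i

# Round-trip check: mu and mu_inv are inverses of each other.
M, P = 10, 3
sizes = block_sizes(M, P)
for m in range(M):
    p, i = mu(m, sizes)
    assert mu_inv(p, i, sizes) == m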


15

• Purpose:
– Derive a multicomputer version of Vector_Sum_1.

• Given:
– Length of vectors M.
– Data xm, ym, m = 0,1,…,M-1, of real numbers, and two real scalars α and β.
– Number of processors P.
– Set of indices Jp = {0,1,…,Ip-1}, where the number of entries Ip on the p-th processor is given.
– A one-to-one mapping μ between global and local indices.

• Compute:

z = αx + βy, i.e., z[m] = αx[m] + βy[m] for m = 0,1,…,M-1.


16

Program Vector_Sum_2, executed concurrently by processors p = 0,…,P-1

declare
  i: integer;
  x, y, z: array[Jp] of real;
initially
  <; i: i ∈ Jp :: x[i] = x_{μ⁻¹(p,i)}, y[i] = y_{μ⁻¹(p,i)}>
assign
  <|| i: i ∈ Jp :: z[i] = αx[i] + βy[i]>
end

Notice that we have one program for each processor - all programs being identical.

In each program, the processor identifier p is known. Also, the mapping μ is assumed to be known.

The result is stored in a distributed vector z.
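One possible realisation of Vector_Sum_2 uses MPI. The Python/mpi4py sketch below is an added illustration (mpi4py, the block distribution, and the values of M, alpha, and beta are assumptions): each rank fills in its local pieces of x and y and computes its part of z without any communication, exactly as in the program above.

# Run with e.g.:  mpiexec -n 4 python vector_sum_2.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p, P = comm.Get_rank(), comm.Get_size()

M = 1000                      # global vector length (assumed value)
alpha, beta = 2.0, -1.0       # the two real scalars

# Block distribution: the global indices owned by this processor (J_p).
counts = [M // P + (1 if q < M % P else 0) for q in range(P)]
start = sum(counts[:p])
my_globals = np.arange(start, start + counts[p])

# "initially": each processor fills in its local pieces of x and y.
x = my_globals.astype(float)  # stands in for the data xm
y = np.ones_like(x)           # stands in for the data ym

# "assign": purely local work, no communication needed.
z = alpha * x + beta * y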


17

Performance analysis

Let P be the number of processors, and let

T = T(P)

denote the execution time for a program on this multicomputer.

Performance analysis is the study of the properties of T(P).

In order to analyze concurrent algorithms, we have to assume certain properties of the computer. In fact these assumptions are rather strict and thus leave out a lot of existing computers.

On the other hand, without these assumptions the analysis tends to become extremely complicated.


18

Observation

• Let T(1) be the fastest possible scalar (single-processor) computation time. Then

T(P) ≥ T(1)/P.

This relation gives a bound on how fast a computation can be done on a parallel computer compared with a scalar computer.


19

Definitions

• Speed-up: The speed-up of a P-node computation with execution time T(P) is given by

S(P) = T(1)/T(P).

• Efficiency: The efficiency of a P-node computation with speed-up S(P) is given by

η(P) = S(P)/P.
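The two definitions translate directly into code; the small helpers below are an added illustration (eta stands for η, and the timings in the example are invented).

def speedup(T1, TP):
    # S(P) = T(1) / T(P)
    return T1 / TP

def efficiency(T1, TP, P):
    # eta(P) = S(P) / P
    return speedup(T1, TP) / P

# Example: T(1) = 100 s and T(8) = 15 s on 8 processors.
print(speedup(100.0, 15.0))        # about 6.7
print(efficiency(100.0, 15.0, 8))  # about 0.83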


20

Discussion

• Suppose we are in an optimal situation, i.e., we have

T(P) = T(1)/P.

Then the speed-up is given by

S(P) = T(1)/T(P) = P,

and the efficiency is

η(P) = S(P)/P = 1.


21

More generally we have

T(P) ≥ T(1)/P,

which implies that

S(P) = T(1)/T(P) ≤ P,

and

η(P) = S(P)/P ≤ 1.

In practical computations we are pleased if we come close to the optimal results: a speed-up close to P and an efficiency close to 1 are very good. Practical details often result in weaker performance than expected from the analysis.


22

Efficiency modelling

• Goal: estimate how fast a given algorithm can run on a multicomputer. The models depend on the following parameters:

τA = arithmetic time; the time of one single arithmetic operation. Integer operations are ignored, and all nodes are assumed to be equal.

τC(L) = message exchange time; the time it takes to send a message of length L (in proper units) from one processor to another. We assume this time is the same for any pair of processors.

τL = latency; the start-up time for a communication, or the time it takes to send a message of length zero.

1/β = bandwidth; the maximum rate (in proper units) at which messages can be exchanged.


23

Efficiency modelling

In our efficiency models, we will assume that there is a linear relation between the message exchange time and the length of the message:

τC(L) = τL + βL.
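A small added helper makes the model concrete; the numerical values of tau_L and beta below are placeholders, not measurements from the notes.

def message_cost(L, tau_L, beta):
    # Linear communication model: tau_C(L) = tau_L + beta * L
    return tau_L + beta * L

tau_L = 50e-6    # assumed start-up latency: 50 microseconds
beta = 8e-9      # assumed 8 ns per unit of message length (1/beta = bandwidth)
print(message_cost(0, tau_L, beta))       # latency only
print(message_cost(10_000, tau_L, beta))  # latency + transfer time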


24

Analysis of Vector_Sum_2

Program Vector_Sum_2, executed concurrently by processors p = 0,…,P-1

declare
  i: integer;
  x, y, z: array[Jp] of real;
initially
  <; i: i ∈ Jp :: x[i] = x_{μ⁻¹(p,i)}, y[i] = y_{μ⁻¹(p,i)}>
assign
  <|| i: i ∈ Jp :: z[i] = αx[i] + βy[i]>
end

Recall that Jp = {0,1,…,Ip-1}, and define I = maxp Ip.

Then a model of the execution time is given by

T(P) = 3 maxp Ip τA = 3I τA.

Notice that there are three arithmetic operations for each entry of the array: two multiplications and one addition in z[i] = αx[i] + βy[i].
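As an added sketch, the model is easy to evaluate for a balanced block distribution (the value of tau_A is an assumption).

def T_vector_sum(M, P, tau_A):
    # T(P) = 3 * max_p I_p * tau_A, with I = ceil(M/P) for a block distribution.
    I = -(-M // P)
    return 3 * I * tau_A

tau_A = 1e-9   # assumed time per arithmetic operation (seconds)
M = 10**6
for P in (1, 2, 4, 8):
    print(P, T_vector_sum(M, P, tau_A))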


25

Load balancing

• Obviously, we would like to balance the load of the processors. Basically, we would like each of them to perform approximately the same number of operations. (Recall that we assume all processors have the same capacity.)

• In the notation used in the present vector operation, we have load balance if I is as small as possible.

• In the case that M (the number of array entries) is a multiple of P (the number of processors), we have load balance if

I = M/P,

meaning that there are equally many vector entries on each processor.


26

Speed-up

For this problem the speed-up is

S(P) = T(1)/T(P) = 3MτA / (3IτA) = M/I.

If the problem is load balanced, we have

I = M/P

and thus

S(P) = P

which is optimal.

Notice that we are typically interested in very large values of M, say M = 10^6 to 10^9. The number of processors P is usually below 1000.


27

The communication cost

In the above example, no communication at all was necessary. In the next example, one real number must be communicated from each processor to all the others.

This changes the analysis a bit!


28

The communication cost

• Purpose:

– derive a multicomputer program for computation of an inner product.

• Given

– Length of vectors M.

– Data xm, ym, m=0,1,…,M-1 of real numbers.

– Number of processors P.

– Set of indices Jp={0,1,…,Ip-1} where the number of entries Ip on the p-th processor is given.

– A one-to-one mapping between global and local indices.

• Compute

σ = (x,y), i.e., σ = x[0]·y[0] + x[1]·y[1] + … + x[M-1]·y[M-1].


29

Program Inner_Product

Program Inner_Product, executed concurrently by processors p = 0,…,P-1

declare
  i: integer;
  w: array[0,1,…,P-1] of real;
  x, y: array[Jp] of real;
initially
  <; i: i ∈ Jp :: x[i] = x_{μ⁻¹(p,i)}, y[i] = y_{μ⁻¹(p,i)}>
assign
  w[p] = <+ i: i ∈ Jp :: x[i]·y[i]>;
  send w[p] to all other processors;
  <; q: 0 ≤ q < P and q ≠ p :: receive w[q] from q>;
  σ = <+ q: 0 ≤ q < P :: w[q]>;
end
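An MPI realisation of Inner_Product can be sketched as follows (an added illustration using mpi4py; the single allreduce call performs the send-to-all, receive, and summation steps of the program in one collective operation).

# Run with e.g.:  mpiexec -n 4 python inner_product.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p, P = comm.Get_rank(), comm.Get_size()

M = 1000
counts = [M // P + (1 if q < M % P else 0) for q in range(P)]
start = sum(counts[:p])
idx = np.arange(start, start + counts[p])

x = idx.astype(float)    # local part of x
y = np.ones_like(x)      # local part of y

# Local partial sum w[p] = sum over i in J_p of x[i]*y[i].
w_p = float(np.dot(x, y))

# allreduce: every processor obtains the full inner product sigma.
sigma = comm.allreduce(w_p, op=MPI.SUM)
if p == 0:
    print("inner product =", sigma)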


30

Performance modelling of Inner_Product

• Recall Jp = {0,1,…,Ip-1} and I = maxp Ip.

• A model of the execution time for Inner_Product is given by

T(P) = (2I-1)τA + (P-1)τC(1) + (P-1)τA.

Here the first term arises from the local sum of x[i]·y[i] over i ∈ Jp (Ip multiplications and Ip-1 additions).

The second term arises from the cost of sending one real number from one processor to all the others.

The third term arises from adding up the P partial sums w[q] on each processor (P-1 additions).


31

Simplifications

Assume I = M/P, i.e., a load balanced problem.

Assume (as always) P ≪ M, and τC(1) = γτA

(for practical computers γ is quite large, 50-1000).

We then have

T(P) ≈ 2IτA + PτC(1),

or

T(P) ≈ (2M/P + γP)τA.
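An added sketch evaluating the simplified model; it also computes the P that minimises T(P), which for (2M/P + γP)τA is P = sqrt(2M/γ), about 63 for M = 10^5 and γ = 50.

import math

def T_inner_product(P, M, gamma, tau_A=1.0):
    # Simplified model: T(P) = (2M/P + gamma*P) * tau_A
    return (2 * M / P + gamma * P) * tau_A

M, gamma = 10**5, 50
best_P = math.sqrt(2 * M / gamma)        # minimiser of 2M/P + gamma*P
print("model optimum near P =", best_P)  # about 63 for these values
for P in (1, 16, 63, 256, 1000):
    print(P, T_inner_product(P, M, gamma))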


32

Example I

• Choosing M = 10^5 and γ = 50, we get

T(P) = (2·10^5/P + 50P)τA.

[Plot of T(P)/τA versus P for P = 1,…,1000.]


33

Example II

• Choosing M = 10^7 and γ = 50, we get

T(P) = (2·10^7/P + 50P)τA.

[Plot of T(P)/τA versus P for P = 1,…,1000.]


34

Speed-up

For this problem, the speed-up is

S(P) = T(1)/T(P) ≈ [(2M + γ)τA] / [(2M/P + γP)τA]

= P [1 + γ/(2M)] / [1 + γP^2/(2M)].

Optimal speed-up is characterized by S(P) ≈ P, so we must require

γP^2/(2M) ≪ 1

in order for this to be the case.
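The formula is easy to explore numerically (an added sketch): the speed-up stays close to P only as long as γP^2/(2M) remains small.

def speedup_model(P, M, gamma):
    # S(P) = P * (1 + gamma/(2M)) / (1 + gamma*P^2/(2M))
    return P * (1 + gamma / (2 * M)) / (1 + gamma * P**2 / (2 * M))

M, gamma = 10**7, 50
for P in (10, 100, 500, 1000):
    print(P, round(speedup_model(P, M, gamma), 1), gamma * P**2 / (2 * M))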