Transcript
Page 1: Parallel Programming and MPI

© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Parallel Programming and MPI: A course for IIT-M. September 2008

R Badrinath, STSD Bangalore ([email protected])

Page 2: Parallel Programming and MPI


Context and Background
• IIT-Madras has recently added a good deal of compute power.
• Why?
 − Further R&D in sciences, engineering
 − Provide computing services to the region
 − Create new opportunities in education and skills
 − …
• Why this course?
 − Update skills to program modern cluster computers
• Length: 2 theory and 2 practice sessions, 4 hrs each

Page 3: Parallel Programming and MPI


Audience Check

Page 4: Parallel Programming and MPI


Contents

1. MPI_Init
2. MPI_Comm_rank
3. MPI_Comm_size
4. MPI_Send
5. MPI_Recv
6. MPI_Bcast
7. MPI_Comm_create
8. MPI_Sendrecv
9. MPI_Scatter
10. MPI_Gather
… … … … … …

Instead we
• Understand Issues
• Understand Concepts
• Learn enough to pick up from the manual
• Go by motivating examples
• Try out some of the examples

Page 5: Parallel Programming and MPI


Outline
• Sequential vs Parallel programming
• Shared vs Distributed Memory
• Parallel work breakdown models
• Communication vs Computation
• MPI Examples
• MPI Concepts
• The role of IO

Page 6: Parallel Programming and MPI


Sequential vs Parallel
• We are used to sequential programming – C, Java, C++, etc. E.g., Bubble Sort, Binary Search, Strassen Multiplication, FFT, BLAST, …
• Main idea – Specify the steps in perfect order
• Reality – We are used to parallelism a lot more than we think – as a concept, not for programming
• Methodology – Launch a set of tasks; communicate to make progress. E.g., sorting 500 answer papers by making 5 equal piles, having them sorted by 5 people, and merging them together.

Page 7: Parallel Programming and MPI


Shared vs Distributed Memory Programming
• Shared Memory – All tasks access the same memory, hence the same data (pthreads)
• Distributed Memory – All memory is local. Data sharing is by explicitly transporting data from one task to another (send-receive pairs in MPI, e.g.)
• HW – Programming model relationship – Tasks vs CPUs
• SMPs vs Clusters

[Diagram: tasks, each with its own program and memory, connected by a communications channel]

Page 8: Parallel Programming and MPI


Designing Parallel Programs

Page 9: Parallel Programming and MPI


Simple Parallel Program – sorting numbers in a large array A
• Notionally divide A into 5 pieces [0..99; 100..199; 200..299; 300..399; 400..499].
• Each part is sorted by an independent sequential algorithm and left within its region.
• The resultant parts are merged by simply reordering among adjacent parts.

Page 10: Parallel Programming and MPI


What is different – Think about…
• How many people are doing the work. (Degree of Parallelism)
• What is needed to begin the work. (Initialization)
• Who does what. (Work distribution)
• Access to work part. (Data/IO access)
• Whether they need info from each other to finish their own job. (Communication)
• When are they all done. (Synchronization)
• What needs to be done to collate the result.

Page 11: Parallel Programming and MPI


Work Break-down
• Parallel algorithm
• Prefer simple, intuitive breakdowns
• Usually highly optimized sequential algorithms are not easily parallelizable
• Breaking work often involves some pre- or post-processing (much like divide and conquer)
• Fine vs large grain parallelism and its relationship to communication

Page 12: Parallel Programming and MPI


Digression – Let’s get a simple MPI Program to work

#include <mpi.h>
#include <stdio.h>

int main()
{
    int total_size, my_rank;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &total_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    printf("\n Total number of programs = %d, out of which rank of this process is %d\n",
           total_size, my_rank);
    MPI_Finalize();
    return 0;
}

Page 13: Parallel Programming and MPI


Getting it to work
• Compile it:
 − mpicc -o simple simple.c    # If you want HP-MPI, set your path to /opt/hpmpi/bin
• Run it
 − This depends a bit on the system
 − mpirun -np 2 simple
 − qsub -l ncpus=2 -o simple.out /opt/hpmpi/bin/mpirun <your program location>/simple
 − [Fun: qsub -l ncpus=2 -I hostname ]
• Results are in the output file.
• What is mpirun?
• What does qsub have to do with MPI?... More about qsub in a separate talk.

Page 14: Parallel Programming and MPI


What goes on
• The same program is run at the same time on 2 different CPUs
• Each is slightly different in that each returns different values for some simple calls like MPI_Comm_rank.
• This gives each instance its identity
• We can make different instances run different pieces of code based on this identity difference (see the sketch below)
• Typically it is an SPMD model of computation
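A minimal sketch of this rank-based branching (the printed messages are illustrative, not from the course material):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("I am task 0, so I take the coordinator branch\n");   /* only one instance runs this */
    else
        printf("I am task %d, so I take the worker branch\n", rank); /* every other instance runs this */
    MPI_Finalize();
    return 0;
}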

Page 15: Parallel Programming and MPI


Continuing work breakdown…

Simple Example: Find shortest distances

[Figure: a small weighted directed graph on 5 nodes (0–4) together with its distance matrix]

Let nodes be numbered 0, 1, …, n-1.
Let us put all of this in a matrix: A[i][j] is the distance from i to j.

PROBLEM: Find shortest path distances.

Page 16: Parallel Programming and MPI


Floyd’s (sequential) algorithm

for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i][j] = min(a[i][j], a[i][k] + a[k][j]);

Observation: for a fixed k, computing the i-th row needs the i-th row and the k-th row.

Page 17: Parallel Programming and MPI


Parallelizing Floyd
• Actually we just need n^2 tasks, with each task iterating n times (once for each value of k).
• After each iteration we need to make sure everyone sees the matrix.
• ‘Ideal’ for shared memory programming.
• What if we have fewer than n^2 tasks?... Say p < n.
• Need to divide the work among the p tasks.
• We can simply divide up the rows.

Page 18: Parallel Programming and MPI


Dividing the work
• Each task gets [n/p] rows, with the last possibly getting a little more.

[Figure: the matrix split into row blocks T0 … Tq …; task Tq owns the block starting at row q x [n/p]. Remember the observation: updating the i-th row needs the i-th row and the k-th row.]

Page 19: Parallel Programming and MPI


/* “id” is the TASK NUMBER; each node has only the part of A that it owns.
   This is approximate code. */

for (k = 0; k < n; k++) {
    current_owner_task = GET_BLOCK_OWNER(k);
    if (id == current_owner_task) {
        k_here = k - LOW_END_OF_MY_BLOCK(id);
        for (j = 0; j < n; j++)
            rowk[j] = a[k_here][j];
    }

    /* rowk is broadcast by the owner and received by the others..
       The MPI code will come here later */

    for (i = 0; i < GET_MY_BLOCK_SIZE(id); i++)
        for (j = 0; j < n; j++)
            a[i][j] = min(a[i][j], a[i][k] + rowk[j]);
}

The MPI Model…
− All nodes run the same code!! p replica tasks!!…
− Sometimes they need to do different things

Note that each node calls its own matrix by the same name a[][] but has only [n/p] rows.

Distributed Memory Model

Page 20: Parallel Programming and MPI


The MPI model
• Recall that MPI tasks are typically created when the job is launched – not inside the MPI program (no forking).
 − mpirun usually creates the task set
 − mpirun -np 2 a.out <args to a.out>
 − a.out is run on all nodes and a communication channel is set up between them
• Functions allow tasks to find out
 − the size of the task group
 − one’s own position within the group

Page 21: Parallel Programming and MPI


MPI Notions [ taking from the example ]
• Communicator – A group of tasks in a program
• Rank – Each task’s ID in the group
 − MPI_Comm_rank() … /* use this to set “id” */
• Size – Of the group
 − MPI_Comm_size() … /* use to set “p” */
• Notion of send/receive/broadcast…
 − MPI_Bcast() … /* use to broadcast rowk[] */
• For actual syntax use a good MPI book or manual
• Online resource: http://www-unix.mcs.anl.gov/mpi/www/

Page 22: Parallel Programming and MPI


MPI Prologue to our Floyd example

int a[MAX][MAX];
int n = 20;          /* real size of the matrix, can be read in */
int id, p;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &p);
.
.                    /* This is where all the real work happens */
.
MPI_Finalize();      /* Epilogue */

Page 23: Parallel Programming and MPI


This is the time to try out several simple MPI programs using the few functions we have seen.
- use mpicc
- use mpirun

Page 24: Parallel Programming and MPI


Visualizing the execution

The job is launched and the tasks run on CPUs. Multiple tasks/CPUs may be on the same node; the scheduler ensures 1 task per CPU.

• MPI_Init, MPI_Comm_rank, MPI_Comm_size etc…
• Other initializations, like reading in the array
• For the initial values of k, the task with rank 0 broadcasts row k; the others receive
• For each value of k they do their computation with the correct rowk
• Loop the above for all values of k
• Task 0 receives all blocks of the final array and prints them out
• MPI_Finalize

Page 25: Parallel Programming and MPI


Communication vs Computation
• Often communication is needed between iterations to complete the work.
• Often, the more tasks there are, the more the communication can become.
 − In Floyd, a bigger “p” indicates that “rowk” will be sent to a larger number of tasks.
 − If each iteration depends on more data, it can get very busy.
• This may mean network contention; i.e., delays.
• Try to count the number of “a”s in a string and measure time vs p (see the sketch below).
• This is why, for a fixed problem size, increasing the number of CPUs does not continually increase performance.
• This needs experimentation – problem specific.
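A minimal sketch of the counting exercise, assuming the string is available on every task and each rank scans only its own slice (the string and the splitting scheme are illustrative):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    const char *s = "a banana and an apple";   /* assumed input; in practice read it in or broadcast it */
    int rank, p, i, lo, hi, chunk, n, mycount = 0, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    n = strlen(s);
    chunk = (n + p - 1) / p;                   /* each task gets a contiguous slice */
    lo = rank * chunk;
    hi = (lo + chunk < n) ? lo + chunk : n;
    for (i = lo; i < hi; i++)
        if (s[i] == 'a') mycount++;

    /* combine the partial counts; only task 0 ends up with the answer */
    MPI_Reduce(&mycount, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("Number of 'a's = %d\n", total);

    MPI_Finalize();
    return 0;
}

Timing this while increasing p illustrates the point above: the per-task computation shrinks while the communication (the reduce) does not, so speedup eventually flattens.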

Page 26: Parallel Programming and MPI


Communication primitives
• MPI_Send(sendbuffer, senddatalength, datatype, destination, tag, communicator);
• MPI_Send(“Hello”, strlen(“Hello”), MPI_CHAR, 2, 100, MPI_COMM_WORLD);
• MPI_Recv(recvbuffer, recvdatalength, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
• Send and Recv happen in pairs.
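A minimal sketch of such a pair, assuming exactly two tasks (buffer size, tag and the message are illustrative):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char buf[32];
    int rank;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        strcpy(buf, "Hello");
        /* send 6 chars (including the '\0') to rank 1 with tag 100 */
        MPI_Send(buf, strlen(buf) + 1, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* the receive must match the send's destination, tag and communicator */
        MPI_Recv(buf, 32, MPI_CHAR, 0, 100, MPI_COMM_WORLD, &status);
        printf("Rank 1 received: %s\n", buf);
    }
    MPI_Finalize();
    return 0;
}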

Page 27: Parallel Programming and MPI


Collectives
• Broadcast is one-to-all communication
• Both receivers and the sender call the same function
• All MUST call it. All end up with the SAME result.
• MPI_Bcast(buffer, count, type, root, comm);
• Examples
 − MPI_Bcast(&k, 1, MPI_INT, 0, MPI_COMM_WORLD);
 − Task 0 sends its integer k and all others receive it.
 − MPI_Bcast(rowk, n, MPI_INT, current_owner_task, MPI_COMM_WORLD);
 − current_owner_task sends rowk to all the others.
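As a sketch, this second call is what fills the “MPI code will come here later” gap in the approximate Floyd code shown earlier (GET_BLOCK_OWNER and rowk are the placeholders from that slide):

    /* inside the k loop, after the owner has copied its row into rowk */
    current_owner_task = GET_BLOCK_OWNER(k);
    MPI_Bcast(rowk, n, MPI_INT, current_owner_task, MPI_COMM_WORLD);
    /* every task, the owner included, now holds the k-th row and can update its own block */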

Page 28: Parallel Programming and MPI


Try out a simple MPI program with send-recvs and broadcasts.
Try out Floyd’s algorithm.
What if you have to read a file to initialize Floyd’s algorithm?

Page 29: Parallel Programming and MPI


A bit more on Broadcast

[Figure: three tasks with ranks 0, 1 and 2 each call MPI_Bcast(&x,1,..,0,..). Before the call x holds 0, 1 and 2 respectively; after the call every task’s x holds rank 0’s value, 0.]

Page 30: Parallel Programming and MPI


Other useful collectives
• MPI_Reduce(&values, &results, count, type, operator, root, comm);
• MPI_Reduce(&x, &res, 1, MPI_INT, MPI_SUM, 9, MPI_COMM_WORLD);
• Task number 9 gets, in the variable res, the sum of whatever was in x in all of the tasks (including itself).
• Must be called by ALL tasks.
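A minimal sketch of the second call in context, assuming at least 10 tasks since the root here is rank 9 (each task contributes its own rank as x):

    int x, res;
    MPI_Comm_rank(MPI_COMM_WORLD, &x);     /* every task contributes its own rank */
    MPI_Reduce(&x, &res, 1, MPI_INT, MPI_SUM, 9, MPI_COMM_WORLD);
    /* after the call, only rank 9's res is meaningful: 0 + 1 + ... + (p-1) */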

Page 31: Parallel Programming and MPI


Scattering as opposed to broadcasting
• MPI_Scatterv(sndbuf, sndcount[], send_disp[], sendtype, recvbuf, recvcount, recvtype, root, comm);
• All nodes MUST call it.

[Figure: rank 0’s buffer is split into pieces that are delivered to ranks 0, 1, 2 and 3.]
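A minimal sketch with 4 tasks and an illustrative uneven split of a 10-element array held by rank 0:

    int sendbuf[10], recvbuf[10];
    int sendcounts[4] = {3, 3, 2, 2};      /* how many elements each rank receives */
    int displs[4]     = {0, 3, 6, 8};      /* where each rank's piece starts in sendbuf */
    int rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* every rank calls it; each one gets only its own piece into recvbuf */
    MPI_Scatterv(sendbuf, sendcounts, displs, MPI_INT,
                 recvbuf, sendcounts[rank], MPI_INT, 0, MPI_COMM_WORLD);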

Page 32: Parallel Programming and MPI


Common Communication pitfalls!!

• Make sure that communication primitives are called by the right number of tasks.

• Make sure they are called in the right sequence.

• Make sure that you use the proper tags.

• If not, you can easily get into deadlock (“My program seems to be hung”)

Page 33: Parallel Programming and MPI


More on work breakdown
• Finding the right work breakdown can be challenging
• Sometimes a dynamic work breakdown is good
• A master (usually task 0) decides who will do what and collects the results.
• E.g., you have a huge number of 5x5 matrices to multiply (chained matrix multiplication).
• E.g., search for a substring in a huge collection of strings.

Page 34: Parallel Programming and MPI


Master-slave dynamic work assignment

[Figure: task 0 is the master; tasks 1–4 are the slaves it assigns work to.]

Page 35: Parallel Programming and MPI


Master slave example – Reverse strings

Slave()
{
    do {
        /* the work always comes from the master, rank 0 */
        MPI_Recv(&work, MAX, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
        n = strlen(work);
        if (n == 0) break;        /* detecting the end */
        reverse(work);
        MPI_Send(&work, n+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } while (1);
    MPI_Finalize();
}

Page 36: Parallel Programming and MPI


Master slave example – Reverse strings

Master()                              /* rank 0 task */
{
    initialize_work_items();
    for (i = 1; i < np; i++) {        /* Initial work distribution */
        work = next_work_item();
        n = strlen(work) + 1;
        MPI_Send(&work, n, MPI_CHAR, i, 0, MPI_COMM_WORLD);
    }
    unfinished_work = np - 1;         /* number of slaves still working */
    while (unfinished_work != 0) {
        MPI_Recv(&res, MAX, MPI_CHAR, MPI_ANY_SOURCE, 0,
                 MPI_COMM_WORLD, &status);
        process(res);
        work = next_work_item();
        if (work == NULL) {
            /* no work left: send the empty string so this slave can stop */
            MPI_Send("", 1, MPI_CHAR, status.MPI_SOURCE, 0, MPI_COMM_WORLD);
            unfinished_work--;
        } else {
            n = strlen(work) + 1;
            MPI_Send(&work, n, MPI_CHAR, status.MPI_SOURCE, 0, MPI_COMM_WORLD);
        }
    }
}

Page 37: Parallel Programming and MPI


Master slave example

main()
{
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    if (id == 0)
        Master();
    else
        Slave();
    ...
}

Page 38: Parallel Programming and MPI


Matrix Multiply and Communication Patterns

Page 39: Parallel Programming and MPI


Block Distribution of Matrices
• Matrix Multiply:
 − Cij = Σk (Aik * Bkj)
• BMR Algorithm: each Aik, Bkj, Cij is a smaller block – a submatrix
• Each task owns a block – its own part of A, B and C
• The old formula holds for blocks!
• Example: C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31

Page 40: Parallel Programming and MPI


Block Distribution of Matrices
• Matrix Multiply:
 − Cij = Σk (Aik * Bkj)
• BMR Algorithm: each is a smaller block – a submatrix

C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31

• A22 is row-broadcast
• A22*B21 is added into C21
• B_1 is rolled up one slot
• Our task now has B31
• Now repeat the above block, except the item to broadcast is A23

Page 41: Parallel Programming and MPI


Attempt doing this with just Send-Recv and Broadcast.

Page 42: Parallel Programming and MPI


Communicators and Topologies
• The BMR example shows the limitations of broadcast.. although there is a pattern
• Communicators can be created on subgroups of processes.
• Communicators can be created that have a topology
 − Will make programming natural
 − Might improve performance by matching to hardware

Page 43: Parallel Programming and MPI


for (k = 0; k < s; k++) {
    sender = (my_row + k) % s;
    if (sender == my_col) {
        MPI_Bcast(&my_A, m*m, MPI_INT, sender, row_comm);
        T = my_A;
    }
    else
        MPI_Bcast(&T, m*m, MPI_INT, sender, row_comm);

    my_C = my_C + T x my_B;   /* block multiply-accumulate (approximate code) */

    MPI_Sendrecv_replace(my_B, m*m, MPI_INT, dest, 0,
                         source, 0, col_comm, &status);
}

Page 44: Parallel Programming and MPI


Creating topologies and communicators
• Creating a grid
• MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, istorus, canreorder, &grid_comm);
 − int dim_sizes[2], int istorus[2], int canreorder, MPI_Comm grid_comm
• Divide a grid into rows – each with its own communicator
• MPI_Cart_sub(grid_comm, free, &rowcomm);
 − MPI_Comm rowcomm; int free[2]
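A minimal sketch that sets up a 2 x 2 torus of tasks and then splits it into row communicators (the grid size is illustrative; it assumes the job was launched with 4 tasks):

    MPI_Comm grid_comm, rowcomm;
    int dim_sizes[2]  = {2, 2};            /* a 2 x 2 grid of tasks */
    int istorus[2]    = {1, 1};            /* wrap around in both dimensions (useful for the "roll" step) */
    int canreorder    = 1;                 /* let MPI renumber ranks to match the hardware */
    int free[2]       = {0, 1};            /* keep the column dimension: tasks in the same row stay together */

    MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, istorus, canreorder, &grid_comm);
    MPI_Cart_sub(grid_comm, free, &rowcomm);   /* one communicator per row of the grid */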

Page 45: Parallel Programming and MPI


Try implementing the BMR algorithm with communicators.

Page 46: Parallel Programming and MPI


A brief on other MPI Topics – The last leg
• MPI + multi-threading / OpenMP
• One-sided Communication
• MPI and IO

Page 47: Parallel Programming and MPI


MPI and OpenMP
• Grain
• Communication
• Where does the interesting pragma omp for fit in our MPI Floyd? (see the sketch below)
• How do I assign exactly one MPI task per CPU?
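One plausible placement, sketched on the approximate Floyd code from earlier: for a fixed k, the rows of a task's block are updated independently, so the i loop is a natural target for the pragma (this is an assumption about the intended answer, not stated on the slides):

    for (k = 0; k < n; k++) {
        /* ... the owner copies and broadcasts rowk, as before ... */
        #pragma omp parallel for private(j)
        for (i = 0; i < GET_MY_BLOCK_SIZE(id); i++)
            for (j = 0; j < n; j++)
                a[i][j] = min(a[i][j], a[i][k] + rowk[j]);
    }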

Page 48: Parallel Programming and MPI


One-Sided Communication
• No corresponding send-recv pairs! (see the sketch below)
• RDMA
• Get
• Put
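A minimal sketch of a put, using the MPI-2 window calls (the window layout and the value written are illustrative):

    int buf = 0;                           /* memory this task exposes to the others */
    int value = 42, rank;
    MPI_Win win;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* open an access epoch on all tasks */
    if (rank == 0)
        /* write 'value' into rank 1's buf; rank 1 posts no matching receive */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                 /* after the fence, rank 1 sees buf == 42 */

    MPI_Win_free(&win);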

Page 49: Parallel Programming and MPI


IO in Parallel Programs
• Typically a root task does the IO.
 − Simpler to program
 − Natural because of some post-processing occasionally needed (sorting)
 − All nodes generating IO requests might overwhelm the fileserver, essentially sequentializing it.
• Performance is not the limitation for Lustre/SFS.
• Parallel IO interfaces such as MPI-IO can make use of parallel filesystems such as Lustre.

Page 50: Parallel Programming and MPI


MPI-BLAST exec time vs other time [4]

Page 51: Parallel Programming and MPI


How IO/Comm Optimizations help MPI-BLAST [4]

Page 52: Parallel Programming and MPI


What did we learn?
• Distributed Memory Programming Model
• Parallel Algorithm Basics
• Work Breakdown
• Topologies in Communication
• Communication Overhead vs Computation
• Impact of Parallel IO

Page 53: Parallel Programming and MPI


What MPI Calls did we see here?
1. MPI_Init
2. MPI_Finalize
3. MPI_Comm_size
4. MPI_Comm_rank
5. MPI_Send
6. MPI_Recv
7. MPI_Sendrecv_replace
8. MPI_Bcast
9. MPI_Reduce
10. MPI_Cart_create
11. MPI_Cart_sub
12. MPI_Scatter

Page 54: Parallel Programming and MPI


References
1. Parallel Programming in C with MPI and OpenMP, M. J. Quinn, TMH. This is an excellent practical book; it motivated much of the material here, specifically Floyd’s algorithm.
2. The BMR algorithm for matrix multiply and the topology ideas are motivated by http://www.cs.indiana.edu/classes/b673/notes/matrix_mult.html
3. MPI online manual: http://www-unix.mcs.anl.gov/mpi/www/
4. Efficient Data Access for Parallel BLAST, IPDPS’05

