Parallel Programming in C with the Message Passing Interface
Author: Michael J. Quinn
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Chapter 18: Combining MPI and OpenMP


Page 1:

Chapter 18
Combining MPI and OpenMP

Page 2:

Outline

Advantages of using both MPI and OpenMP
Case Study: Conjugate gradient method
Case Study: Jacobi method

Page 3:

C+MPI vs. C+MPI+OpenMP

[Figure: two diagrams of a cluster joined by an interconnection network. C + MPI: every processor runs its own single-threaded MPI process (P). C + MPI + OpenMP: each node runs one MPI process (P) whose threads (t) share the node's processors.]

Page 4:

Why C + MPI + OpenMP Can Execute Faster

Lower communication overhead
More portions of the program may be practical to parallelize
May allow more overlap of communication with computation
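
To illustrate the last point, here is a minimal sketch, not taken from the book, of one way a hybrid program can overlap a halo exchange with computation: nonblocking MPI calls post the boundary-row exchange, the OpenMP team updates interior rows while the messages are in flight, and the rows that depend on the halo are finished after MPI_Waitall. The function jacobi_sweep_overlapped is hypothetical; its variable names (u, w, my_rows, N) follow the Jacobi case study later in this chapter.

#include <mpi.h>

void jacobi_sweep_overlapped (int id, int p, int my_rows, int N,
                              double **u, double **w)
{
   MPI_Request req[4];
   int nreq = 0;
   int i, j;

   /* Post the halo exchange with the neighboring processes */
   if (id > 0) {
      MPI_Isend (u[1], N, MPI_DOUBLE, id-1, 0, MPI_COMM_WORLD, &req[nreq++]);
      MPI_Irecv (u[0], N, MPI_DOUBLE, id-1, 0, MPI_COMM_WORLD, &req[nreq++]);
   }
   if (id < p-1) {
      MPI_Isend (u[my_rows-2], N, MPI_DOUBLE, id+1, 0, MPI_COMM_WORLD, &req[nreq++]);
      MPI_Irecv (u[my_rows-1], N, MPI_DOUBLE, id+1, 0, MPI_COMM_WORLD, &req[nreq++]);
   }

   /* Update interior rows (those that do not touch the halo)
      while the messages are still in transit */
   #pragma omp parallel for private(j)
   for (i = 2; i < my_rows-2; i++)
      for (j = 1; j < N-1; j++)
         w[i][j] = (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]) / 4.0;

   /* Wait for the halo rows, then update the rows that need them */
   MPI_Waitall (nreq, req, MPI_STATUSES_IGNORE);
   for (j = 1; j < N-1; j++) {
      w[1][j] = (u[0][j] + u[2][j] + u[1][j-1] + u[1][j+1]) / 4.0;
      w[my_rows-2][j] = (u[my_rows-3][j] + u[my_rows-1][j] +
                         u[my_rows-2][j-1] + u[my_rows-2][j+1]) / 4.0;
   }
}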

Page 5:

Case Study: Conjugate Gradient

Conjugate gradient method solves Ax = b
In our program we assume A is dense
Methodology:
  Start with MPI program
  Profile functions to determine where most execution time is spent
  Tackle the most time-intensive function first

Page 6:

Result of Profiling MPI Program

Function                 1 CPU     8 CPUs
matrix_vector_product    99.55%    97.49%
dot_product               0.19%     1.06%
cg                        0.25%     1.44%

Clearly our focus needs to be on function matrix_vector_product

Page 7:

Code for matrix_vector_product

void matrix_vector_product (int id, int p,
   int n, double **a, double *b, double *c)
{
   int i, j;
   double tmp;          /* Accumulates sum */

   for (i = 0; i < BLOCK_SIZE(id,p,n); i++) {
      tmp = 0.0;
      for (j = 0; j < n; j++)
         tmp += a[i][j] * b[j];
      piece[i] = tmp;
   }
   new_replicate_block_vector (id, p,
      piece, n, (void *) c, MPI_DOUBLE);
}
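
Here BLOCK_SIZE comes from the block-decomposition macros used throughout Quinn's text, and piece is a process-local buffer holding this process's block of the result before new_replicate_block_vector gathers the full vector c onto every process. For reference, the macros are commonly defined along these lines (a sketch; the book's utility header is the authoritative source):

/* Block decomposition macros (as commonly defined in Quinn's
   utility header; consult the book's source for the exact version) */
#define BLOCK_LOW(id,p,n)   ((id)*(n)/(p))
#define BLOCK_HIGH(id,p,n)  (BLOCK_LOW((id)+1,p,n)-1)
#define BLOCK_SIZE(id,p,n)  (BLOCK_HIGH(id,p,n)-BLOCK_LOW(id,p,n)+1)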

Page 8:

Adding OpenMP Directives

Want to minimize fork/join overhead by making parallel the outermost possible loop
Outer loop may be executed in parallel if each thread has a private copy of tmp and j

#pragma omp parallel for private(j,tmp)
for (i = 0; i < BLOCK_SIZE(id,p,n); i++) {
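
Applied to the loop from page 7, the directive sits immediately above the outer loop; the loop body is unchanged, and each thread handles its own rows with private copies of j and tmp:

#pragma omp parallel for private(j,tmp)
for (i = 0; i < BLOCK_SIZE(id,p,n); i++) {
   tmp = 0.0;
   for (j = 0; j < n; j++)
      tmp += a[i][j] * b[j];
   piece[i] = tmp;
}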

Page 9:

User Control of Threads

Want to give user opportunity to specify number of active threads per process
Add a call to omp_set_num_threads to function main
Argument comes from command line

omp_set_num_threads (atoi(argv[3]));
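
A minimal sketch of where the call might sit in main; the use of argv[3] matches the slide, while the surrounding argument handling is hypothetical:

#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
   int id, p;

   MPI_Init (&argc, &argv);
   MPI_Comm_rank (MPI_COMM_WORLD, &id);
   MPI_Comm_size (MPI_COMM_WORLD, &p);

   /* Third command-line argument selects the number of threads per
      process (argv[1] and argv[2] are assumed to carry problem inputs) */
   omp_set_num_threads (atoi(argv[3]));

   /* ... set up the system, run the conjugate gradient solver ... */

   MPI_Finalize ();
   return 0;
}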

Page 10:

What Happened?

We transformed a C+MPI program to a C+MPI+OpenMP program by adding only two lines to our program!

Page 11:

Benchmarking

Target system: a commodity cluster with four dual-processor nodes
C+MPI program executes on 1, 2, ..., 8 CPUs
  On 1, 2, 3, 4 CPUs, each process on a different node, maximizing memory bandwidth per CPU
C+MPI+OpenMP program executes on 1, 2, 3, 4 processes
  Each process has two threads
  C+MPI+OpenMP program executes on 2, 4, 6, 8 threads

Page 12:

Results of Benchmarking

Page 13:

Analysis of Results

C+MPI+OpenMP program slower on 2, 4 CPUs because C+MPI+OpenMP threads are sharing memory bandwidth, while C+MPI processes are not
C+MPI+OpenMP program faster on 6, 8 CPUs because it has lower communication cost

Page 14:

Case Study: Jacobi Method

Begin with C+MPI program that uses the Jacobi method to solve the steady-state heat distribution problem of Chapter 13
Program based on rowwise block-striped decomposition of a two-dimensional matrix containing the finite difference mesh

Page 15:

Methodology

Profile execution of C+MPI program
Focus on adding OpenMP directives to most compute-intensive function

Page 16:

Result of Profiling

Function             1 CPU     8 CPUs
initialize_mesh       0.01%     0.03%
find_steady_state    98.48%    93.49%
print_solution        1.51%     6.48%

Page 17:

Function find_steady_state (1/2)

its = 0;
for (;;) {
   /* Exchange halo (boundary) rows with neighboring processes */
   if (id > 0)
      MPI_Send (u[1], N, MPI_DOUBLE, id-1, 0,
         MPI_COMM_WORLD);
   if (id < p-1) {
      MPI_Send (u[my_rows-2], N, MPI_DOUBLE, id+1,
         0, MPI_COMM_WORLD);
      MPI_Recv (u[my_rows-1], N, MPI_DOUBLE, id+1,
         0, MPI_COMM_WORLD, &status);
   }
   if (id > 0)
      MPI_Recv (u[0], N, MPI_DOUBLE, id-1, 0,
         MPI_COMM_WORLD, &status);

Page 18:

Function find_steady_state (2/2)

   /* Update the mesh and track the largest change */
   diff = 0.0;
   for (i = 1; i < my_rows-1; i++)
      for (j = 1; j < N-1; j++) {
         w[i][j] = (u[i-1][j] + u[i+1][j] +
            u[i][j-1] + u[i][j+1])/4.0;
         if (fabs(w[i][j] - u[i][j]) > diff)
            diff = fabs(w[i][j] - u[i][j]);
      }
   /* Copy the new values back into u */
   for (i = 1; i < my_rows-1; i++)
      for (j = 1; j < N-1; j++)
         u[i][j] = w[i][j];
   MPI_Allreduce (&diff, &global_diff, 1,
      MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
   if (global_diff <= EPSILON) break;
   its++;

Page 19:

Making Function Parallel (1/2)

Except for two initializations and a return statement, function is a big for loop
Cannot execute for loop in parallel
  Not in canonical form (OpenMP's for directive applies only to simple counted loops such as for (i = start; i < end; i++))
  Contains a break statement
  Contains calls to MPI functions
  Data dependences between iterations

Page 20:

Making Function Parallel (2/2)

Focus on first for loop indexed by i
How to handle multiple threads testing/updating diff?
  Putting if statement in a critical section would increase overhead and lower speedup
Instead, create private variable tdiff
Thread tests tdiff against diff before call to MPI_Allreduce

Page 21:

Modified Function

diff = 0.0;
#pragma omp parallel private (i, j, tdiff)
{
   tdiff = 0.0;
#pragma omp for
   for (i = 1; i < my_rows-1; i++)
      ...
#pragma omp for nowait
   for (i = 1; i < my_rows-1; i++)
      ...
#pragma omp critical
   if (tdiff > diff) diff = tdiff;
}
MPI_Allreduce (&diff, &global_diff, 1,
   MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

Page 22:

Making Function Parallel (3/3)

Focus on second for loop indexed by i
Copies elements of w to corresponding elements of u: no problem with executing in parallel
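
Putting the pieces together, here is a sketch of the complete parallel region with the two elided loop bodies filled in from the sequential code on page 18; this is a reconstruction, not the book's verbatim listing. The update loop accumulates into the thread-private tdiff, and the copy loop carries nowait because nothing after it inside the region depends on u.

diff = 0.0;
#pragma omp parallel private (i, j, tdiff)
{
   tdiff = 0.0;

   /* Each thread updates its share of rows, tracking its own maximum change */
#pragma omp for
   for (i = 1; i < my_rows-1; i++)
      for (j = 1; j < N-1; j++) {
         w[i][j] = (u[i-1][j] + u[i+1][j] +
            u[i][j-1] + u[i][j+1])/4.0;
         if (fabs(w[i][j] - u[i][j]) > tdiff)
            tdiff = fabs(w[i][j] - u[i][j]);
      }

   /* Copy w back into u; nowait skips the barrier at the end of this loop */
#pragma omp for nowait
   for (i = 1; i < my_rows-1; i++)
      for (j = 1; j < N-1; j++)
         u[i][j] = w[i][j];

   /* Fold the per-thread maxima into the shared diff */
#pragma omp critical
   if (tdiff > diff) diff = tdiff;
}
MPI_Allreduce (&diff, &global_diff, 1,
   MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);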

Page 23:

Benchmarking

Target system: a commodity cluster with four dual-processor nodes
C+MPI program executes on 1, 2, ..., 8 CPUs
  On 1, 2, 3, 4 CPUs, each process on a different node, maximizing memory bandwidth per CPU
C+MPI+OpenMP program executes on 1, 2, 3, 4 processes
  Each process has two threads
  C+MPI+OpenMP program executes on 2, 4, 6, 8 threads

Page 24:

Benchmarking Results

Page 25:

Analysis of Results

Hybrid C+MPI+OpenMP program uniformly faster than C+MPI program
Computation/communication ratio of hybrid program is superior
Number of mesh points per element communicated is twice as high per node for the hybrid program: on 8 CPUs, each hybrid process holds twice as many mesh rows as an MPI-only process, yet still exchanges at most two halo rows per iteration
Lower communication overhead leads to 19% better speedup on 8 CPUs

Page 26:

Summary

Many contemporary parallel computers consist of a collection of multiprocessors
On these systems, performance of C+MPI+OpenMP programs can exceed performance of C+MPI programs
OpenMP enables us to take advantage of shared memory to reduce communication overhead
Often, conversion requires addition of relatively few pragmas