Page 1: Introduction to MPI, OpenMP, Threads

Introduction to MPI, OpenMP, Threads

Gyan Bhanot

[email protected], [email protected]

IAS Course 10/12/04 and 10/13/04

Page 2: Introduction to MPI, OpenMP, Threads

Download tar file from clustermgr.csb.ias.edu: ~gyan/course/all.tar.gz

It has many MPI codes plus .doc files with information on optimization and parallelization for the IAS cluster.

Page 6: Introduction to MPI, OpenMP, Threads

P655 Cluster

Type: qcpu to get machine specs

Page 7: Introduction to MPI, OpenMP, Threads

IAS Cluster Characteristics (qcpu, pmcycles)

IBM P655 cluster. Each node has its own copy of AIX, IBM's Unix OS.

clustermgr: 2-CPU PWR4, 64 KB L1 instruction cache, 32 KB L1 data cache, 128 B L1 data cache line size, 1536 KB L2 cache, data TLB: size = 1024, associativity = 4, instruction TLB: size = 1024, associativity = 4, freq = 1200 MHz.

node1 to node6: 8 CPUs/node, PWR4 P655, 64 KB L1 instruction cache, 32 KB L1 data cache, 128 B data cache line size, 1536 KB L2 cache, data TLB: size = 1024, associativity = 4, instruction TLB: size = 1024, associativity = 4, freq = 1500 MHz.

Distributed-memory architecture, with shared memory within each node. Shared file system: GPFS, with lots of disk space.

Run pingpong tests to determine Latency and Bandwidth

Page 35: Introduction to MPI, OpenMP, Threads

/*----------------------*/
/* Parallel hello world */
/*----------------------*/
#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char * argv[])
{
    int taskid, ntasks;
    double pi;

    /*------------------------------------*/
    /* establish the parallel environment */
    /*------------------------------------*/
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /*------------------------------------*/
    /* say hello from each MPI task       */
    /*------------------------------------*/
    printf("Hello from task %d.\n", taskid);

    if (taskid == 0) pi = 4.0*atan(1.0);
    else             pi = 0.0;

    /*------------------------------------*/
    /* do a broadcast from node 0 to all  */
    /*------------------------------------*/
    MPI_Bcast(&pi, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    printf("node %d: pi = %.10lf\n", taskid, pi);

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return(0);
}
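On this cluster the later slides launch MPI jobs with poe (e.g. poe pi -procs 4 -hfile hf). A hedged guess at the equivalent build-and-run for this example, assuming IBM's mpcc compiler wrapper on AIX, would be:

    mpcc hello.c -o hello
    poe hello -procs 4 -hfile hf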

Page 36: Introduction to MPI, OpenMP, Threads

Hello from task 0.
node 0: pi = 3.1415926536
Hello from task 1.
Hello from task 2.
Hello from task 3.
node 1: pi = 3.1415926536
node 2: pi = 3.1415926536
node 3: pi = 3.1415926536

OUTPUT FROM hello.c on 4 processors

1. Why is the order messed up?
2. What would you do to fix it?

Page 37: Introduction to MPI, OpenMP, Threads

Answer:

1. The control flow on different processors is not ordered: each runs its own copy of the executable independently. When each task writes output, it does so independently of the others, which leaves the combined output unordered.

2. To fix it, set

export MP_STDOUTMODE=ordered

Then the output will look like the following:

Hello from task 0.
node 0: pi = 3.1415926536
Hello from task 1.
node 1: pi = 3.1415926536
Hello from task 2.
node 2: pi = 3.1415926536
Hello from task 3.
node 3: pi = 3.1415926536
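An alternative is to impose the ordering in the code itself. The sketch below (not from the original slides) has the tasks take turns printing in rank order, separated by barriers; in practice this usually gives ordered output, although the MPI standard does not strictly guarantee that forwarded stdout preserves the order.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char * argv[])
{
    int taskid, ntasks;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* tasks print one at a time, in rank order */
    for (int rank = 0; rank < ntasks; rank++) {
        if (taskid == rank) {
            printf("Hello from task %d.\n", taskid);
            fflush(stdout);             /* flush before the next task's turn  */
        }
        MPI_Barrier(MPI_COMM_WORLD);    /* wait until task 'rank' has printed */
    }

    MPI_Finalize();
    return 0;
}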

Page 38: Introduction to MPI, OpenMP, Threads

Pingpong Code on 4 procs of P655 cluster

/* This program times blocking send/receives, and reports the */
/* latency and bandwidth of the communication system. It is   */
/* designed to run with an even number of MPI tasks.          */
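The full source is in the tar file. A minimal sketch of the same idea is shown below (an assumed structure, not the original code): task 0 sends a message to task 1, task 1 echoes it back, and half of the averaged round-trip time is reported for one example message size.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char * argv[])
{
    int taskid, ntasks, reps = 1000;
    int msglen = 32000;                 /* bytes per message (example size) */
    char *buf;
    double t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    buf = (char *) malloc(msglen);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (taskid == 0) {
            MPI_Send(buf, msglen, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msglen, MPI_CHAR, 1, 1, MPI_COMM_WORLD, &status);
        } else if (taskid == 1) {
            MPI_Recv(buf, msglen, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, msglen, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    /* one-way time per message, in msec */
    if (taskid == 0)
        printf("msglen = %d bytes, elapsed time = %.4lf msec\n",
               msglen, 1000.0*(t1 - t0)/(2.0*reps));

    free(buf);
    MPI_Finalize();
    return 0;
}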

Page 39: Introduction to MPI, OpenMP, Threads

msglen =   32000 bytes, elapsed time = 0.3494 msec
msglen =   40000 bytes, elapsed time = 0.4000 msec
msglen =   48000 bytes, elapsed time = 0.4346 msec
msglen =   56000 bytes, elapsed time = 0.4490 msec
msglen =   64000 bytes, elapsed time = 0.5072 msec
msglen =   72000 bytes, elapsed time = 0.5504 msec
msglen =   80000 bytes, elapsed time = 0.5503 msec
msglen =  100000 bytes, elapsed time = 0.6499 msec
msglen =  120000 bytes, elapsed time = 0.7484 msec
msglen =  140000 bytes, elapsed time = 0.8392 msec
msglen =  160000 bytes, elapsed time = 0.9485 msec
msglen =  240000 bytes, elapsed time = 1.2639 msec
msglen =  320000 bytes, elapsed time = 1.5975 msec
msglen =  400000 bytes, elapsed time = 1.9967 msec
msglen =  480000 bytes, elapsed time = 2.3739 msec
msglen =  560000 bytes, elapsed time = 2.7295 msec
msglen =  640000 bytes, elapsed time = 3.0754 msec
msglen =  720000 bytes, elapsed time = 3.4746 msec
msglen =  800000 bytes, elapsed time = 3.7441 msec
msglen = 1000000 bytes, elapsed time = 4.6994 msec

latency = 50.0 microseconds
bandwidth = 212.79 MBytes/sec
(approximate values for MPI_Isend/MPI_Irecv/MPI_Waitall)

3. How do you find the Bandwidth and Latency from this data?

Page 40: Introduction to MPI, OpenMP, Threads

[Figure: time vs. bytes sent with MPI_ISEND on 4 nodes of the P655 — elapsed time in msec (0.01 to 10, log scale) against message size in bytes (1 to 10,000,000, log scale).]

Y = X/B + L : B = Bandwidth (Bytes/sec), L = Latency
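One way to answer question 3 is to fit the model Y = X/B + L to the measured (msglen, time) pairs: the slope of the line gives 1/B and the intercept gives L. The sketch below (not from the slides) does a least-squares fit over a few of the measured points. Note that the slide's quoted values (50 microseconds, 212.79 MBytes/sec) come from the full sweep including very small messages, so a fit over only these larger messages gives a good bandwidth but only a rough latency.

#include <stdio.h>

int main(void)
{
    /* a few of the measured points from the previous slide: bytes, msec */
    double x[] = {  32000,  64000, 160000, 320000, 640000, 1000000 };
    double t[] = { 0.3494, 0.5072, 0.9485, 1.5975, 3.0754,  4.6994 };
    int n = 6;
    double sx = 0.0, st = 0.0, sxx = 0.0, sxt = 0.0;

    for (int i = 0; i < n; i++) {
        sx  += x[i];
        st  += t[i];
        sxx += x[i]*x[i];
        sxt += x[i]*t[i];
    }

    double slope     = (n*sxt - sx*st) / (n*sxx - sx*sx);  /* msec per byte = 1/B   */
    double intercept = (st - slope*sx) / n;                /* msec, rough latency L */

    printf("latency   ~ %.1f microseconds (rough; better read off the smallest messages)\n",
           1000.0*intercept);
    printf("bandwidth ~ %.1f MBytes/sec\n", 1.0e-3/slope); /* bytes/msec -> MBytes/sec */
    return 0;
}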

Page 41: Introduction to MPI, OpenMP, Threads

5. Monte Carlo to Compute π

Main Idea

• Consider a unit square with an embedded circle
• Generate random points inside the square
• Out of N trials, m points fall inside the circle
• Then π ~ 4m/N
• Error ~ 1/√N
• Simple to parallelize
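A one-line justification of the error estimate (not spelled out on the slide): each dart lands inside the circle with probability p = π/4, so m is binomial with variance Np(1-p). The estimate 4m/N therefore has standard deviation 4*sqrt(p(1-p)/N) = sqrt(π(4-π)/N) ~ 1.64/sqrt(N), i.e. the error shrinks like 1/√N. The MPI code shown later uses a similar, slightly more conservative estimate, est_error = pi_estimate/sqrt(total_hits).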

Page 42: Introduction to MPI, OpenMP, Threads

Modeling Method:

[Figure: the unit square, 0 to 1 on each axis, with the embedded circle.]

THROW MANY DARTS

FRACTION INSIDE CIRCLE = π/4

Page 43: Introduction to MPI, OpenMP, Threads

MPI PROGRAM DEFINES WORKING NODES

Page 44: Introduction to MPI, OpenMP, Threads

EACH NODE COMPUTES ESTIMATE OF PI INDEPENDENTLY

Page 45: Introduction to MPI, OpenMP, Threads

NODE 0 COMPUTES AVERAGES AND WRITES OUTPUT

Page 46: Introduction to MPI, OpenMP, Threads

#include <stdio.h>
#include <math.h>
#include <mpi.h>
#include "MersenneTwister.h"

void mcpi(int, int, int);
int monte_carlo(int, int);

//=========================================
// Main Routine
//=========================================
int main(int argc, char * argv[])
{
    int ntasks, taskid, nworkers;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);

    if (taskid == 0) {
        printf(" #cpus #trials pi(est) err(est) err(abs) time(s) Mtrials/s\n");
    }

    /*--------------------------------------------------*/
    /* do monte-carlo with a variable number of workers */
    /*--------------------------------------------------*/
    for (nworkers = ntasks; nworkers >= 1; nworkers = nworkers/2) {
        mcpi(nworkers, taskid, ntasks);
    }

    MPI_Finalize();
    return 0;
}

Page 47: Introduction to MPI, OpenMP, Threads

//============================================================
// Routine to split tasks into groups and distribute the work
//============================================================
void mcpi(int nworkers, int taskid, int ntasks)
{
    MPI_Comm comm;
    int worker, my_hits, total_hits, my_trials;
    int total_trials = 6400000;
    double tbeg, tend, elapsed, rate;
    double pi_estimate, est_error, abs_error;

    /*---------------------------------------------*/
    /* make a group consisting of just the workers */
    /*---------------------------------------------*/
    if (taskid < nworkers) worker = 1;
    else                   worker = 0;
    MPI_Comm_split(MPI_COMM_WORLD, worker, taskid, &comm);
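A note on the MPI_Comm_split call (not on the original slide): tasks that pass the same "color" (here the value of worker) are grouped into the same new communicator, and ranks inside it are ordered by the "key" (here taskid). For example, with 4 tasks and nworkers = 2, tasks 0 and 1 form the worker communicator comm (ranks 0 and 1 within it), while tasks 2 and 3 form a separate communicator that simply sits out this round.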

Page 48: Introduction to MPI, OpenMP, Threads

    if (worker) {
        /*------------------------------------------*/
        /* divide the work among all of the workers */
        /*------------------------------------------*/
        my_trials = total_trials / nworkers;

        MPI_Barrier(comm);
        tbeg = MPI_Wtime();

        /* each worker gets a unique seed, and works independently */
        my_hits = monte_carlo(taskid, my_trials);

        /* add the hits from each worker to get total_hits */
        MPI_Reduce(&my_hits, &total_hits, 1, MPI_INT, MPI_SUM, 0, comm);

        tend = MPI_Wtime();
        elapsed = tend - tbeg;
        rate = 1.0e-6*double(total_trials)/elapsed;

        /* report the results including elapsed times and rates */
        if (taskid == 0) {
            pi_estimate = 4.0*double(total_hits)/double(total_trials);
            est_error = pi_estimate/sqrt(double(total_hits));
            abs_error = fabs(M_PI - pi_estimate);
            printf("%6d %9d %9.5lf %9.5lf %9.5lf %8.3lf %9.2lf\n",
                   nworkers, total_trials, pi_estimate, est_error,
                   abs_error, elapsed, rate);
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);
}

Page 49: Introduction to MPI, OpenMP, Threads

//=========================================
// Monte Carlo worker routine: return hits
//=========================================
int monte_carlo(int taskid, int trials)
{
    int hits = 0;
    int xseed, yseed;
    double xr, yr;

    xseed = 1    * (taskid + 1);
    yseed = 1357 * (taskid + 1);

    MTRand xrandom( xseed );
    MTRand yrandom( yseed );

    for (int t = 0; t < trials; t++) {
        xr = xrandom();
        yr = yrandom();
        if ( (xr*xr + yr*yr) < 1.0 ) hits++;
    }
    return hits;
}

Page 50: Introduction to MPI, OpenMP, Threads

Run code in ~gyan/course/src/mpi/pi

poe pi -procs 4 -hfile hf

using one node, many processors

 #cpus   #trials  pi(est)  err(est)  err(abs)  time(s)  Mtrials/s  Speedup
     4   6400000  3.14130   0.00140   0.00029    0.134      47.77    3.98
     2   6400000  3.14144   0.00140   0.00016    0.267      23.96    1.997
     1   6400000  3.14187   0.00140   0.00027    0.533      12.00    1.0
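Speedup is measured relative to the single-cpu run, time(1 cpu)/time(n cpus): e.g. 0.533/0.134 ≈ 3.98 for 4 cpus and 0.533/0.267 ≈ 1.997 for 2 cpus.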

Page 51: Introduction to MPI, OpenMP, Threads

Run code in ~gyan/course/src/mpi/pi

poe pi -procs 4 -hfile hf

using many nodes, one processor per node

 #cpus   #trials  pi(est)  err(est)  err(abs)  time(s)  Mtrials/s  Speedup
     4   6400000  3.14130   0.00140   0.00029    0.270      23.75    1.98
     2   6400000  3.14144   0.00140   0.00016    0.536      11.94    0.99
     1   6400000  3.14187   0.00140   0.00027    0.534      12.00    1.00

Page 52: Introduction to MPI, OpenMP, Threads

Generic Parallelization Problem

• You are given a problem with N = K x L x M variables distributed among P = xyz processors in block form (each processor has KLM/xyz variables). Each variable does F flops before B bytes/variable are communicated from each face of the block to the near-neighbor processor on the processor grid. Let the processing speed be f flops/s with a compute efficiency c. The computations and communications do not overlap. Let the latency and bandwidth be d and b respectively, with a communication efficiency of e.

• a. Write a formula for the time between communication events.
• b. Write a formula for the time between computation events.
• c. Let K = L = M = N^(1/3) and x = y = z = P^(1/3). Write a formula for the efficiency
  E = [time to compute]/[time to compute + time to communicate].
  Explore this as a function of F, B, P and N. Use c ~ e ~ 1. To make this relevant to BG/L use d = 10 microsec, b = 1.4 Gb/s, f = 2.8 GFlops/s.

Page 53: Introduction to MPI, OpenMP, Threads

Solution to Generic Parallelization Problem

• A_compute = amount of computation per processor = NF/P
• t_compute = A_compute/(fc) = NF/(Pcf)
• Amount of data communicated by each processor = A_communicate = B[KL/(xy) + LM/(yz) + MK/(zx)]
• t_communicate = 6d + A_communicate/(be)
  = 6d + (B/(be))[KL/(xy) + LM/(yz) + MK/(zx)]
  = 6d + 6(B/(be))[N/P]^(2/3) for the symmetric case
  (the factor 6 counts the six faces of each processor's block, two in each grid direction; 6d is six message latencies)
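Filling in the step to the efficiency formula on the next slide: E = t_compute/(t_compute + t_communicate) = 1/[1 + t_communicate/t_compute], and with x = P/N,

t_communicate/t_compute = [6d + 6(B/(be))(N/P)^(2/3)] * (Pcf)/(NF)
                        = 6d(fc/F)(P/N) + 6(c/e)*(fB/(Fb))*(P/N)^(1/3)

since (N/P)^(2/3) * (P/N) = (P/N)^(1/3). This gives E = 1/[1 + 6d(fc/F)x + 6(c/e)*(fB/(Fb))x^(1/3)].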

Page 54: Introduction to MPI, OpenMP, Threads

E = 1/[1 + 6d(fc/F)x + 6(c/e)*(fB/Fb)x^(1/3)], where x = P/N

1. Weak Scaling: x = P/N is held constant as P → ∞.
Latency dominates when the second (latency) term in the denominator is the biggest:
x > 5e-6 (F/c) and x > 5e-6 (B/e)^(3/2), using the BG/L parameters.

a. F = B = 1 (transaction processing): latency bound for N/P < 200,000
b. F = 100, B = 1: latency bound if N/P < 2,000
c. F = 1000, B = 100: latency bound if N/P < 200

Page 55: Introduction to MPI, OpenMP, Threads

c = e = 0.5; b = 1.4 Gb/s, f = 2.8 GF/s (BG/L parameters)

E = 1/[1 + 6d(fc/F)[P/N] + 6(c/e)*(fB/Fb)[P/N]^(1/3)]

[Figure: Weak scaling, F = B. Efficiency (0.001 to 1, log scale) vs. N/P (1 to 1,000,000, log scale) for F = 1, 10, 100. The left edge (N/P = 1) corresponds to nodes = variables; the right edge to few nodes with many variables per node.]
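The curves in these weak-scaling plots can be reproduced by evaluating E directly. The sketch below (not part of the original slides) tabulates E versus N/P for a few values of F with B = F, using the parameters quoted above; it treats b = 1.4 Gb/s as 0.175 GBytes/s when forming fB/(Fb), which is an assumption about the intended units.

#include <stdio.h>
#include <math.h>

/* Evaluate E = 1/[1 + 6d(fc/F)x + 6(c/e)(fB/(Fb))x^(1/3)] with x = P/N. */
static double efficiency(double NoverP, double F, double B,
                         double d, double b, double f, double c, double e)
{
    double x = 1.0 / NoverP;                         /* x = P/N */
    double latency_term   = 6.0 * d * (f * c / F) * x;
    double bandwidth_term = 6.0 * (c / e) * (f * B / (F * b)) * pow(x, 1.0/3.0);
    return 1.0 / (1.0 + latency_term + bandwidth_term);
}

int main(void)
{
    /* BG/L-like parameters from the slide; b converted to bytes/s (assumption) */
    double d = 10.0e-6;        /* latency, seconds          */
    double b = 1.4e9 / 8.0;    /* bandwidth, bytes/s        */
    double f = 2.8e9;          /* flops/s                   */
    double c = 0.5, e = 0.5;   /* compute/comm efficiencies */
    double Fvals[] = { 1.0, 10.0, 100.0 };

    printf("   N/P       F=1       F=10      F=100   (weak scaling, B = F)\n");
    for (double NoverP = 1.0; NoverP <= 1.0e6; NoverP *= 10.0) {
        printf("%8.0f", NoverP);
        for (int i = 0; i < 3; i++)
            printf("  %8.4f", efficiency(NoverP, Fvals[i], Fvals[i], d, b, f, c, e));
        printf("\n");
    }
    return 0;
}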

Page 56: Introduction to MPI, OpenMP, Threads

c = e = 0.5; b = 1.4 Gb/s, f = 2.8 GF/s (BG/L parameters)

E = 1/[1 + 6d(fc/F)[P/N] + 6(c/e)*(fB/Fb)[P/N]^(1/3)]

[Figure: Weak scaling, F = 10B. Efficiency (0.001 to 1, log scale) vs. N/P (1 to 1,000,000, log scale) for F = 10, 100, 1000. The left edge (N/P = 1) corresponds to nodes = variables; the right edge to few nodes with many variables per node.]

Page 57: Introduction to MPI, OpenMP, Threads

c = e = 0.5; b = 1.4 Gb/s, f = 2.8 GF/s (BG/L parameters)

E = 1/[1 + 6d(fc/F)[P/N] + 6(c/e)*(fB/Fb)[P/N]^(1/3)]

[Figure: Weak scaling, F = 100B. Efficiency (0.001 to 1, log scale) vs. N/P (1 to 1,000,000, log scale) for F = 100, 1000, 10000. The left edge (N/P = 1) corresponds to nodes = variables; the right edge to few nodes with many variables per node.]

Page 58: Introduction to MPI, OpenMP, Threads

Weak Scaling, F = 1000, B = 100: Compare Terms

E = 1/[1 + 6d(fc/F)[P/N] + 6(c/e)*(fB/Fb)[P/N]^(1/3)] = 1/[E1 + EL + EB]

[Figure: the three denominator terms E1 (compute), EL (latency) and EB (bandwidth) plotted against N/P (1 to 1,000,000, log scale; term values 0.001 to 100, log scale). The latency term EL dominates at small N/P, the bandwidth term EB matters at intermediate N/P, and the constant compute term E1 dominates at large N/P: latency dominated, then bandwidth affected, then compute dominated as N/P grows.]

Page 59: Introduction to MPI, OpenMP, Threads

Strong Scaling

• N fixed, P → ∞

• For any N, eventually P will win!

• Dominated by Latency and Bandwidth in Various Regimes

• Large Values of N scale better

Chart on the next slide.

Page 60: Introduction to MPI, OpenMP, Threads

Strong Scaling, B = 1, F = 100

[Figure: Efficiency E (0 to 1) vs. number of processors P (1 to 100,000, log scale) for N = 1,000,000 and N = 100,000. Efficiency falls off as P grows; the larger problem (N = 1,000,000) holds its efficiency out to larger P.]
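Tying this to the thresholds on page 54 (with F = 100, B = 1 the machine is latency bound for N/P < 2,000): the N = 1,000,000 curve should start to suffer once P exceeds roughly 500, and the N = 100,000 curve once P exceeds roughly 50.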

Lesson: In the Best of Cases, Strong Scaling is Very Hard to Achieve