i nvestigate and p arallel p rocessing using e1350 ibm e s erver c luster ayaz ul hassan khan...

INVESTIGATE AND PARALLEL PROCESSING USING E1350 IBM ESERVER CLUSTERAyaz ul Hassan Khan (g201002860)

OBJECTIVES

Explore the architecture of E1350 IBM eServer Cluster

Parallel Programming: OpenMP MPI MPI+OpenMP

Analyzing the effects of above programming models on speedup

Finding out overheads and optimize as much as possible

IBM E1350 CLUSTER

CLUSTER SYSTEM The cluster is unique in its dual-boot capability with

Microsoft Windows HPC Server 2008 and Red Hat Enterprise Linux 5 operating systems.

The cluster has 3 master nodes, one for Red Hat Linux, one for Windows HPC Server 2008 and one for cluster management.

The cluster has 128 compute nodes. Each compute node of the cluster is dual-processor having

two 2.0 GHz x3550 Xeon Quad-core E5405 processors. The total number of cores in the cluster is 1024. Each master node has 1 TB of hard disk space and each

compute node has 500 GB of hard disk. Each master node has 8 GB of RAM. Each compute node has 4 GB of RAM. The interconnect is 10 GBASE-SR

EXPERIMENTAL ENVIRONMENT

Nodes: hpc081, hpc082, hpc083, hpc084 Compilers:

icc: for sequential and OpenMP programs mpiicc: for MPI and MPI+OpenMP programs

Profiling Tools: ompP: for OpenMP profiling mpiP: for MPI profiling

APPLICATIONS USED/IMPLEMENTED

Jacobi Iterative Method Max Speedup = 7.1 (OpenMP, Threads = 8) Max Speedup = 3.7 (MPI, Nodes = 4) Max Speedup = 9.3 (MPI+OpenMP, Nodes = 2,

Threads = 8) Alternating Direction Integration (ADI)

Max Speedup = 5.0 (OpenMP, Threads = 8) Max Speedup = 0.8 (MPI, Nodes = 1) Max Speedup = 1.7 (MPI+OpenMP, Nodes = 1,

Threads = 8)

JACOBI ITERATIVEMETHOD Solving systems of linear equations

JACOBI ITERATIVEMETHOD Sequential Codefor(i = 0; i < N; i++){

x[i] = b[i];}

for(i=0; i<N; i++){sum = 0.0;for(j=0; j<N; j++){

if(i != j){sum += a[i][j] * x[j];new_x[i] = (b[i] - sum)/a[i][i];

}for(i=0; i < N; i++)

x[i] = new_x[i];

128 256 384 512 640 7680

0.10.20.30.40.50.60.70.80.9

sequential

Space Size (N)

JACOBI ITERATIVEMETHOD OpenMP Code#pragma omp parallel private(k,i,j, sum){

for(k = 0; k < MAX_ITER; k++){#pragma omp for

for(i=0; i<N; i++){sum = 0.0;for(j=0; j<N; j++){

if(i != j){sum += a[i][j] * x[j];new_x[i] = (b[i] - sum)/a[i][i];

}#pragma omp for

for(i=0; i < N; i++)x[i] = new_x[i];

JACOBI ITERATIVEMETHOD OpenMP Performance

128 256 384 512 640 768012345678

OpenMP (barrier)

2-cores4-cores8-cores

Space Size (N)

Speedup

128 256 384 512 640 7680

OpenMP (nowait)

Space Size (N)

Speedup

128 256 384 512 640 7680

0.10.20.30.40.50.60.70.8

OpenMP (barrier)

Space Size (N)

128 256 384 512 640 7680

0.10.20.30.40.50.60.70.8

OpenMP (nowait)

Space Size (N)

JACOBI ITERATIVEMETHOD ompP results (barrier)

R00002 jacobi_openmp.c (46-55) LOOP TID execT execC bodyT exitBarT taskT 0 0.09 100 0.07 0.01 0.00 1 0.08 100 0.07 0.00 0.00 2 0.08 100 0.07 0.01 0.00 3 0.08 100 0.07 0.01 0.00 4 0.08 100 0.07 0.01 0.00 5 0.08 100 0.07 0.01 0.00 6 0.08 100 0.07 0.01 0.00 7 0.08 100 0.07 0.01 0.00 SUM 0.65 800 0.59 0.06 0.00

JACOBI ITERATIVEMETHOD ompP results (nowait)

JACOBI ITERATIVEMETHOD MPI CodeMPI_Scatter(a, N * N/P, MPI_DOUBLE, apart, N * N/P, MPI_DOUBLE, 0, MPI_COMM_WORLD);MPI_Bcast(x, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

for(i=myrank*N/P, k=0; k<N/P; i++, k++)bpart[k] = x[i];

for(k = 0; k < MAX_ITER; k++){for(i=0; i<N/P; i++){

sum = 0.0;for(j=0; j<N; j++){

index = i+((N/P)*myrank);if(index != j){

sum += apart[i][j] * x[j];new_x[i] = (bpart[i] - sum)/apart[i][index];

}MPI_Allgather(new_x, N/P, MPI_DOUBLE, x, N/P, MPI_DOUBLE, MPI_COMM_WORLD);}

JACOBI ITERATIVEMETHOD MPI Performance

128 256 384 512 640 7680

1-node2-nodes4-nodes

Space Size (N)

Speedup

128 256 384 512 640 7680

102030405060708090

Space Size (N)

JACOBI ITERATIVEMETHOD mpiP results

---------------------------------------------------------------------------@--- Aggregate Time (top twenty, descending, milliseconds) -------------------------------------------------------------------------------------------Call Site Time App% MPI% COVAllgather 1 60.1 6.24 19.16 0.00Allgather 2 58.8 6.11 18.77 0.00Allgather 3 57.3 5.96 18.29 0.00Scatter 4 34.6 3.59 11.03 0.00Scatter 3 31.8 3.30 10.14 0.00Scatter 1 30.1 3.13 9.61 0.00Scatter 2 27 2.81 8.62 0.00Bcast 2 7.05 0.73 2.25 0.00Allgather 4 4.33 0.45 1.38 0.00Bcast 3 2.25 0.23 0.72 0.00Bcast 1 0.083 0.01 0.03 0.00Bcast 4 0.029 0.00 0.01 0.00

JACOBI ITERATIVEMETHOD MPI+OpenMP CodeMPI_Scatter(a, N * N/P, MPI_DOUBLE, apart, N * N/P, MPI_DOUBLE, 0, MPI_COMM_WORLD);MPI_Bcast(x, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);for(i=myrank*N/P, k=0; k<N/P; i++, k++)

bpart[k] = x[i];omp_set_num_threads(T);#pragma omp parallel private(k, i, j, index){for(k = 0; k < MAX_ITER; k++){#pragma omp for

for(i=0; i<N/P; i++){sum = 0.0;for(j=0; j<N; j++){

index = i+((N/P)*myrank);if(index != j){

sum += apart[i][j] * x[j];new_x[i] = (bpart[i] - sum)/apart[i][index];

}#pragma omp master{

MPI_Allgather(new_x, N/P, MPI_DOUBLE, x, N/P, MPI_DOUBLE, MPI_COMM_WORLD);}}}

JACOBI ITERATIVEMETHOD MPI+OpenMP Performance

128 256 384 512 640 7680123456789

MPI+OpenMP

Space Size (N)

Speedup

128 256 384 512 640 7680

MPI+OpenMP

Space Size (N)

128 256 384 512 640 7680

102030405060708090

MPI+OpenMP

Space Size (N)

JACOBI ITERATIVEMETHOD ompP results

R00002 jacobi_mpi_openmp.c (55-65) LOOP TID execT execC bodyT exitBarT taskT 0 0.03 100 0.02 0.01 0.00 1 0.24 100 0.02 0.23 0.00 2 0.24 100 0.02 0.22 0.00 3 0.24 100 0.02 0.22 0.00 4 0.24 100 0.02 0.22 0.00 5 0.24 100 0.02 0.22 0.00 6 0.24 100 0.02 0.22 0.00 7 0.24 100 0.02 0.22 0.00 SUM 1.72 800 0.15 1.56 0.00

R00003 jacobi_mpi_openmp.c (67-70) MASTER TID execT execC 0 0.22 100 SUM 0.22 100

JACOBI ITERATIVEMETHOD mpiP results

---------------------------------------------------------------------------@--- Aggregate Time (top twenty, descending, milliseconds) -------------------------------------------------------------------------------------------Call Site Time App% MPI% COVScatter 8 34.7 9.62 14.11 0.00Allgather 1 32.6 9.05 13.28 0.00Scatter 6 31.3 8.70 12.76 0.00Scatter 2 30.2 8.39 12.31 0.00Allgather 3 29.9 8.30 12.18 0.00Allgather 5 27.6 7.67 11.25 0.00Scatter 4 27.1 7.51 11.02 0.00Allgather 7 22.1 6.14 9.00 0.00Bcast 4 7.12 1.98 2.90 0.00Bcast 6 2.81 0.78 1.14 0.00Bcast 2 0.09 0.02 0.04 0.00Bcast 8 0.033 0.01 0.01 0.00

ADI Alternating Direction Integration

-= * /

ADI Sequential Code//////ADI forward & backword sweep along rows//////for (i = 0; i < N; i++){

for (j = 1; j < N; j++){x[i][j] = x[i][j]-x[i][j-1]*a[i][j]/b[i][j-1];b[i][j]= b[i][j] - a[i][j]*a[i][j]/b[i][j-1];

}x[i][N-1] = x[i][N-1]/b[i][N-1];

}for (i = 0; i < N; i++)

for (j = N-2; j > 1; j--)x[i][j]=(x[i][j]-a[i][j+1]*x[i][j+1])/b[i][j];

////// ADI forward & backward sweep along columns//////for (j = 0; j < N; j++){

for (i = 1; i < N; i++){x[i][j] = x[i][j]-x[i-1][j]*a[i][j]/b[i-1][j];b[i][j]= b[i][j] - a[i][j]*a[i][j]/b[i-1][j];

}x[N-1][j] = x[N-1][j]/b[N-1][j];

}for (j = 0; j < N; j++)

for (i = N-2; i > 1; i--)x[i][j]=(x[i][j]-a[i+1][j]*x[i+1][j])/b[i][j];

128 256 384 512 640 7680

sequential

Space Size (N)

ADI OpenMP Code#pragma omp parallel private(iter){for(iter = 1; iter <= MAXITER; iter++){//////ADI forward & backword sweep along rows//////#pragma omp for private(i,j) nowaitfor (i = 0; i < N; i++){

for (j = 1; j < N; j++){x[i][j] = x[i][j]-x[i][j-1]*a[i][j]/b[i][j-1];b[i][j]= b[i][j] - a[i][j]*a[i][j]/b[i][j-1];

}x[i][N-1] = x[i][N-1]/b[i][N-1];

}#pragma omp for private(i,j)for (i = 0; i < N; i++)

for (j = N-2; j > 1; j--)x[i][j]=(x[i][j]-a[i][j+1]*x[i][j+1])/b[i][j];

////// ADI forward & backward sweep along columns//////#pragma omp for private(i,j) nowaitfor (j = 0; j < N; j++){

for (i = 1; i < N; i++){x[i][j] = x[i][j]-x[i-1][j]*a[i][j]/b[i-1][j];b[i][j]= b[i][j] - a[i][j]*a[i][j]/b[i-1][j];

}x[N-1][j] = x[N-1][j]/b[N-1][j];

}#pragma omp for private(i,j)for (j = 0; j < N; j++)

for (i = N-2; i > 1; i--)x[i][j]=(x[i][j]-a[i+1][j]*x[i+1][j])/b[i][j];

ADI OpenMP Performance

128 256 384 512 640 7680

OpenMP

Space Size (N)

Speedup

128 256 384 512 640 768012345678

OpenMP

Space Size (N)

ADI ompP results

R00002 adi_openmp.c (43-50) LOOP TID execT execC bodyT exitBarT taskT 0 0.18 100 0.18 0.00 0.00 1 0.18 100 0.18 0.00 0.00 2 0.18 100 0.18 0.00 0.00 3 0.18 100 0.18 0.00 0.00 4 0.18 100 0.18 0.00 0.00 5 0.18 100 0.18 0.00 0.00 6 0.18 100 0.18 0.00 0.00 7 0.18 100 0.18 0.00 0.00 SUM 1.47 800 1.47 0.00 0.00

ADI MPI CodeMPI_Bcast(a, N * N, MPI_FLOAT, 0, MPI_COMM_WORLD);MPI_Scatter(x, N * N/P, MPI_FLOAT, xpart, N * N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);MPI_Scatter(b, N * N/P, MPI_FLOAT, bpart, N * N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);

for(i=myrank*(N/P), k=0; k<N/P; i++, k++)for(j=0;j<N;j++)

apart[k][j] = a[i][j];

for(iter = 1; iter <= 2*MAXITER; iter++){//////ADI forward & backword sweep along rows//////for (i = 0; i < N/P; i++){

for (j = 1; j < N; j++){xpart[i][j] = xpart[i][j]-xpart[i][j-1]*apart[i][j]/bpart[i][j-1];bpart[i][j]= bpart[i][j] - apart[i][j]*apart[i][j]/bpart[i][j-1];

}xpart[i][N-1] = xpart[i][N-1]/bpart[i][N-1];

}for (i = 0; i < N/P; i++){

for (j = N-2; j > 1; j--)xpart[i][j]=(xpart[i][j]-apart[i][j+1]*xpart[i][j+1])/bpart[i]

ADI MPI CodeMPI_Gather(xpart, N*N/P, MPI_FLOAT, x, N*N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);MPI_Gather(bpart, N*N/P, MPI_FLOAT, b, N*N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);

//transpose matricestrans(x, N, N);trans(b, N, N);trans(a, N, N);

MPI_Scatter(x, N * N/P, MPI_FLOAT, xpart, N * N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);MPI_Scatter(b, N * N/P, MPI_FLOAT, bpart, N * N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);

for(i=myrank*(N/P), k=0; k<N/P; i++, k++)for(j=0;j<N;j++)

apart[k][j] = a[i][j];}

ADI MPI Performance

128 256 384 512 640 7680

0.10.20.30.40.50.60.70.80.9

Space Size (N)

Speedup

128 256 384 512 640 7680

Space Size (N)

ADI mpiP results

---------------------------------------------------------------------------@--- Aggregate Time (top twenty, descending, milliseconds) -------------------------------------------------------------------------------------------Call Site Time App% MPI% COVGather 1 8.63e+04 22.83 23.54 0.00Gather 3 6.29e+04 16.63 17.15 0.00Gather 2 6.08e+04 16.10 16.60 0.00Gather 4 5.83e+04 15.43 15.91 0.00Scatter 4 3.31e+04 8.76 9.03 0.00Scatter 2 3.08e+04 8.14 8.39 0.00Scatter 3 2.87e+04 7.58 7.81 0.00Scatter 1 5.53e+03 1.46 1.51 0.00Bcast 2 50.8 0.01 0.01 0.00Bcast 4 50.8 0.01 0.01 0.00Bcast 3 49.5 0.01 0.01 0.00Bcast 1 40.4 0.01 0.01 0.00Reduce 1 2.57 0.00 0.00 0.00Reduce 3 0.259 0.00 0.00 0.00Reduce 2 0.056 0.00 0.00 0.00Reduce 4 0.052 0.00 0.00 0.00

ADI MPI+OpenMP CodeMPI_Bcast(a, N * N, MPI_FLOAT, 0, MPI_COMM_WORLD);MPI_Scatter(x, N * N/P, MPI_FLOAT, xpart, N * N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);MPI_Scatter(b, N * N/P, MPI_FLOAT, bpart, N * N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);

omp_set_num_threads(T);

#pragma omp parallel private(iter){int id, sindex, eindex;int m,n;id = omp_get_thread_num();

sindex = id * node_rows/T;eindex = sindex + node_rows/T;int l = myrank*(N/P);

for(m=sindex; m<eindex; m++){for(n=0;n<N;n++)

apart[m][n] = a[l+m][n];l++;

ADI MPI+OpenMP Codefor(iter = 1; iter <= 2*MAXITER; iter++){//////ADI forward & backword sweep along rows//////#pragma omp for private(i,j) nowaitfor (i = 0; i < N/P; i++){

for (j = 1; j < N; j++){xpart[i][j] = xpart[i][j]-xpart[i][j-1]*apart[i][j]/bpart[i][j-1];bpart[i][j]= bpart[i][j] - apart[i][j]*apart[i][j]/bpart[i][j-1];

}xpart[i][N-1] = xpart[i][N-1]/bpart[i][N-1];

#pragma omp for private(i,j)for (i = 0; i < N/P; i++)

for (j = N-2; j > 1; j--)xpart[i][j]=(xpart[i][j]-apart[i][j+1]*xpart[i][j+1])/bpart[i][j];

#pragma omp master{

MPI_Gather(xpart, N*N/P, MPI_FLOAT, x, N*N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);MPI_Gather(bpart, N*N/P, MPI_FLOAT, b, N*N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);

#pragma omp barrier

ADI MPI+OpenMP Code#pragma omp sections{

#pragma omp section{ trans(x, N, N); }#pragma omp section{ trans(b, N, N); }#pragma omp section{ trans(a, N, N); }

}#pragma omp barrier

#pragma omp master{MPI_Scatter(x, N * N/P, MPI_FLOAT, xpart, N * N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);MPI_Scatter(b, N * N/P, MPI_FLOAT, bpart, N * N/P, MPI_FLOAT, 0, MPI_COMM_WORLD);}

l = myrank*(N/P);for(m=sindex; m<eindex; m++){

for(n=0;n<N;n++)apart[m][n] = a[l+m][n];

l++;}}#pragma omp barrier}

ADI MPI+OpenMP Performance

128 256 384 512 640 7680

0.20.40.60.8

11.21.41.61.8

MPI+OpenMP

Space Size (N)

Speedup

128 256 384 512 640 7680

100200300400500600700800900

MPI+OpenMP

Space Size (N)

128 256 384 512 640 7680

MPI+OpenMP

Space Size (N)

ADI ompP results

R00002 adi_mpi_scatter_openmp.c (89-96) LOOP TID execT execC bodyT exitBarT taskT 0 0.05 200 0.05 0.00 0.00 1 0.05 200 0.05 0.00 0.00 2 0.08 200 0.08 0.00 0.00 3 0.08 200 0.08 0.00 0.00 4 0.08 200 0.08 0.00 0.00 5 0.08 200 0.08 0.00 0.00 6 0.08 200 0.08 0.00 0.00 7 0.08 200 0.08 0.00 0.00 SUM 0.58 1600 0.58 0.00 0.00

R00003 adi_mpi_scatter_openmp.c (99-104) LOOP TID execT execC bodyT exitBarT taskT 0 0.06 200 0.05 0.01 0.00 1 34.23 200 0.05 34.18 0.00 2 34.22 200 0.05 34.17 0.00 3 34.22 200 0.05 34.17 0.00 4 34.21 200 0.05 34.16 0.00 5 34.20 200 0.05 34.15 0.00 6 34.21 200 0.05 34.16 0.00 7 34.20 200 0.05 34.15 0.00 SUM 239.54 1600 0.39 239.14 0.00

ADI ompP results

R00005 adi_mpi_scatter_openmp.c (113) BARRIER TID execT execC taskT 0 0.00 200 0.00 1 64.29 200 0.00 2 64.29 200 0.00 3 64.29 200 0.00 4 64.29 200 0.00 5 64.29 200 0.00 6 64.29 200 0.00 7 64.29 200 0.00 SUM 450.02 1600 0.00

R00004 adi_mpi_scatter_openmp.c (106-111) MASTER TID execT execC 0 64.28 200 SUM 64.28 200

R00006 adi_mpi_scatter_openmp.c (116-130) SECTIONS TID execT execC sectT sectC exitBarT mgmtT taskT 0 0.85 200 0.85 200 0.00 0.00 0.00 1 0.85 200 0.83 200 0.02 0.00 0.00 2 0.85 200 0.44 200 0.41 0.00 0.00 3 0.85 200 0.00 0 0.85 0.00 0.00 4 0.85 200 0.00 0 0.85 0.00 0.00 5 0.85 200 0.00 0 0.85 0.00 0.00 6 0.85 200 0.00 0 0.85 0.00 0.00 7 0.85 200 0.00 0 0.85 0.00 0.00 SUM 6.80 1600 2.12 600 4.67 0.01 0.00

ADI ompP results

R00008 adi_mpi_scatter_openmp.c (134-138) MASTER TID execT execC 0 34.46 200 SUM 34.46 200

ADI mpiP results

---------------------------------------------------------------------------@--- Aggregate Time (top twenty, descending, milliseconds) -------------------------------------------------------------------------------------------Call Site Time App% MPI% COVGather 2 8.98e+04 23.32 23.52 0.00Gather 6 6.57e+04 17.05 17.19 0.00Gather 8 6.45e+04 16.74 16.89 0.00Gather 4 6.17e+04 16.03 16.16 0.00Scatter 4 3.39e+04 8.79 8.87 0.00Scatter 8 3.1e+04 8.06 8.13 0.00Scatter 6 2.96e+04 7.68 7.75 0.00Scatter 2 5.4e+03 1.40 1.41 0.00Bcast 7 49.5 0.01 0.01 0.00Bcast 3 49.3 0.01 0.01 0.00Bcast 5 47.8 0.01 0.01 0.00Bcast 1 40 0.01 0.01 0.00Scatter 1 30.5 0.01 0.01 0.00Scatter 5 30.3 0.01 0.01 0.00Scatter 7 30.3 0.01 0.01 0.00Scatter 3 28.8 0.01 0.01 0.00Reduce 1 1.8 0.00 0.00 0.00Reduce 5 0.062 0.00 0.00 0.00Reduce 3 0.049 0.00 0.00 0.00Reduce 7 0.049 0.00 0.00 0.00

THANKS

Q & A Any Suggestions?

i nvestigate and p arallel p rocessing using e1350 ibm e s erver c luster ayaz ul hassan khan...

openmp mpi mpi openmp

j acobi

j nowait

adi openmp performance

adi openmp code

mpi openmp programs

mpi profiling slide

jbij slide

Documents

pgl4 luciferase reporter vectors - promega...revised 11/20...

erver diagnostics

s upermicro- s erver

20 -...

practice labs for vmware - certiport · vmware i 5.5 i...

erver: log-in procedure for utside source interlink...

introduction to j ava s erver p ages technology by naomi...

i nit o ffice s erver - ios solução desenvolvida para...

telematics user with video erver watching video clip clips...

avepoint casestudy roche diagnostics jp...windows server...

サーバベース・コンバータ v2.0 （ s erver b...

c lou d s erver cab inet · 2015-04-10 · • 17664...

to he db enves erver - denver astrothis month's featured...

hpe reference configuration: exchange 2016, …exchange...

· mente y con un alto nivel de aceptación de los ......

nvironment anager ersonalization erver best practice guide

installing microsoft® sql server 2012 for wonderware...

microsoft it...

b e a r esponsible s eller/ s erver ( bars ) training...

n ode.j s s erver s ide j avascript diana roiswati (...