
USArray Data Processing Workshop
Big Iron and Parallel Processing

Scott Teige, PhD
July 30, 2009

July 30, 2009 USArray Data Processing Workshop

Overview
• How big is “Big Iron”?
• Where is it, what is it?
• One system, the details
• Parallelism, the way forward
• Scaling and what it means to you
• Programming techniques
• Examples
• Exercises

What is the TeraGrid?
• “… a nationally distributed cyberinfrastructure that provides leading edge computational and data services for scientific discovery through research and education…”
• A document exists in your training account home directories.

Some TeraGrid Systems

System      Site     Vendor   TF     TB
Kraken      NICS     Cray     608    128
Ranger      TACC     Sun      579    123
Abe         NCSA     Dell     89     9.4
Lonestar    TACC     Dell     62     11.6
Steele      Purdue   Dell     60     12.4
Queen Bee   LONI     Dell     50     5.3
Lincoln     NCSA     Dell     47     3.0
BigRed      IU       IBM      30     6.0

System Layout

System      Clock (GHz)   Cores
Kraken      2.30          66048
Ranger      2.66          62976
Abe         2.33          9600
Lonestar    2.66          5840
Steele      2.33          7144

Availability

System      Peak (TFLOPS)   Use    Idle (TF)
Kraken      608             96%    24.3
Ranger      579             91%    52.2
Abe         89              90%    8.9
Lonestar    62              92%    5.0
Steele      60              67%    19.8
Queen Bee   51              95%    2.5
Lincoln     48              4%     45.6
Big Red     31              83%    5.2

Research Cyberinfrastructure
The Big Picture:
• Compute
  Big Red (IBM e1350 Blade Center JS21)
  Quarry (IBM e1350 Blade Center HS21)
• Storage
  HPSS
  GPFS
  OpenAFS
  Lustre
  Lustre/WAN

High Performance Systems
• Big Red [TeraGrid System]
  30 TFLOPS IBM JS21 SuSE Cluster
  768 blades/3072 cores: 2.5 GHz PPC 970MP
  8GB Memory, 4 cores per blade
  Myrinet 2000
  LoadLeveler & Moab
• Quarry [Future TeraGrid System]
  7 TFLOPS IBM HS21 RHEL Cluster
  140 blades/1120 cores: 2.0 GHz Intel Xeon 5335
  8GB Memory, 8 cores per blade
  1Gb Ethernet (upgrading to 10Gb)
  PBS (Torque) & Moab

Data Capacitor (AKA Lustre)
High Performance Parallel File system
- ca. 1.2 PB spinning disk
- local and WAN capabilities

SC07 Bandwidth Challenge Winner
- moved 18.2 Gbps across a single 10 Gbps link

HPSS
• High Performance Storage System
• ca. 3 PB tape storage
• 75 TB front-side disk cache
• Ability to mirror data between IUPUI and IUB campuses

Serial vs. Parallel

Serial:
• Calculation
• Flow Control
• I/O

Parallel:
• Calculation
• Flow Control
• I/O
• Synchronization
• Communication

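Amdahl’s Law

A serial program spends a fraction F of its run time in work that can be parallelized and the remaining 1-F in work that cannot. On N processors the parallel part shrinks to F/N while the serial part stays at 1-F, so the speedup is

S=1/(1-F+F/N)

Special case, F=1: S=N, ideal scaling.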
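
To get a feel for the numbers, here is a small C sketch of the Amdahl formula S=1/(1-F+F/N). The function names amdahl_speedup, amdahl_limit and print_amdahl_table are made up for this illustration and are not part of the workshop material. With F=0.95, 8 processors give a speedup of about 5.9, and no number of processors can push it past 1/(1-F) = 20.

```c
#include <stdio.h>

/* Amdahl's law: speedup on n processors when a fraction f of the
   serial run time can be parallelized (0 <= f <= 1, n >= 1). */
double amdahl_speedup(double f, int n)
{
    return 1.0 / ((1.0 - f) + f / n);
}

/* Upper bound on the speedup as n grows without limit: 1/(1-f).
   Only meaningful for f < 1. */
double amdahl_limit(double f)
{
    return 1.0 / (1.0 - f);
}

/* Print the speedup for n = 1, 4, 16, ..., 1024 and one value of f. */
void print_amdahl_table(double f)
{
    int n;
    for (n = 1; n <= 1024; n *= 4)
        printf("F=%.2f  N=%4d  S=%7.2f\n", f, n, amdahl_speedup(f, n));
}
```

Calling print_amdahl_table(0.95) from a main() shows how quickly the curve flattens once the serial 5% dominates.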


Speed for various scaling rules

• “Paralyzable Process”: S = N e^{-(N-1)/q}
• “Superlinear Scaling”: S > N
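
For the paralyzable rule, a short derivation shows where adding processors starts to hurt. This is only a sketch: it treats N as a continuous variable, which is an assumption made for convenience.

```latex
\begin{align*}
S(N) &= N\,e^{-(N-1)/q} \\
\frac{dS}{dN} &= e^{-(N-1)/q}\left(1 - \frac{N}{q}\right) = 0
  \quad\Longrightarrow\quad N = q \\
S_{\max} &= S(q) = q\,e^{-(q-1)/q} \approx \frac{q}{e} \qquad (q \gg 1)
\end{align*}
```

Under this rule the speed peaks at N = q and falls as more processors are added beyond that point.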


MPI vs. OpenMP

MPI:
• MPI code may execute across many nodes
• Entire program is replicated for each core (sections may or may not execute)
• Variables not shared
• Typically requires structural modification to code

OpenMP:
• OpenMP code executes only on the set of cores sharing memory
• Sections of code may be parallel or serial
• Variables may be shared
• Incremental parallelization is easy

Other methods exist:
• Sockets
• Explicit shared memory calls/operations
• Pthreads
• None are recommended

export OMP_NUM_THREADS=8
icc mp_baby.c -openmp -o mp_baby
./mp_baby

#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    int iam = 0, np = 1;

    /* Fork */
    #pragma omp parallel default(shared) private(iam, np)
    {
#if defined (_OPENMP)
        np = omp_get_num_threads();
        iam = omp_get_thread_num();
#endif
        printf("Hello from thread %d out of %d\n", iam, np);
    }
    /* Join */
}

      PROGRAM DOT_PRODUCT

      INTEGER N, CHUNKSIZE, CHUNK, I
      PARAMETER (N=100)
      PARAMETER (CHUNKSIZE=10)
      REAL A(N), B(N), RESULT

!     Some initializations
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = I * 2.0
      ENDDO
      RESULT= 0.0
      CHUNK = CHUNKSIZE

!     Fork
!$OMP PARALLEL DO
!$OMP& DEFAULT(SHARED) PRIVATE(I)
!$OMP& SCHEDULE(STATIC,CHUNK)
!$OMP& REDUCTION(+:RESULT)

      DO I = 1, N
        RESULT = RESULT + (A(I) * B(I))
      ENDDO

!$OMP END PARALLEL DO
!     Join (NOWAIT is not permitted on END PARALLEL DO)

      PRINT *, 'Final Result= ', RESULT
      END

Synchronization Constructs
• MASTER: block executed only by master thread
• CRITICAL: block executed by one thread at a time
• BARRIER: each thread waits until all threads reach the barrier
• ORDERED: block executed sequentially by threads
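
A minimal C sketch of three of these constructs working together. The function name sum_with_critical is hypothetical and chosen only for this illustration. The pragmas are ignored when the file is compiled without OpenMP support, in which case the function simply runs serially and returns the same value.

```c
#include <stdio.h>

/* Sum the integers 1..n. Each thread accumulates a private partial sum,
   then adds it to the shared total inside a CRITICAL block. */
long sum_with_critical(int n)
{
    long total = 0;

    #pragma omp parallel default(none) shared(total, n)
    {
        long partial = 0;   /* declared inside the region, so private */
        int i;

        #pragma omp for
        for (i = 1; i <= n; i++)
            partial += i;

        /* CRITICAL: one thread at a time updates the shared total */
        #pragma omp critical
        total += partial;

        /* BARRIER: wait until every thread has added its share */
        #pragma omp barrier

        /* MASTER: only the master thread reports the result */
        #pragma omp master
        printf("total = %ld\n", total);
    }
    return total;
}
```

Without the CRITICAL block the updates to total would race. Without the BARRIER the master thread could print before the other threads had contributed.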


Data Scope Attribute Clauses
• SHARED: variable is shared across all threads
• PRIVATE: variable is replicated in each thread
• DEFAULT: change the default scoping of all variables in a region
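
As an illustration of these clauses, here is a C sketch of a dot product in which every variable is scoped explicitly. DEFAULT(NONE) makes the compiler reject any variable that has not been given a scope. The names dot_product and dot_product_demo are hypothetical. The demo fills the arrays with the same values as the Fortran DOT_PRODUCT example in this talk.

```c
#define N 100

/* Dot product with every variable explicitly scoped:
   a, b, n are shared (read-only here), i is private to each thread,
   and result is combined across threads by the reduction clause. */
double dot_product(const double *a, const double *b, int n)
{
    double result = 0.0;
    int i;

    #pragma omp parallel for default(none) shared(a, b, n) private(i) \
            reduction(+:result)
    for (i = 0; i < n; i++)
        result += a[i] * b[i];

    return result;
}

/* Fill a[i] = i+1 and b[i] = 2*(i+1), then take the dot product. */
double dot_product_demo(void)
{
    double a[N], b[N];
    int i;

    for (i = 0; i < N; i++) {
        a[i] = (i + 1) * 1.0;
        b[i] = (i + 1) * 2.0;
    }
    return dot_product(a, b, N);
}
```

Listing result as SHARED instead of using a reduction would compile but give a race on every update.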


Some Useful Library routines
• omp_set_num_threads(integer)
• omp_get_num_threads()
• omp_get_max_threads()
• omp_get_thread_num()
• Others are implementation dependent
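
A short C sketch that exercises the four routines listed here. The function name team_size is hypothetical. The code is guarded with _OPENMP so that it still compiles, and returns 1, when OpenMP is not enabled.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Request nthreads threads, then return how many threads a parallel
   region actually gets. Returns 1 when compiled without OpenMP. */
int team_size(int nthreads)
{
    int np = 1;
#ifdef _OPENMP
    omp_set_num_threads(nthreads);

    /* upper bound on the team size of the next parallel region */
    printf("max threads: %d\n", omp_get_max_threads());

    #pragma omp parallel default(none) shared(np)
    {
        /* only thread 0 records the team size, so there is no race */
        if (omp_get_thread_num() == 0)
            np = omp_get_num_threads();
    }
#else
    (void)nthreads;   /* unused in a serial build */
#endif
    return np;
}
```

Note that omp_get_num_threads() returns 1 when called outside a parallel region, which is why it is called inside one here.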


OpenMP Advice
• Always explicitly scope variables
• Never branch into/out of a parallel region
• Never put a barrier in an if block
• Quarry is at OpenMP version <3.0; the TASK construct, for example, is not available

Exercise: OpenMP
• The example programs are in ~/OMP_F_examples or ~/OMP_C_examples
• Go to https://computing.llnl.gov/tutorials/openMP/excercise.html
• Skip to step 4, compiler is “icc” or “ifort”
• There is no evaluation form

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int myrank;
int ntasks;

int main(int argc, char **argv)
{
    /* Initialize MPI */
    MPI_Init(&argc, &argv);

    /* get number of workers */
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* Find out my identity in the default communicator
       each task gets a unique rank between 0 and ntasks-1 */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    MPI_Barrier(MPI_COMM_WORLD);

    fprintf(stdout,"Hello from MPI_BABY=%d\n",myrank);
    MPI_Finalize();
    exit(0);
}

[Figure: tasks distributed across Node 1, Node 2, …]

mpicc mpi_baby.c -o mpi_baby

mpirun -np 8 mpi_baby

mpirun -np 32 -machinefile my_list mpi_baby

C AUTHOR: Blaise Barney
      program scatter
      include 'mpif.h'

      integer SIZE
      parameter(SIZE=4)
      integer numtasks, rank, sendcount, recvcount, source, ierr
      real*4 sendbuf(SIZE,SIZE), recvbuf(SIZE)

C Fortran stores this array in column major order, so the
C scatter will actually scatter columns, not rows.
      data sendbuf /1.0, 2.0, 3.0, 4.0,
     &              5.0, 6.0, 7.0, 8.0,
     &              9.0, 10.0, 11.0, 12.0,
     &              13.0, 14.0, 15.0, 16.0 /

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)

      if (numtasks .eq. SIZE) then
         source = 1
         sendcount = SIZE
         recvcount = SIZE
         call MPI_SCATTER(sendbuf, sendcount, MPI_REAL, recvbuf,
     &        recvcount, MPI_REAL, source, MPI_COMM_WORLD, ierr)
         print *, 'rank= ',rank,' Results: ',recvbuf
      else
         print *, 'Must specify',SIZE,' processors. Terminating.'
      endif

      call MPI_FINALIZE(ierr)
      end

From the man page:

MPI_Scatter - Sends data from one task to all tasks in a group

…message is split into n equal segments, the ith segment is sent to the ith process in the group

Some Linux tricks to get more information:

man -w MPI
ls /N/soft/linux-rhel4-x86_64/openmpi/1.3.1/intel-64/share/man/man3

MPI_Abort
MPI_Allgather
MPI_Allreduce
MPI_Alltoall
...
MPI_Wait
MPI_Waitall
MPI_Waitany
MPI_Waitsome

mpicc --showme
/N/soft/linux-rhel4-x86_64/intel/cce/10.1.022/bin/icc \
  -I/N/soft/linux-rhel4-x86_64/openmpi/1.3.1/intel-64/include \
  -pthread -L/N/soft/linux-rhel4-x86_64/openmpi/1.3.1/intel-64/lib \
  -lmpi -lopen-rte -lopen-pal -ltorque -lnuma -ldl \
  -Wl,--export-dynamic -lnsl -lutil -ldl -Wl,-rpath -Wl,/usr/lib64

MPI cool stuff:
• Bi-directional communication
• Non-blocking communication
• User defined types
• Virtual topologies

MPI Advice
• Never put a barrier in an if block
• Use care with non-blocking communication, things can pile up fast

So, can I use MPI with OpenMP?
• Yes you can; extreme care is advised
• Some implementations of MPI forbid it
• You can get killed by “oversubscription” real fast; I’ve seen time increase like N^2
• But sometimes you must… some fftw libraries are OMP multithreaded, for example.

Exercise: MPI
• Examples are in ~/MPI_F_examples or ~/MPI_C_examples
• Go to https://computing.llnl.gov/tutorials/mpi/exercise.html
• Skip to step 6. MPI compilers are “mpif90” and “mpicc”, normal (serial) compilers are “ifort” and “icc”.
• Compile your code: “make all” (Overrides section 9)
• To run an mpi code: “mpirun -np 8 <exe>” …or…
• “mpirun -np 16 -machinefile <ask me> <exe>”
• Skip section 12
• There is no evaluation form.

Where were those again?
• https://computing.llnl.gov/tutorials/openMP/excercise.html
• https://computing.llnl.gov/tutorials/mpi/exercise.html

Acknowledgements

• This material is based upon work supported by the National Science Foundation under Grant Numbers 0116050 and 0521433. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation (NSF).

• This work was supported in part by the Indiana Metabolomics and Cytomics Initiative (METACyt). METACyt is supported in part by Lilly Endowment, Inc.

• This work was supported in part by the Indiana Genomics Initiative. The Indiana Genomics Initiative of Indiana University is supported in part by Lilly Endowment, Inc.

• This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University.
