Parallel Computing 3: Models of Parallel Computations
Ondřej Jakl, Institute of Geonics, Academy of Sci. of the CR



  • Parallel Computing 3 Models of Parallel Computations

    Ondřej Jakl

    Institute of Geonics, Academy of Sci. of the CR

  • Outline of the lecture

    Aspects of practical parallel programming
    Parallel programming models
    Data parallel
      High Performance Fortran
    Shared variables/memory
      compiler support: automatic/assisted parallelization
      OpenMP
      thread libraries
    Message passing

  • Parallel programming (1): Trade-off

    Primary goal: maximization of performance
      specific approaches are expected to be more efficient than universal ones
      considerable diversity in parallel hardware
      techniques/tools are much more dependent on the target platform than in sequential programming
      understanding the hardware makes it easier to get high performance out of programs
      back to the era of assembly programming?
    On the contrary, standard/portable/universal methods increase productivity in software development and maintenance

  • Parallel programming (2)

    Parallel programs are more difficult to write and debug than sequential ones
      parallel algorithms can in general be qualitatively different from the corresponding sequential ones
      changing the form of the code may not be enough
      several new classes of potential software bugs (e.g. race conditions)
      difficult debugging
      issues of scalability

  • General approaches

    Special programming language supporting concurrency
      theoretically advantageous, in practice not very popular
      ex.: Ada, Occam, Sisal, etc. (there are dozens of designs)
      language extensions: CC++, Fortran M, etc.
    Universal programming language (C, Fortran, ...) with a parallelizing compiler
      autodetection of parallelism in the sequential code
      easier for shared memory, limited efficiency
      a matter of the future? (despite 30 years of intense research)
      ex.: Forge90 for Fortran (1992), some standard compilers
    Universal programming language plus a library of external parallelizing functions
      the mainstream nowadays
      ex.: PVM (Parallel Virtual Machine), MPI (Message Passing Interface), Pthreads a. o.

  • Parallel programming models

    A parallel programming model is a set of software technologies to express parallel algorithms and match applications with the underlying parallel systems [Wikipedia]
    Considered models:
      data parallel [just introductory info in this course]
      shared variables/memory [related to the OpenMP lecture in part II of the course]
      message passing [continued in the next lecture (MPI)]

  • Data parallel model

  • Hardware requirements [Wikipedia]

    Assumed underlying hardware: multicomputer or multiprocessor
    originally associated with SIMD machines such as the CM-200
      multiple processing elements perform the same operation on multiple data simultaneously
      array processors

  • Data parallel model

    Based on the concept of applying the same operation (e.g. add 1 to every array element) to the elements of a data ensemble in parallel
      a set of tasks operates collectively on the same data structure (usually an array), each task on a different partition
    On multicomputers the data structure is split up and resides as chunks in the local memory of each task
    On multiprocessors, all tasks may have access to the data structure through global memory
    The tasks are loosely synchronized
      at the beginning and end of the parallel operations
    SPMD execution model

  • Characteristics

    Higher-level parallel programming
      data distribution and communication done by the compiler
      transfers low-level details from the programmer to the compiler
      the compiler converts the program into standard code with calls to a message passing library (usually MPI); all message passing is done invisibly to the programmer
    Ease of use
      simple to write, debug and maintain
      no explicit message passing
      single-threaded control (no spawn, fork, etc.)
    Restricted flexibility and control
      only suitable for certain applications
        data in large arrays
        similar independent operations on each element
        naturally load-balanced
      harder to get top performance
        reliant on good compilers

  • High Performance Fortran

    The best known representative of data parallel programming languages
      HPF version 1.0 in 1993 (extends Fortran 90), version 2.0 in 1997
    Extensions to Fortran 90 to support the data parallel model, including
      directives to tell the compiler how to distribute data
        DISTRIBUTE, ALIGN directives
        ignored as comments by serial Fortran compilers
      mathematical operations on array-valued arguments
      reduction operations on arrays
      FORALL construct
      assertions that can improve optimization of the generated code
        INDEPENDENT directive
      additional intrinsics and library routines
    Available e.g. in the Portland Group PGI Workstation package
    not frequently used

  • HPF data mapping example [Mozdren 2010]

          REAL A(12, 12)                          ! declaration
          REAL B(16, 16)                          ! of arrays
    !HPF$ TEMPLATE T(16, 16)                      ! and a template
    !HPF$ ALIGN B WITH T                          ! align B with T
    !HPF$ ALIGN A(i, j) WITH T(i+2, j+2)          ! align A with T and shift
    !HPF$ PROCESSORS P(2, 2)                      ! declare 2*2 processors
    !HPF$ DISTRIBUTE T(BLOCK, BLOCK) ONTO P       ! distribution of the arrays

  • Data parallel in MATLAB

    Parallel MATLAB (The MathWorks): Parallel Computing Toolbox, plus Distributed Computing Server for larger parallel environments
      released in 2004; increasing popularity
    Some features correspond to the data parallel model
      codistributed arrays: arrays partitioned into segments, each of which resides in the workspace of a different task
        allow handling larger data sets than in a single MATLAB session
        support for more than 150 MATLAB functions (e.g. finding eigenvalues) in much the same way as with regular arrays
      parallel FOR loop: loop iterations without enforcing their particular ordering
        distributes loop iterations over a set of tasks
        iterations must be independent of each other

        parfor i = (1:nsteps)
          x = i * step;
          s = s + (4 /(1 + x^2));
        end

  • Shared variables model

  • Hardware requirements [after Wilkinson 2004]

    Assumed underlying hardware: multiprocessor
      collection of processors that share common memory
      interconnection fabric supporting a single address space
    Not applicable to multicomputers
      but: Intel Cluster OpenMP
    Easier to apply than message passing
      allows incremental parallelization
    Based on the notion of threads

  • Thread vs. process (1)

  • Thread vs. process (2)

    A thread (lightweight process) differs from a (heavyweight) process:
      all threads in a process share the same memory space
      each thread has a thread-private area for its local variables, e.g. the stack
      threads can work on shared data structures
      threads can communicate with each other via the shared data
    Threads were originally not targeted at technical or HPC computing
      low level, task (rather than data) parallelism
    Details of the thread/process relationship are very OS dependent

  • Thread communication

    A parallel application generates, when appropriate, a set of cooperating threads
      usually one per processor
      distinguished by enumeration
    Shared memory provides the means to exchange data among threads
      shared data can be accessed by all threads
      no message passing necessary

  • Thread synchronization

    Threads execute their programs asynchronously
    Writes and reads are always nonblocking
    Accessing shared data needs careful control
      some mechanism is needed to ensure that the actions occur in the correct order
      e.g. a write of A in thread 1 must occur before its read in thread 2
    Most common synchronization constructs:
      master section: a section of code executed by one thread only
        e.g. initialisation, writing a file
      barrier: all threads must arrive at a barrier before any thread can proceed past it
        e.g. delimiting phases of computation (e.g. a timestep)
      critical section: only one thread at a time can enter a section of code
        e.g. modification of shared variables
    Makes shared-variables programming error-prone

  • Accessing shared data

    Consider two threads, each of which is to add 1 to a shared data item X, e.g. X = 10:
      1. read X
      2. compute X+1
      3. write X back
    If step 1 is performed at the same time by both threads, the result will be 11 (instead of the expected 12)
    Race condition: two or more threads (processes) are reading or writing shared data, and the result depends on who runs precisely when
      X = X + 1 must be an atomic operation
    Can be ensured by mechanisms of mutual exclusion
      e.g. critical section, mutex, lock, semaphore, monitor

  • Fork/Join parallelism [Quinn 2004]

    Initially only the master thread is active
      executes sequential code
    Basic operations:
      fork: the master thread creates / awakens additional threads to execute in a parallel region
      join: at the end of the parallel region the created threads die / are suspended
    Dynamic thread creation
      the number of active threads changes during execution
      fork is not an expensive operation
    A sequential program is a special / trivial case of a shared-memory parallel program

  • Computer realization [next slides]

    Compiler support: automatic parallelization, assisted parallelization
    OpenMP
    Thread libraries: POSIX threads, Windows threads

  • Automatic parallelization

    The code is instrumented automatically by the compiler, according to compilation flags and/or environment variables
    Parallelizes independent loops only
      processed by the prescribed number of parallel threads
    Usually provided by Fortran compilers for multiprocessors
      as a rule proprietary solutions
    Simple and sometimes fairly efficient
      applicable to programs with a simple structure
    Ex.:
      XL Fortran (IBM, AIX): -qsmp=auto option, XLSMPOPTS environment variable (the number of threads)
      Fortran (SUN, Solaris): -autopar flag, PARALLEL environment variable
      PGI C (Portland Group, Linux): -Mconcur flag

  • Assisted parallelization

    The programmer provides the compiler with additional information by adding compiler directives
      special lines of source code with meaning only to a compiler that understands them
      in the form of stylized Fortran comments or #pragma in C
      ignored by nonparallelizing compilers
    Assertive and prescriptive directives [next slides]
    Diverse formats of the parallelizing directives, but similar capabilities
      a standard was required

  • Assertive directives

    Hints that state facts that the compiler might not guess from the code itself
      evaluation is context dependent
    Ex.: XL Fortran (IBM, AIX)
      no dependencies (the references in the loop do not overlap, parallelization possible): !SMP$ ASSERT (NODEPS)
      trip count (average number of iterations of the loop; helps to decide whether to unroll or parallelize the loop): !SMP$ ASSERT (INTERCNT(100))

  • Prescriptive directives

    Instructions for the parallelizing compiler, which it must obey
      clauses may specify additional information
    A means for manual parallelization
    Ex.: XL Fortran (IBM, AIX)
      parallel region: defines a block of code that can be executed by a team of threads concurrently
        !SMP$ PARALLEL ... !SMP$ END PARALLEL
      parallel loop: enables the programmer to specify which loops the compiler should parallelize
        !SMP$ PARALLEL DO ... !SMP$ END PARALLEL DO
    Besides directives, additional constructs within the base language can be introduced to express parallelism
      e.g. the forall statement in Fortran 95

  • OpenMP [More in a special lecture]

    API for writing portable multithreaded applications based on the shared variables model
      the master thread spawns a team of threads as needed
      relatively high level (compared to thread libraries)
    A standard developed by the OpenMP Architecture Review Board
      http://www.openmp.org
      first specification in 1997
    A set of compiler directives and library routines
      language interfaces for Fortran, C and C++
      OpenMP-like interfaces for other languages (e.g. Java)
    Parallelism can be added incrementally
      i.e. the sequential program evolves into a parallel program
      single source code for both the sequential and parallel versions
    OpenMP compilers available on most platforms (Unix, Windows, etc.)

  • Thread libraries

    A collection of routines to create, manage, and coordinate threads
    Main representatives: POSIX threads (Pthreads), Windows threads (Windows (Win32) API)
    Explicit threading is not primarily intended for parallel programming
      low level, quite complex coding

  • Example: PI calculation

    Calculation of π by the numerical integration formula, based on the rectangle method:

      set n (number of strips)
      for each strip
        calculate the height y of the strip (rectangle) at its midpoint
        sum all y to the result S
      end for
      multiply S by the width of the strips
      print result

  • PI in Windows threads (1)

    /* Pi, Win32 API */
    #include <windows.h>
    #include <stdio.h>
    #define NUM_THREADS 2
    HANDLE thread_handles[NUM_THREADS];
    CRITICAL_SECTION hUpdateMutex;
    static long num_steps = 100000;
    double step, global_sum = 0.0;

    void Pi (void *arg)
    {
       int i, start;
       double x, sum = 0.0;

       start = *(int *)arg;
       step = 1.0 / (double)num_steps;

       for (i = start; i <= num_steps; i += NUM_THREADS) {
          x = (i - 0.5) * step;
          sum += 4.0 / (1.0 + x*x);
       }
       EnterCriticalSection(&hUpdateMutex);
       global_sum += sum;
       LeaveCriticalSection(&hUpdateMutex);
    }

  • PI in Windows threads (2)

    void main ()
    {
       double pi;
       int i;
       DWORD threadID;
       int threadArg[NUM_THREADS];

       for (i = 0; i < NUM_THREADS; i++) threadArg[i] = i + 1;
       InitializeCriticalSection(&hUpdateMutex);
       for (i = 0; i < NUM_THREADS; i++) {
          thread_handles[i] = CreateThread(0, 0,
             (LPTHREAD_START_ROUTINE) Pi, &threadArg[i], 0, &threadID);
       }
       WaitForMultipleObjects(NUM_THREADS, thread_handles, TRUE, INFINITE);
       pi = global_sum * step;
       printf(" pi is %f \n", pi);
    }

  • PI in Pthreads (1)

    /* Pi, pthreads library */
    #define _REENTRANT
    #include <stdio.h>
    #include <stdlib.h>
    #include <pthread.h>
    #define NUM_THREADS 2
    pthread_t thread_handles[NUM_THREADS];
    pthread_mutex_t hUpdateMutex;
    pthread_attr_t attr;
    static long num_steps = 100000;
    double step, global_sum = 0.0;

    void* Pi (void *arg)
    {
       int i, start;
       double x, sum = 0.0;

       start = *(int *)arg;
       step = 1.0 / (double)num_steps;

       for (i = start; i <= num_steps; i += NUM_THREADS) {
          x = (i - 0.5) * step;
          sum += 4.0 / (1.0 + x*x);
       }
       pthread_mutex_lock(&hUpdateMutex);
       global_sum += sum;
       pthread_mutex_unlock(&hUpdateMutex);
       return NULL;
    }
  • PI in Pthreads (2)

    void main ()
    {
       double pi;
       int i;
       int retval;
       pthread_t threadID;
       int threadArg[NUM_THREADS];

       pthread_attr_init(&attr);
       pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
       pthread_mutex_init(&hUpdateMutex, NULL);

       for (i = 0; i < NUM_THREADS; i++) threadArg[i] = i + 1;
       for (i = 0; i < NUM_THREADS; i++) {
          retval = pthread_create(&threadID, NULL, Pi, &threadArg[i]);
          thread_handles[i] = threadID;
       }
       for (i = 0; i < NUM_THREADS; i++)
          pthread_join(thread_handles[i], NULL);
       pi = global_sum * step;
       printf(" pi is %f \n", pi);
    }
  • PI in OpenMP (1)

    /* Pi, OpenMP, using parallel for and reduction */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>
    #define NUM_THREADS 2
    static long num_steps = 1000000;
    double step;

    void main ()
    {
       int i;
       double x, pi, sum = 0.0;

       step = 1.0 / (double)num_steps;
       omp_set_num_threads(NUM_THREADS);

  • PI in OpenMP (2)

       #pragma omp parallel for reduction(+:sum) private(x)
       for (i = 1; i <= num_steps; i++) {
          x = (i - 0.5) * step;
          sum += 4.0 / (1.0 + x*x);
       }
       pi = sum * step;
       printf("Pi is %.10f \n", pi);
    }

    NB: Programs such as the PI calculation are likely to be successfully parallelized through automatic parallelization as well

  • Message passing model

  • Hardware requirements [Quinn 2004]

    Assumed underlying hardware: multicomputer
      collection of processors, each with its own local memory
      interconnection network supporting message transfer between every pair of processors
    Supported by all (parallel) architectures
      the most general model
      naturally fits multicomputers
      easily implemented on multiprocessors
    Complete control: data distribution and communication
    May not be easy to apply
      the sequential-to-parallel transformation requires a major effort
      one giant step rather than many tiny steps
      message passing = the assembler of parallel computing

  • Message passing

    A parallel application generates (next slide) a set of cooperating processes
      process = an instance of a running program
      usually one per processor
      distinguished by a unique ID number: rank (MPI), tid (PVM), etc.
    To solve a problem, the processes alternately perform computations and exchange messages
      basic operations: send, receive
      no shared memory space necessary
    Messages transport the contents of variables of one process to variables of another process
    Message passing also has a synchronization function

  • Process creation [Wilkinson 2004]

    Static process creation
      fixed number of processes in time
      specified before the execution (e.g. on the command line)
      usually the processes follow the same code, but their control paths through the code can differ depending on the ID
        SPMD (Single Program Multiple Data) model
      one master process (ID 0), several slave processes
    Dynamic process creation
      varying number of processes in time
      just one process at the beginning
      processes can create (destroy) other processes: the spawn operation
        rather expensive!
      the processes often differ in code
        MPMD (Multiple Program Multiple Data) model

  • Point-to-point communication

    Exactly two processes are involved
    One process (sender / source) sends a message and another process (receiver / destination) receives it
      active participation of processes on both sides is usually required
        two-sided communication
    In general, the source and destination processes operate asynchronously
      the source may complete sending a message long before the destination gets around to receiving it
      the destination may initiate receiving a message that has not yet been sent
    The order of messages is guaranteed (they do not overtake each other)
    Examples of technical issues
      handling several messages waiting to be received
      sending complex data structures
      using message buffers
      send and receive routines: blocking vs. nonblocking
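The basic send/receive pairing can be sketched in pseudocode (hypothetical send/receive routines, SPMD style, processes distinguished by my_id):

```
if my_id == 0 then                 -- sender / source
    A = 42
    send(A, dest = 1)
else if my_id == 1 then            -- receiver / destination
    receive(B, source = 0)         -- may block until the message arrives
    -- B now holds the sender's value of A
end if
```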

  • (Non-)blocking & (a-)synchronous

    Blocking operation: only returns (from the subroutine call) when the operation has completed
      ex.: sending a fax on a standard machine
    Nonblocking operation: returns immediately; the operation need not be completed yet, and other work may be performed in the meantime
      the completion of the operation can/must be tested
      ex.: sending a fax on a machine with memory
    Synchronous send: does not complete until the message has been received
      provides (synchronizing) information about the message delivery
      ex.: sending a fax (on a standard machine)
    Asynchronous send: completes as soon as the message is on its way
      the sender only knows when the message has left
      ex.: sending a letter

  • Collective communication

    Transfer of data in a set of processes
    Provided by most message passing systems
    Basic operations [next slides]:
      barrier: synchronization of processes
      broadcast: one-to-many communication of the same data
      scatter: one-to-many communication of different portions of data
      gather: many-to-one communication of (different, but related) data
      reduction: gather plus combination of the data with an arithmetic/logical operation
    Root: in some collective operations, the single prominent source / destination
      e.g. in broadcast
    Collective operations could be built as a set of point-to-point operations, but these blackbox routines
      hide a lot of the messy details
      are usually more efficient
      can take advantage of special communication hardware

  • Barrier [Wilkinson 2004]

    A basic mechanism for synchronizing processes
    Inserted at the point in each process where it must wait for the others
    All processes can continue from this point when all the processes have reached it
      or when a stated number of processes have reached this point
    Often involved in other operations

  • Broadcast

    Distributes the same piece of data from a single source (root) to all processes (concerned with the problem)
      multicast: sending the message to a defined group of processes


  • Scatter

    Distributes each element of an array in the root to a separate process
      including the root
      the contents of the ith array element are sent to the ith process

  • Gather

    Collects data from each process at the root
      the value from the ith process is stored in the ith array element (rank order)

  • Reduction

    A gather operation combined with a specified arithmetic/logical operation
      collect data from each processor
      reduce these data to a single value (such as a sum or maximum)
      store the reduced result on the root processor


  • Message passing system (1)

    Computer realization of the message passing model
    Most popular message passing systems (MPS):
      Message Passing Interface (MPI) [next lecture]
      Parallel Virtual Machine (PVM)
      in distributed computing: Corba, Java RMI, DCOM, etc.

  • Message passing system (2)

    Information needed by the MPS to transfer a message includes:
      the sending process and the location, type and amount of transferred data
        no interest in the data itself (the message body)
      the receiving process(es) and storage to receive the data
    Most of this information is attached as the message envelope
      may be (partially) available to the receiving process
    The MPS may provide various information to the processes
      e.g. about the progress of communication
    A lot of other technical aspects, e.g.:
      process enrolment in the MPS
      addressing scheme
      content of the envelope
      using message buffers (system, user space)

  • WWW (what, when, why)

    Message passing (MPI)
      + easier to debug
      + easiest to optimize
      + can overlap communication and computation
      + potential for high scalability
      + support on all parallel architectures
      - harder to program
      - load balancing, deadlock prevention, etc. need to be addressed
      - most freedom and responsibility
    Shared variables (OMP)
      + easier to program than MP, code is simpler
      + implementation can be incremental
      + no message start-up costs
      + can cope with irregular communication patterns
      - limited to shared-memory systems
      - harder to debug and optimize
      - scalability limited
      - usually less efficient than MP equivalents
    Data parallel (HPF)
      + easier to program than MP
      + simpler to debug than SV
      + does not require shared memory
      - DP style suitable only for certain applications
      - restricted control over data and work distribution
      - difficult to obtain top performance
      - few APIs available
      - out of date?

  • Conclusions

    The definition of parallel programming models is not uniform in the literature; other models can be found, e.g.
      the thread programming model
      hybrid models, e.g. the combination of the message passing and shared variables models
        explicit message passing between the nodes of a cluster, shared memory and multithreading within the nodes
    Models continue to evolve along with the changing world of computer hardware and software
      e.g. CUDA, the parallel programming model for the CUDA GPU architecture

  • Further study

    The message passing model and the shared variables model are treated in some way in all general textbooks on parallel programming
      exception: [Foster 1995] almost skips data sharing
    There are plenty of books dedicated to shared objects, synchronisation and shared memory
      e.g. [Andrews 2000] Foundations of Multithreaded, Parallel, and Distributed Programming
      not necessarily focusing on parallel processing
    Data parallelism is usually a marginal topic
