Lecture 8: OpenMP

Posted on 30-Dec-2015

  • Lecture 8: OpenMP

  • Parallel Programming Models

    Parallel Programming Models:

    Data parallelism / Task parallelism
    Explicit parallelism / Implicit parallelism
    Shared memory / Distributed memory
    Other programming paradigms: object-oriented, functional and logic

  • Parallel Programming Models

    Shared Memory
    The programmer's task is to specify the activities of a set of processes that communicate by reading and writing shared memory.
    Advantage: the programmer need not be concerned with data-distribution issues.
    Disadvantage: efficient implementation may be difficult on computers that lack hardware support for shared memory, and race conditions tend to arise more easily.

    Distributed Memory
    Processes have only local memory and must use some other mechanism (e.g., message passing or remote procedure call) to exchange information.
    Advantage: programmers have explicit control over data distribution and communication.

  • Shared vs Distributed Memory

    Shared memory: processors (P) connected by a shared bus to common memory modules (M).

    Distributed memory: processor/memory pairs (P, M) connected by an interconnection network.

  • Parallel Programming Models

    Parallel Programming Tools:

    Parallel Virtual Machine (PVM): distributed memory, explicit parallelism
    Message-Passing Interface (MPI): distributed memory, explicit parallelism
    PThreads: shared memory, explicit parallelism
    OpenMP: shared memory, explicit parallelism
    High-Performance Fortran (HPF): implicit parallelism
    Parallelizing compilers: implicit parallelism

  • Parallel Programming Models

    Shared Memory Model

    Used on shared-memory MIMD architectures.

    A program consists of many independent threads.

    Concurrently executing threads all share a single, common address space.

    Threads can exchange information by reading and writing memory using normal variable-assignment operations.

  • Parallel Programming Models

    Memory Coherence Problem

    The problem of ensuring that the latest value of a variable updated in one thread is used when that same variable is accessed in another thread.

    Hardware support and compiler support are required.

    Cache-coherency protocols provide this support in hardware.

  • Parallel Programming Models

    Distributed Shared Memory (DSM) Systems

    Implement Shared memory model on Distributed memory MIMD architectures

    Concurrently executing threads all share a single, common address space.

    Threads can exchange information by reading and writing to memory using normal variable assignment operations

    Use a message-passing layer as the means for communicating updated values throughout the system.

  • Parallel Programming Models

    Synchronization operations in Shared Memory Model

    Monitors, locks, critical sections, condition variables, semaphores, barriers

  • OpenMP


    www.openmp.org/

  • OpenMP

    Shared-memory programming model
    Thread-based parallelism
    Fork/join model
    Compiler-directive based
    No support for parallel I/O

  • OpenMP

    The master thread executes the sequential sections.

    The master thread forks additional threads for the parallel sections.

    At the end of the parallel code, the created threads die and control returns to the master thread (join).

  • OpenMP: General Code Structure

    #include <omp.h>

    int main() {
        int var1, var2, var3;

        /* Serial code */
        ...

        /* Fork a team of threads; specify variable scoping */
        #pragma omp parallel private(var1, var2) shared(var3)
        {
            /* Parallel section executed by all threads */
            ...
        }
        /* All threads join the master thread and disband */

        /* Resume serial code */
        ...
    }

  • OpenMP: General Code Structure

    #include <stdio.h>
    #include <omp.h>

    int main() {
        int nthreads, tid;

        /* Fork a team of threads */
        #pragma omp parallel private(tid)
        {
            tid = omp_get_thread_num();
            printf("Hello World from thread = %d\n", tid);
            if (tid == 0) {          /* master thread */
                nthreads = omp_get_num_threads();
                printf("Number of threads = %d\n", nthreads);
            }
        }   /* All threads join the master thread and terminate */
    }

  • OpenMP: parallel Directive

    The execution of the code block after the parallel pragma is replicated among the threads.

    #include <omp.h>

    int main() {
        struct job_struct *job_ptr;
        struct task_struct *task_ptr;

        #pragma omp parallel private(task_ptr)
        {
            task_ptr = get_next_task(&job_ptr);
            while (task_ptr != NULL) {
                complete_task(task_ptr);
                task_ptr = get_next_task(&job_ptr);
            }
        }
    }

    job_ptr is shared by the master thread and the workers; each thread has its own private task_ptr.

  • OpenMP: parallel for Directive

    #include <omp.h>

    int main() {
        int i;
        float b[5];

        #pragma omp parallel for
        for (i = 0; i < 5; i++)
            b[i] = i;
    }

    In a parallel for, variables are shared by default, with the exception that the loop index variable is private.

  • OpenMP

    Execution context: the address space containing all of the variables a thread may access.

    Shared variable: has the same address in the execution context of every thread.

    Private variable: has a different address in the execution context of every thread.

  • OpenMP: private Clause

    Declares the variables in its list to be private to each thread: private(list)

    #include <omp.h>

    int main() {
        int i, j, n;
        float a[10][10];

        n = 10;
        #pragma omp parallel for private(j)
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                a[i][j] = a[i][j] + i;
    }

  • OpenMP: critical Directive

    Directs the compiler to enforce mutual exclusion among the threads executing the block of code.

    #include <omp.h>

    int main() {
        int x;
        x = 0;
        #pragma omp parallel shared(x)
        {
            #pragma omp critical
            x = x + 1;
        }   /* end of parallel section */
    }

  • OpenMP: reduction Clause

    reduction(operator : variable)

    #include <omp.h>

    int main() {
        int i, n;
        float a, x, p;

        n = 100;
        a = 0.0;
        #pragma omp parallel for private(x) reduction(+:a)
        for (i = 0; i < n; i++) {
            x = i / 10.0;
            a += x * x;
        }
        p = a / n;
    }

  • OpenMP: reduction Operators

    +   addition
    -   subtraction
    *   multiplication
    &   bitwise and
    |   bitwise or
    ^   bitwise exclusive or
    &&  conditional and
    ||  conditional or

  • OpenMP

    Loop scheduling: determines how the iterations of a loop are allocated to threads.

    Static schedule: all iterations are allocated to threads before any loop iterations execute.
    Low overhead; potentially high load imbalance.

    Dynamic schedule: only some of the iterations are allocated to threads at the beginning of the loop's execution. Threads that complete their iterations are then eligible to get additional work.
    Higher overhead; reduced load imbalance.

  • OpenMP: schedule Clause

    schedule(type [, chunk])

    type: static, dynamic, etc.
    chunk: the number of contiguous iterations assigned to each thread at a time

    Increasing chunk size can reduce overhead and increase cache hit rate.

  • OpenMP: schedule Clause

    #include <omp.h>

    int main() {
        int i, n;
        float a[10];

        n = 10;
        #pragma omp parallel for schedule(static, 5)
        for (i = 0; i < n; i++)
            a[i] = a[i] + i;
    }

  • OpenMP: schedule Clause (dynamic)

    #include <stdio.h>
    #include <omp.h>
    #define CHUNKSIZE 100
    #define N 1000

    int main() {
        int i, chunk;
        float a[N], b[N], c[N];

        for (i = 0; i < N; i++)
            a[i] = b[i] = i * 1.0;
        chunk = CHUNKSIZE;

        #pragma omp parallel shared(a,b,c,chunk) private(i)
        {
            /* iterations are distributed dynamically in chunk-sized pieces */
            #pragma omp for schedule(dynamic,chunk) nowait
            for (i = 0; i < N; i++)
                c[i] = a[i] + b[i];
        }   /* end of parallel section */
    }

  • OpenMP: nowait Clause

    Tells the compiler to omit the barrier synchronization at the end of the parallel for loop.

    #include <stdio.h>
    #include <omp.h>
    #define CHUNKSIZE 100
    #define N 1000

    int main() {
        int i, chunk;
        float a[N], b[N], c[N];

        for (i = 0; i < N; i++)
            a[i] = b[i] = i * 1.0;
        chunk = CHUNKSIZE;

        #pragma omp parallel shared(a,b,c,chunk) private(i)
        {
            #pragma omp for schedule(dynamic,chunk) nowait
            for (i = 0; i < N; i++)
                c[i] = a[i] + b[i];
        }
    }

  • OpenMP: for Directive

    Specifies that the iterations of the loop immediately following it must be executed in parallel by the team.

    Sequential version:

    for (i = 0; i < n; i++) {
        low = a[i];
        high = b[i];
        if (low > high) break;
        for (j = low; j < high; j++)
            c[j] = (c[j] - a[i]) / b[i];
    }

    Parallel version (the inner loop is divided among the team):

    #pragma omp parallel private(i,j)
    for (i = 0; i < n; i++) {
        low = a[i];
        high = b[i];
        if (low > high) break;
        #pragma omp for
        for (j = low; j < high; j++)
            c[j] = (c[j] - a[i]) / b[i];
    }

  • OpenMP: single Directive

    Specifies that the enclosed code is to be executed by only one thread in the team. Threads in the team that do not execute the single directive wait at the end of the enclosed code block.

    Sequential version:

    for (i = 0; i < n; i++) {
        low = a[i];
        high = b[i];
        if (low > high) break;
        for (j = low; j < high; j++)
            c[j] = (c[j] - a[i]) / b[i];
    }

    Parallel version (only one thread reports the early exit):

    #pragma omp parallel private(i,j)
    for (i = 0; i < n; i++) {
        low = a[i];
        high = b[i];
        if (low > high) {
            #pragma omp single
            printf("Exiting during iteration %d\n", i);
            break;
        }
        #pragma omp for
        for (j = low; j < high; j++)
            c[j] = (c[j] - a[i]) / b[i];
    }

  • OpenMP: threadprivate Directive

    The threadprivate directive makes global file-scope variables (C/C++) local and persistent to a thread across the execution of multiple parallel regions.

    #include <stdio.h>
    #include <omp.h>

    int a, b, i, tid;
    float x;
    #pragma omp threadprivate(a, x)

    int main() {
        omp_set_dynamic(0);   /* explicitly turn off dynamic threads */

        #pragma omp parallel private(b, tid)
        {
            tid = omp_get_thread_num();
            a = tid;
            b = tid;
            x = 1.1 * tid + 1.0;
            printf("Thread %d: a,b,x= %d %d %f\n", tid, a, b, x);
        }   /* end of parallel section */

        printf("Master thread doing serial work here\n");

        #pragma omp parallel private(tid)
        {
            tid = omp_get_thread_num();
            printf("Thread %d: a,b,x= %d %d %f\n", tid, a, b, x);
        }   /* end of parallel section */
    }

    Output:

    Thread 0: a,b,x= 0 0 1.000000
    Thread 2: a,b,x= 2 2 3.200000
    Thread 3: a,b,x= 3 3 4.300000
    Thread 1: a,b,x= 1 1 2.100000
    Master thread doing serial work here
    Thread 0: a,b,x= 0 0 1.000000
    Thread 3: a,b,x= 3 0 4.300000
    Thread 1: a,b,x= 1 0 2.100000
    Thread 2: a,b,x= 2 0 3.200000

    In the second region, a and x retain their per-thread values, while b (merely private) does not.

  • OpenMP: parallel sections Directive (Functional Parallelism)

    Specifies that the enclosed section(s) of code are to be divided among the threads in the team and executed concurrently.

    #include <omp.h>

    int main() {
        ...
        #pragma omp parallel sections
        {
            #pragma omp section   /* thread 1 */
            v = alpha();
            #pragma omp section   /* thread 2 */
            w = beta();
            #pragma omp section   /* thread 3 */
            y = delta();
        }   /* end of parallel sections */
        x = gamma(v, w);
        printf("%f\n", epsilon(x, y));
    }

  • OpenMP: parallel sections Directive (Functional Parallelism)

    Another solution:

    int main() {
        ...
        #pragma omp parallel
        {
            #pragma omp sections
            {
                #pragma omp section   /* thread 1 */
                v = alpha();
                #pragma omp section   /* thread 2 */
                w = beta();
            }
            #pragma omp sections
            {
                #pragma omp section   /* thread 3 */
                x = gamma(v, w);
                #pragma omp section   /* thread 4 */
                y = delta();
            }
        }   /* end of parallel region */
        printf("%f\n", epsilon(x, y));
    }

  • OpenMP: sections Directive

    #include <omp.h>
    #define N 1000

    int main() {
        int i;
        float a[N], b[N], c[N], d[N];

        for (i = 0; i < N; i++) {
            a[i] = i * 1.5;
            b[i] = i + 22.35;
        }

        #pragma omp parallel shared(a,b,c,d) private(i)
        {
            #pragma omp sections nowait
            {
                #pragma omp section
                for (i = 0; i < N; i++)
                    c[i] = a[i] + b[i];
                #pragma omp section
                for (i = 0; i < N; i++)
                    d[i] = a[i] * b[i];
            }   /* end of sections */
        }   /* end of parallel section */
    }

  • OpenMP

    Synchronization constructs:

    master directive: specifies a region that is to be executed only by the master thread of the team

    critical directive: specifies a region of code that must be executed by only one thread at a time

    barrier directive: synchronizes all threads in the team

    atomic directive: specifies that a specific memory location must be updated atomically (a mini critical section)

  • OpenMP: barrier Directive

    #pragma omp barrier ... ;

    atomic Directive

    #pragma omp atomic ... ;

  • OpenMP: Run-time Library Routines

    omp_set_num_threads(int): sets the number of threads to be used in the next parallel region
    omp_get_num_threads(void): returns the number of threads currently executing the parallel region
    omp_get_thread_num(void): returns the calling thread's number
    omp_get_num_procs(void): returns the number of processors
    omp_in_parallel(void): determines whether the calling code is executing within a parallel region

  • Parallel Programming Models

    Example: Pi calculation

    pi = integral from 0 to 1 of f(x) dx, where f(x) = 4/(1+x^2)

    Midpoint-rule approximation: pi ~ w * sum over i of f(x_i), with n = 10 subintervals, w = 1/n, x_i = w*(i - 0.5)

  • Parallel Programming Models: Sequential Code

    #include <stdio.h>
    #define f(x) (4.0/(1.0+(x)*(x)))

    int main() {
        int n, i;
        float w, x, sum, pi;

        printf("n?\n");
        scanf("%d", &n);
        w = 1.0/n;
        sum = 0.0;
        for (i = 1; i <= n; i++) {
            x = w*(i - 0.5);
            sum = sum + f(x);
        }
        pi = w*sum;
        printf("pi=%f\n", pi);
    }

  • OpenMP: Pi Calculation

    #include <stdio.h>
    #include <omp.h>
    #define f(x) (4.0/(1.0+(x)*(x)))
    #define NUM_THREADS 4

    int main() {
        float sum, w, pi, x;
        int i, n;

        printf("n?\n");
        scanf("%d", &n);
        sum = 0.0;
        w = 1.0/n;
        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel for private(x) reduction(+:sum)
        for (i = 1; i <= n; i++) {
            x = w*(i - 0.5);
            sum = sum + f(x);
        }
        pi = w*sum;
        printf("pi=%f\n", pi);
    }