TRANSCRIPT
WestGrid – Compute Canada - Online Workshop 2017
Introduction to Parallel Programming using OpenMP
Shared Memory Parallel Programming
Part – II
Dr. Ali Kerrache
WestGrid, Univ. of Manitoba, Winnipeg
E-mail: [email protected]
What do you need?
Basic Knowledge of:
C / C++ and/or Fortran. Compilers: GNU, Intel, … Compile, debug & run a program.
Utilities:
Text editor: vim, nano, …
ssh client: PuTTY, MobaXterm, … http://mobaxterm.mobatek.net/download.html
Access to Grex:
Compute Canada account.
WestGrid account.
Slides & examples (available): https://www.westgrid.ca/events/intro_openmp_part_2
How to participate in this workshop?
Copy the examples to your current working directory:
$ cp -r /global/scratch/workshop/openmp-wg-pII-2017 .
$ cd openmp-wg-pII-2017 && ls
Login to Grex:
$ ssh [email protected]
[ user-name@tatanka ~] $   or   [ user-name@bison ~] $
Reserve a compute node, then export the number of threads:
$ sh reserve_omp_node.sh
qsub: waiting for job 10535369.yak.local to start
[ user-name@nijk ~] $   (ijk: 001-316)
$ export OMP_NUM_THREADS=4   [bash shell]
$ setenv OMP_NUM_THREADS 4   [tcsh shell]
Parallel Computing Using OpenMP
Outline:
Introduction
Review of the first part and some examples
Intermediate and some advanced OpenMP directives
Example of PBS script for OpenMP programs
Conclusions
Parallel Programming: Concurrency and Parallelism
[Diagram: a distributed memory machine, where each CPU (CPU 0 – CPU 3) has its own memory (MEM 0 – MEM 3), vs. a shared memory machine, where all CPUs access one shared memory]
Distributed Memory Machines: MPI
Shared Memory Machines: OpenMP
Definition of OpenMP: an API with three components
Compiler Directives: added to a serial program and interpreted at compile time.
Runtime Library: routines executed at run time.
Environment Variables: set outside the program, after compile time, to control the OpenMP program.
OpenMP: Fork – Join model
[Diagram: serial region (master thread) → FORK → parallel region (team of threads) → JOIN → serial region]
The master thread spawns a team of threads as needed.
Parallelism is added incrementally: the sequential program evolves into a parallel program.
Serial region: master thread.
Parallel region: all threads (team of threads).
OpenMP has a simple syntax
Most of the constructs in OpenMP are compiler directives or pragmas.
C/C++: the pragma takes the form:
#pragma omp construct [clause [clause]…]
Fortran: the directives take one of the forms:
!$OMP construct [clause [clause]…]
C$OMP construct [clause [clause]…]
*$OMP construct [clause [clause]…]
For C/C++: #include <omp.h> For F90: use omp_lib For F77: include ‘omp_lib.h’
#include <omp.h>
block of C/C++ code();
#pragma omp parallel
{
  structured block of C/C++ code;
}
another block of C/C++ code();
C/C++
use omp_lib ! include ‘omp_lib.h’
block of F90/F77 code
!$omp parallel
  structured block of Fortran code
!$omp end parallel
another block of F90/F77 code
Fortran
Directives on multiple lines
#pragma omp parallel list-of-some-directives \
  list-of-other-directives \
  list-of-some-other-directives
{
  structured block of C/C++ code;
}
C/C++
!$omp parallel list-of-some-directives &
!$omp list-of-other-directives &
!$omp list-of-some-other-directives
  structured block of Fortran code
!$omp end parallel
Fortran
The list of directives continues on the next lines
Compile & Run an OpenMP Program
To compile and enable OpenMP:
GNU: add -fopenmp as an option to compile programs.
Intel compilers: add -openmp as an option to compile programs.
Environment variable: OMP_NUM_THREADS. If not specified, OpenMP will spawn one thread per hardware thread.
$ export OMP_NUM_THREADS=value [ bash shell ] $ setenv OMP_NUM_THREADS value [ tcsh shell ]
value: number of threads [ For example 4 ]
To run: $ ./exec_program or ./a.out
Conditional compilation
C/C++ and Fortran (latest version of OpenMP: 4.0)
Preprocessor macro _OPENMP for C/C++ and Fortran
#ifdef _OPENMP
  MyID = omp_get_thread_num();
#endif
Special comment for Fortran preprocessor
!$ MyID = OMP_GET_THREAD_NUM()
Helpful check of serial and parallel version of the code
Taken into account when compiled with OpenMP. Ignored if compiled in serial mode.
Runtime Library
omp_set_num_threads(NTHREADS);      Set number of threads
nthreads = omp_get_num_threads();   Get number of threads
ID = omp_get_thread_num();          Get thread rank
maxthreads = omp_get_max_threads(); Get max number of threads
time = omp_get_wtime();             Get time
To learn more about runtime library in C/C++ or Fortran:
http://www.openmp.org/specifications/
Data Environment
C/C++: default ( shared | none )Fortran: default ( private | firstprivate | shared | none )
shared: only a single instance of the variables exists in shared memory; all threads have read and write access to these variables.
private: each thread allocates its own private copy of the data; these local copies only exist in the parallel region; undefined when entering or exiting the parallel region.
firstprivate: variables are also declared private; additionally, they get initialized with the value of the original variable.
lastprivate: declares variables as private; variables get the value from the last iteration of the loop.
It is highly recommended to use: default ( none )
Work sharing: loops and sections [section]
#pragma omp parallel
{
  #pragma omp for
  for ( … ) { calc(); }
}

#pragma omp parallel for
for ( … ) { calc(); }
C/C++: Loops
!$omp parallel
!$omp do
!$omp end do
!$omp end parallel

!$omp parallel do
!$omp end parallel do
Fortran: Loops
#pragma omp parallel
#pragma omp sections
{
  #pragma omp section
  { some_computation(); }
  #pragma omp section
  { some_computation(); }
}
C/C++: Sections / section
!$omp sections
!$omp section
  some computation
!$omp section
  some computation
!$omp end sections
Fortran: Sections / section
Reduction construct in OpenMP
Aggregating values from different threads is such a common operation that OpenMP has a special reduction clause: similar to private and shared. Reduction variables support several types of operations: +, -, *, …
Syntax of the reduction clause: reduction (op : list)
Inside a parallel or a work-sharing construct:
A local copy of each variable in the list is made and initialized depending on the “op” (e.g., 0 for “+” or “-”).
Updates occur on the local copy.
Local copies are reduced into a single value and combined with the original global value.
The variables in “list” must be shared in the enclosing parallel region.
Synchronization
Synchronization: bring threads to a well-defined point in their execution.
Barrier: each thread waits at the barrier until all threads arrive.
Mutual exclusion: only one thread at a time can execute.
High-level constructs:
single: only one thread will execute the following block.
master: only the master thread will execute the following block.
critical: only one thread at a time will execute the following block.
atomic: used for local updates.
barrier: all threads must arrive here before going further.
Synchronization can reduce the performance: it causes overhead and can cost a lot, and more barriers will serialize the program. Use the appropriate synchronization construct.
Barrier in OpenMP
The barrier directive explicitly synchronizes all the threads in
a team.
When encountered, each thread in the team waits until all the
others have reached this point.
There are also implicit barriers at the end of:
parallel region: this barrier cannot be removed.
work-sharing constructs (do/for, sections, single): these barriers can be disabled by specifying nowait.
C/C++: #pragma omp barrier
Fortran: !$omp barrier
Review of some constructs and clauses
Work sharing: #pragma omp for #pragma omp sections
Runtime library: omp_set_num_threads(); omp_get_num_threads(); omp_get_thread_num(); omp_get_max_threads(); omp_get_wtime();
Variables: shared(list) private(list) firstprivate(list) lastprivate(list) reduction(op:list)
Synchronization: #pragma omp master #pragma omp single #pragma omp critical #pragma omp barrier #pragma omp atomic
Create threads in C/C++:
#pragma omp parallel
{ structured_block(); }
Create threads in Fortran:
!$omp parallel
  structured block
!$omp end parallel
Nested Loops: collapse construct
#pragma omp parallel for collapse(3)
for ( i = 0; i < N; i++ ) {
  for ( j = 0; j < M; j++ ) {
    for ( k = 0; k < L; k++ ) {
      block of C/C++ code;
    }
  }
}
C/C++
!$omp parallel do collapse(3)
do i = 1, N
  do j = 1, M
    do k = 1, L
      block of Fortran code
    end do
  end do
end do
!$omp end parallel do
Fortran
More than one loop: use the collapse(n) clause.
The argument must be the number of loops to collapse. It forms a single loop of N*M*L iterations.
Orphaned work sharing construct
void do_some_computation(int v[], int n) {
  int i;
  #pragma omp for
  for (i = 0; i < n; ++i) { v[i] = 0; }
}

int main() {
  int size = 200;
  int v[size];
  #pragma omp parallel
  {
    do_some_computation(v, size);   /* Case 1 */
  }
  do_some_computation(v, size);     /* Case 2 */
  return 0;
}
Example of orphaned construct in C/C++
A work-sharing construct can occur outside the lexical extent of a parallel region; it is then called an orphaned construct.
Case 1: called in a parallel context, it works as expected.
Case 2: called in a sequential context, the directive is ignored.
Switch off synchronization: nowait clause
To ensure the correctness of a computation, threads sometimes need to be synchronized:
critical regions and atomic updates;
at the end of a parallel region and of the loop construct.
To override the default behavior, the "nowait" clause can be used:
Use "nowait" for parallel loops which do not need to be synchronized upon exit of the loop; it keeps the synchronization overhead low.
Hint: the barrier at the end of a parallel region cannot be suppressed.
In Fortran, "nowait" needs to be given in the end directive (!$omp end do nowait).
Check for data dependencies before "nowait" is used.
If there are multiple independent loops in a parallel region, adding "nowait" can improve the performance.
Example: nowait clause. In the following example, the loops are independent; using “nowait” can improve the performance.
#pragma omp parallel
{
  #pragma omp for nowait
  for (i = 0; i < Nsteps; ++i) {
    a[i] = b[i] + c[i] + d[i];
  }
  #pragma omp for nowait
  for (i = 0; i < Nsteps; ++i) {
    z[i] = Function(b[i]);
  }
}
C/C++
!$omp parallel
!$omp do
do i = 1, Nsteps
  a(i) = b(i) + c(i) + d(i)
end do
!$omp end do nowait
!$omp do
do i = 1, Nsteps
  z(i) = Function(b(i))
end do
!$omp end do nowait
!$omp end parallel
Fortran
Conditional Threading
Creating threads can take longer than doing the calculation with one thread. The if(condition) clause is used for conditional threading, to avoid the extra overhead for small problem sizes (add a condition to omp parallel).
if (n > 100) {
  #pragma omp parallel for
  for (i = 0; i < n; ++i) {
    computations();
  }
} else {
  for (i = 0; i < n; ++i) {
    computations();
  }
}
C/C++: explicit version
if (n > 100) then
  !$omp parallel do
  do i = 1, n
    call computations()
  end do
  !$omp end parallel do
else
  do i = 1, n
    call computations()
  end do
end if
Fortran: explicit version
Conditional threading
#pragma omp parallel for if (n > 100)
for (i = 0; i < n; ++i) {
  computations();
}
C/C++: implicit version
!$omp parallel do if (n > 100)
do i = 1, n
  call computations()
end do
!$omp end parallel do
Fortran: implicit version
#pragma omp parallel if (condition)
{
  computations();
}
C/C++
!$omp parallel if (condition)
  call computations()
!$omp end parallel
Fortran
If the condition returns TRUE, the block is executed in parallel; if the condition returns FALSE, the block is executed in serial.
Load balancing
So far, the iterations of a loop had the same amount of work, but this is not always the case: for example, working on a mesh with different granularities, where a finer mesh requires more computations (more points). The default work sharing divides the N iterations equally among the threads, so some threads finish their job sooner and sit idle, reducing the performance.
How to increase the performance in this case? Use the schedule clause.
Scheduling clause in OpenMP
Default work sharing assigns N/nthr iterations to each thread: N is the total number of iterations, nthr the number of threads.
It can be better to assign a smaller number of iterations to each thread at a time: the schedule(type,chunk) clause changes the default behavior; chunk = size (number of iterations) of each piece of work.
C/C++: #pragma omp parallel for schedule(type,chunk)
Fortran: !$omp parallel do schedule(type,chunk)
Scheduling types: schedule(static[,chunk]), schedule(dynamic[,chunk]), schedule(guided[,chunk]), schedule(runtime)
Scheduling clause: static
static: the distribution is done at loop entry, based on the number of threads and the total number of iterations. Less flexible, but almost no scheduling overhead.
[Diagram: iteration-to-thread assignment]
schedule (static): without a chunk size, one chunk of iterations per thread, all chunks (nearly) equal (th0 th1 th2 th3).
schedule (static,2): with a chunk size, chunks of the specified size are assigned in round-robin fashion (th0 th1 th2 th3 th0 th1 th2 th3 …).
Scheduling clause: dynamic
dynamic: the distribution is done during the execution of the loop. Each thread is assigned a subset of the iterations at loop entry; after completion, each thread asks for more iterations. More flexible: can easily adjust to load imbalances. More scheduling overhead (synchronization): threads request new chunks dynamically during runtime. The default chunk size is 1.
Example: schedule(dynamic,2)
[Diagram: chunks of 2 iterations handed out to threads th0–th3 in varying, run-dependent orders]
Scheduling clause: guided
guided: the first chunk has an implementation-dependent size, and the size of each successive chunk decreases exponentially. Chunks are assigned dynamically. The chunk size specifies the minimum size; the default is 1.
Example: schedule(guided,2)
[Diagram: chunks of exponentially decreasing size handed out to threads th0–th3 in varying, run-dependent orders]
Scheduling clause: runtime
Schedule on demand:
The scheduling strategy can be chosen via an environment variable. If the variable is not set, the scheduling is implementation dependent.
$ export OMP_SCHEDULE="type[,chunk]"
Example: schedule(runtime)
If no schedule parameter is given, the scheduling is implementation dependent. The correctness of the program must not depend on the scheduling used. omp_set_schedule() / omp_get_schedule() allow setting and querying the scheduling at runtime.
Tasks in OpenMP
Non-loop parallelism: sections/section.
Tasks: introduced in OpenMP 3.0 to enable non-loop or unbounded-loop parallelism (e.g., while loops).
Syntax: C/C++: #pragma omp task; Fortran: !$omp task
A flexible alternative to the sections directive: tasks are created dynamically, while the number of blocks in sections is static. Tasks must be created within a parallel region and should usually be enclosed in a single construct.
void processList(List* list) {
  while (list != NULL) {
    process(list);
    list = list->next;
  }
}
While loop

int factorial(int n) {
  if (n == 0)
    return 1;
  return n * factorial(n - 1);
}
Recursive function
Tasks in OpenMP

void processList(List* list) {
  while (list != NULL) {
    #pragma omp task firstprivate(list)
    process(list);     /* one task per list element */
    list = list->next;
  }
}

int main() {
  #pragma omp parallel
  {
    #pragma omp single
    {
      processList(list);
    }
  }
  return 0;
}
C/C++: while loop
Task parallelism in OpenMP
Definition of task:
a unit of independent work (block of code, like sections).
a direct or deferred execution of work performed by one thread of the team.
composed of code to be executed, and the data environment (constructed at creation time).
tied to a thread: only this thread can execute the task.
useful for unbounded loops, recursive algorithms, work on linked lists (pointer) …
Locks in OpenMP: C/C++
A more flexible way of implementing “critical regions”. Lock variable type: omp_lock_t, passed by address.

omp_init_lock(&lockvar);          Initialize lock
omp_destroy_lock(&lockvar);       Deallocate lock
omp_set_lock(&lockvar);           Blocks calling thread until the lock is available
omp_unset_lock(&lockvar);         Release lock
intv = omp_test_lock(&lockvar);   Test and try to set the lock (returns 1 on success, else 0)
Locks in OpenMP: Fortran
A more flexible way of implementing “critical regions”. The lock variable has to be an integer (of kind omp_lock_kind).

call omp_init_lock(lockvar)        Initialize lock
call omp_destroy_lock(lockvar)     Deallocate lock
call omp_set_lock(lockvar)         Blocks calling thread until the lock is available
call omp_unset_lock(lockvar)       Release lock
success = omp_test_lock(lockvar)   Test and try to set the lock (returns .true. on success)
Mechanism of locks in OpenMP
int main() {
  omp_lock_t lockvar;
  int id;
  omp_init_lock(&lockvar);
  #pragma omp parallel shared(lockvar) private(id)
  {
    id = omp_get_thread_num();
    while (!omp_test_lock(&lockvar)) {
      skip(id);   /* we do not yet have the lock, so do something else */
    }
    work(id);     /* we now have the lock and can do the work */
    printf("Key given back by %d\n", id);
    omp_unset_lock(&lockvar);
  }
  omp_destroy_lock(&lockvar);
  return 0;
}
Mechanism of locks: example in C/C++
Some thoughts on parallelizing programs
Is the serial version of the code well optimized?
Which compiler settings might increase the performance of the code?
Estimate scalability of your code.
Which parts of the code are time consuming?
Is the number of parallel regions as small as possible?
Was the outermost loop of nested loops parallelized?
Use the “nowait“ clause whenever possible.
Is the workload well balanced over all threads?
Avoid false sharing effects and race conditions.
Data dependencies and race conditions
Most important rule: parallelization of the code must not affect the correctness of the program!
In loops:
The results of each single iteration must not depend on each other.
Race conditions must be avoided: the results must not be affected by the order of the threads.
Correctness of the program must not depend on the number of threads.
Correctness must not depend on the work scheduling.
Race condition: threads read and write the same object at the same time, giving unpredictable results (sometimes it works, sometimes not): wrong answers without a warning signal! When correctness depends on the order of read/write accesses, thread synchronization is needed to ensure that readers do not get ahead of writers.
Synchronization: barriers, critical regions, …
Note: be careful with synchronization (it reduces the performance).
PBS Script for OpenMP programs

#!/bin/bash
#PBS -S /bin/bash
#PBS -l nodes=1:ppn=4
#PBS -l mem=2000mb
#PBS -l walltime=24:00:00
#PBS -M <your-valid-email>
#PBS -m abe

# Load the compiler module and/or your application module.

cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"
export OMP_NUM_THREADS=$PBS_NUM_PPN
./your_openmp_exec < input_file > output_file
echo "Program finished at: `date`"

Resources: nodes=1; ppn=1 up to the maximum number of CPUs (hardware) per node; for example, nodes=1:ppn=4.

# On systems where $PBS_NUM_PPN is not available, one could use:
CORES=`/bin/awk 'END {print NR}' $PBS_NODEFILE`
export OMP_NUM_THREADS=$CORES
Useful links and more readings
OpenMP: http://www.openmp.org/
Compute Canada Wiki: https://docs.computecanada.ca/wiki/OpenMP
WestGrid: https://www.westgrid.ca/support/programming
Reference cards: http://www.openmp.org/specifications/
OpenMP Wiki: https://en.wikipedia.org/wiki/OpenMP
Examples: http://www.openmp.org/updates/openmp-examples-4-5-published/
Contact: [email protected]
WestGrid events: https://www.westgrid.ca/events