TRANSCRIPT
WestGrid – Compute Canada - Online Workshop 2017
Introduction to Parallel Programming using OpenMP
Shared Memory Parallel Programming
Part – II
Dr. Ali Kerrache
WestGrid, Univ. of Manitoba, Winnipeg
E-mail: [email protected]
What do you need?
Basic Knowledge of:
C / C++ and/or Fortran. Compilers: GNU, Intel, … Compile, debug & run a program.
Utilities:
Text editor: vim, nano, …
ssh client: PuTTY, MobaXterm, … http://mobaxterm.mobatek.net/download.html
Access to Grex:
Compute Canada account.
WestGrid account.
Slides & examples (available): https://www.westgrid.ca/events/intro_openmp_part_2
How to participate in this workshop?
Copy the examples to your current working directory:
$ cp -r /global/scratch/workshop/openmp-wg-pII-2017 .
$ cd openmp-wg-pII-2017 && ls
Login to Grex:
$ ssh [email protected]
[ user-name@tatanka ~] $   or   [ user-name@bison ~] $
Reserve a compute node, then export the number of threads:
$ sh reserve_omp_node.sh
qsub: waiting for job 10535369.yak.local to start
[ user-name@nijk ~] $   (ijk: 001-316)
$ export OMP_NUM_THREADS=4   [bash shell]
$ setenv OMP_NUM_THREADS 4   [tcsh shell]
Parallel Computing Using OpenMP
Outline:
Introduction
Review of the first part and some examples
Intermediate and some advanced OpenMP directives
Example of PBS script for OpenMP programs
Conclusions
Parallel Programming: Concurrency and Parallelism
[Diagram: a distributed memory machine, where each CPU (CPU 0 – CPU 3) has its own memory (MEM 0 – MEM 3), vs. a shared memory machine, where all CPUs access one shared memory]
Distributed Memory Machines: MPI
Shared Memory Machines: OpenMP
Definition of OpenMP: an API with three components
Compiler Directives: added to a serial program and interpreted at compile time.
Runtime Library: routines executed at run time.
Environment Variables: set outside the program, after compile time, to control the OpenMP program.
OpenMP: Fork – Join model
[Diagram: serial region (master thread) → FORK → parallel region (team of threads) → JOIN → serial region]
The master thread spawns a team of threads as needed.
Parallelism is added incrementally: the sequential program evolves into a parallel program.
Serial region: master thread.
Parallel region: all threads (team of threads).
OpenMP has a simple syntax
Most of the constructs in OpenMP are compiler directives or pragmas.
C/C++: the pragma takes the form:
#pragma omp construct [clause [clause]…]
Fortran: the directives take one of the forms:
!$OMP construct [clause [clause]…]
C$OMP construct [clause [clause]…]
*$OMP construct [clause [clause]…]
For C/C++: #include <omp.h> For F90: use omp_lib For F77: include ‘omp_lib.h’
#include <omp.h>
block of C/C++ code();
#pragma omp parallel
{
  structured block of C/C++ code;
}
another block of C/C++ code();
C/C++
use omp_lib ! include ‘omp_lib.h’
block of F90/F77 code
!$omp parallel
  structured block of Fortran code
!$omp end parallel
another block of F90/F77 code
Fortran
Directives on multiple lines
#pragma omp parallel list-of-some-directives \
  list-of-other-directives \
  list-of-some-other-directives
{
  structured block of C/C++ code;
}
C/C++
!$omp parallel list-of-some-directives &
!$omp list-of-other-directives &
!$omp list-of-some-other-directives
  structured block of Fortran code
!$omp end parallel
Fortran
The list of directives continues on the next lines
Compile & Run an OpenMP Program
To compile and enable OpenMP:
GNU: add -fopenmp as an option to compile programs.
Intel compilers: add -openmp as an option to compile programs.
Environment variable: OMP_NUM_THREADS. If not specified, OpenMP will spawn one thread per hardware thread.
$ export OMP_NUM_THREADS=value [ bash shell ] $ setenv OMP_NUM_THREADS value [ tcsh shell ]
value: number of threads [ For example 4 ]
To run: $ ./exec_program or ./a.out
Conditional compilation
C/C++ and Fortran (latest version of OpenMP: 4.0)
Preprocessor macro _OPENMP for C/C++ and Fortran
#ifdef _OPENMP
  MyID = omp_get_thread_num();
#endif
Special comment for Fortran preprocessor
!$ MyID = OMP_GET_THREAD_NUM()
Helpful check of serial and parallel version of the code
Taken into account when compiled with OpenMP. Ignored if compiled in serial mode.
Runtime Library
omp_set_num_threads(NTHREADS);      Set number of threads
nthreads = omp_get_num_threads();   Get number of threads
ID = omp_get_thread_num();          Get thread rank
maxthreads = omp_get_max_threads(); Get max number of threads
time = omp_get_wtime();             Get time
To learn more about runtime library in C/C++ or Fortran:
http://www.openmp.org/specifications/
Data Environment
C/C++: default ( shared | none )Fortran: default ( private | firstprivate | shared | none )
shared: only a single instance of the variables exists in shared memory; all threads have read and write access to these variables.
private: each thread allocates its own private copy of the data; these local copies only exist in the parallel region; undefined when entering or exiting the parallel region.
firstprivate: variables are also declared private; additionally, they get initialized with the value of the original variable.
lastprivate: declares variables as private; variables get the value from the last iteration of the loop.
It is highly recommended to use: default ( none )
Work sharing: loops and sections [section]
#pragma omp parallel
{
  #pragma omp for
  for ( … ) { calc(); }
}

#pragma omp parallel for
for ( … ) { calc(); }
C/C++: Loops
!$omp parallel
!$omp do
!$omp end do
!$omp end parallel

!$omp parallel do
!$omp end parallel do
Fortran: Loops
#pragma omp parallel
#pragma omp sections
{
  #pragma omp section
  { some_computation(); }
  #pragma omp section
  { some_computation(); }
}
C/C++: Sections / section
!$omp sections
!$omp section
  some computation
!$omp section
  some computation
!$omp end sections
Fortran: Sections / section
Reduction construct in OpenMP
Aggregating values from different threads is such a common operation that OpenMP has a special reduction clause: similar to private and shared. Reduction variables support several types of operations: +, -, *, …
Syntax of the reduction clause: reduction (op : list)
Inside a parallel or a work-sharing construct:
A local copy of each variable in the list is made and initialized depending on the “op” (e.g., 0 for “+” or “-”).
Updates occur on the local copy.
Local copies are reduced into a single value and combined with the original global value.
The variables in “list” must be shared in the enclosing parallel region.
Synchronization
Synchronization: bring threads to a well-defined point in their execution.
Barrier: each thread waits at the barrier until all threads arrive.
Mutual exclusion: only one thread at a time can execute.
High-level constructs:
single: only one thread will execute the following block.
master: only the master thread will execute the following block.
critical: only one thread at a time will execute the following block.
atomic: used for local updates.
barrier: all threads must arrive here before going further.
Synchronization can reduce the performance: it causes overhead and can cost a lot, and more barriers will serialize the program. Use the appropriate synchronization construct.
Barrier in OpenMP
The barrier directive explicitly synchronizes all the threads in
a team.
When encountered, each thread in the team waits until all the
others have reached this point.
There are also implicit barriers at the end of:
parallel region: this barrier cannot be removed.
work-sharing constructs (do/for, sections, single): these barriers can be disabled by specifying nowait.
C/C++: #pragma omp barrier
Fortran: !$omp barrier
Review of some constructs and clauses
Work sharing: #pragma omp for #pragma omp sections
Runtime library: omp_set_num_threads(); omp_get_num_threads(); omp_get_thread_num(); omp_get_max_threads(); omp_get_wtime();
Variables: shared(list) private(list) firstprivate(list) lastprivate(list) reduction(op:list)
Synchronization: #pragma omp master #pragma omp single #pragma omp critical #pragma omp barrier #pragma omp atomic
Create threads in C/C++:
#pragma omp parallel
{ structured_block(); }
Create threads in Fortran:
!$omp parallel
  structured block
!$omp end parallel
Nested Loops: collapse construct
#pragma omp parallel for collapse(3)
for ( i = 0; i < N; i++ ) {
  for ( j = 0; j < M; j++ ) {
    for ( k = 0; k < L; k++ ) {
      block of C/C++ code;
    }
  }
}
C/C++
!$omp parallel do collapse(3)
do i = 1, N
  do j = 1, M
    do k = 1, L
      block of Fortran code
    end do
  end do
end do
!$omp end parallel do
Fortran
More than one loop: use the collapse(n) clause.
The argument must be the number of loops to collapse. It forms a single loop of N*M*L iterations.
Orphaned work sharing construct
void do_some_computation(int v[], int n) {
  int i;
  #pragma omp for
  for (i = 0; i < n; ++i) { v[i] = 0; }
}

int main() {
  int size = 200;
  int v[size];
  #pragma omp parallel
  {
    do_some_computation(v, size);   /* Case 1 */
  }
  do_some_computation(v, size);     /* Case 2 */
  return 0;
}
Example of orphaned construct in C/C++
A work-sharing construct can occur outside the lexical extent of a parallel region; it is then called an orphaned construct.
Case 1: called in a parallel context, it works as expected.
Case 2: called in a sequential context, the directive is ignored.
Switch off synchronization: nowait clause
To ensure the correctness of a computation, threads sometimes need to be synchronized:
critical regions and atomic updates;
at the end of a parallel region and of the loop construct.
To override the default behavior, the "nowait" clause can be used:
Use "nowait" for parallel loops which do not need to be synchronized upon exit of the loop; it keeps the synchronization overhead low.
Hint: the barrier at the end of a parallel region cannot be suppressed.
In Fortran, "nowait" needs to be given in the end directive (!$omp end do nowait).
Check for data dependencies before "nowait" is used.
If there are multiple independent loops in a parallel region, adding "nowait" can improve the performance.
Example: nowait clause. In the following example, the loops are independent; using “nowait” can improve the performance.
#pragma omp parallel
{
  #pragma omp for nowait
  for (i = 0; i < Nsteps; ++i) {
    a[i] = b[i] + c[i] + d[i];
  }
  #pragma omp for nowait
  for (i = 0; i < Nsteps; ++i) {
    z[i] = Function(b[i]);
  }
}
C/C++
!$omp parallel
!$omp do
do i = 1, Nsteps
  a(i) = b(i) + c(i) + d(i)
end do
!$omp end do nowait
!$omp do
do i = 1, Nsteps
  z(i) = Function(b(i))
end do
!$omp end do nowait
!$omp end parallel
Fortran
Conditional Threading
Creating threads can take longer than doing the calculation with one thread. The if(condition) clause is used for conditional threading, to avoid the extra overhead for small problem sizes (add a condition to omp parallel).
if (n > 100) {
  #pragma omp parallel for
  for (i = 0; i < n; ++i) {
    computations();
  }
} else {
  for (i = 0; i < n; ++i) {
    computations();
  }
}
C/C++: explicit version
if (n > 100) then
  !$omp parallel do
  do i = 1, n
    call computations()
  end do
  !$omp end parallel do
else
  do i = 1, n
    call computations()
  end do
end if
Fortran: explicit version
Conditional threading
#pragma omp parallel for if (n > 100)
for (i = 0; i < n; ++i) {
  computations();
}
C/C++: implicit version
!$omp parallel do if (n > 100)
do i = 1, n
  call computations()
end do
!$omp end parallel do
Fortran: implicit version
#pragma omp parallel if (condition)
{
  computations();
}
C/C++
!$omp parallel if (condition)
  call computations()
!$omp end parallel
Fortran
If the condition returns TRUE, the block is executed in parallel; if the condition returns FALSE, the block is executed in serial.
Load balancing
So far, the iterations of a loop had the same amount of work, but this is not always the case: for example, working on a mesh with different granularities, where a finer mesh requires more computations (more points). The default work sharing divides the N iterations equally among the threads, so some threads finish their job sooner and sit idle, reducing the performance.
How to increase the performance in this case? Use the schedule clause.
Scheduling clause in OpenMP
Default work sharing assigns N/nthr iterations to each thread: N is the total number of iterations, nthr the number of threads.
It can be better to assign a smaller number of iterations to each thread at a time: the schedule(type,chunk) clause changes the default behavior; chunk = size (number of iterations) of each piece of work.
C/C++: #pragma omp parallel for schedule(type,chunk)
Fortran: !$omp parallel do schedule(type,chunk)
Scheduling types: schedule(static[,chunk]), schedule(dynamic[,chunk]), schedule(guided[,chunk]), schedule(runtime)
Scheduling clause: static
static: the distribution is done at loop entry, based on the number of threads and the total number of iterations. Less flexible, but almost no scheduling overhead.
[Diagram: iteration-to-thread assignment]
schedule (static): without a chunk size, one chunk of iterations per thread, all chunks (nearly) equal (th0 th1 th2 th3).
schedule (static,2): with a chunk size, chunks of the specified size are assigned in round-robin fashion (th0 th1 th2 th3 th0 th1 th2 th3 …).
Scheduling clause: dynamic
dynamic: the distribution is done during the execution of the loop. Each thread is assigned a subset of the iterations at loop entry; after completion, each thread asks for more iterations. More flexible: can easily adjust to load imbalances. More scheduling overhead (synchronization): threads request new chunks dynamically during runtime. The default chunk size is 1.
Example: schedule(dynamic,2)
[Diagram: chunks of 2 iterations handed out to threads th0–th3 in varying, run-dependent orders]
Scheduling clause: guided
guided: the first chunk has an implementation-dependent size, and the size of each successive chunk decreases exponentially. Chunks are assigned dynamically. The chunk size specifies the minimum size; the default is 1.
Example: schedule(guided,2)
[Diagram: chunks of exponentially decreasing size handed out to threads th0–th3 in varying, run-dependent orders]
Scheduling clause: runtime
Schedule on demand:
The scheduling strategy can be chosen via an environment variable. If the variable is not set, the scheduling is implementation dependent.
$ export OMP_SCHEDULE="type[,chunk]"
Example: schedule(runtime)
If no schedule parameter is given, the scheduling is implementation dependent. The correctness of the program must not depend on the scheduling used. omp_set_schedule() / omp_get_schedule() allow setting and querying the scheduling at runtime.
Tasks in OpenMP
Non-loop parallelism: sections/section.
Tasks: introduced in OpenMP 3.0 to enable non-loop or unbounded-loop parallelism (e.g., while loops).
Syntax: C/C++: #pragma omp task; Fortran: !$omp task
A flexible alternative to the sections directive: tasks are created dynamically, while the number of blocks in sections is static. Tasks must be created within a parallel region and should usually be enclosed in a single construct.
void processList(List* list) {
  while (list != NULL) {
    process(list);
    list = list->next;
  }
}
While loop

int factorial(int n) {
  if (n == 0)
    return 1;
  return n * factorial(n - 1);
}
Recursive function
Tasks in OpenMP

void processList(List* list) {
  while (list != NULL) {
    #pragma omp task firstprivate(list)
    process(list);     /* one task per list element */
    list = list->next;
  }
}

int main() {
  #pragma omp parallel
  {
    #pragma omp single
    {
      processList(list);
    }
  }
  return 0;
}
C/C++: while loop
Task parallelism in OpenMP
Definition of task:
a unit of independent work (block of code, like sections).
a direct or deferred execution of work performed by one thread of the team.
composed of code to be executed, and the data environment (constructed at creation time).
tied to a thread: only this thread can execute the task.
useful for unbounded loops, recursive algorithms, work on linked lists (pointer) …
Locks in OpenMP: C/C++
A more flexible way of implementing “critical regions”. Lock variable type: omp_lock_t, passed by address.

omp_init_lock(&lockvar);          Initialize lock
omp_destroy_lock(&lockvar);       Deallocate lock
omp_set_lock(&lockvar);           Blocks calling thread until the lock is available
omp_unset_lock(&lockvar);         Release lock
intv = omp_test_lock(&lockvar);   Test and try to set the lock (returns 1 on success, else 0)
Locks in OpenMP: Fortran
A more flexible way of implementing “critical regions”. The lock variable has to be an integer (of kind omp_lock_kind).

call omp_init_lock(lockvar)        Initialize lock
call omp_destroy_lock(lockvar)     Deallocate lock
call omp_set_lock(lockvar)         Blocks calling thread until the lock is available
call omp_unset_lock(lockvar)       Release lock
success = omp_test_lock(lockvar)   Test and try to set the lock (returns .true. on success)
Mechanism of locks in OpenMP
int main() {
  omp_lock_t lockvar;
  int id;
  omp_init_lock(&lockvar);
  #pragma omp parallel shared(lockvar) private(id)
  {
    id = omp_get_thread_num();
    while (!omp_test_lock(&lockvar)) {
      skip(id);   /* we do not yet have the lock, so do something else */
    }
    work(id);     /* we now have the lock and can do the work */
    printf("Key given back by %d\n", id);
    omp_unset_lock(&lockvar);
  }
  omp_destroy_lock(&lockvar);
  return 0;
}
Mechanism of locks: example in C/C++
Some thoughts on parallelizing programs
Is the serial version of the code well optimized?
Which compiler settings might increase the performance of the code?
Estimate scalability of your code.
Which parts of the code are time consuming?
Is the number of parallel regions as small as possible?
Was the outermost loop of nested loops parallelized?
Use the “nowait“ clause whenever possible.
Is the workload well balanced over all threads?
Avoid false sharing effects and race conditions.
Data dependencies and race conditions
Most important rule: parallelization of the code must not affect the correctness of the program!
In loops:
The results of each single iteration must not depend on each other.
Race conditions must be avoided: the results must not be affected by the order of the threads.
Correctness of the program must not depend on the number of threads.
Correctness must not depend on the work scheduling.
Race condition: threads read and write the same object at the same time, giving unpredictable results (sometimes it works, sometimes not): wrong answers without a warning signal! When correctness depends on the order of read/write accesses, thread synchronization is needed to ensure that readers do not get ahead of writers.
Synchronization: barriers, critical regions, …
Note: be careful with synchronization (it reduces the performance).
PBS Script for OpenMP programs

#!/bin/bash
#PBS -S /bin/bash
#PBS -l nodes=1:ppn=4
#PBS -l mem=2000mb
#PBS -l walltime=24:00:00
#PBS -M <your-valid-email>
#PBS -m abe

# Load the compiler module and/or your application module.

cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"
export OMP_NUM_THREADS=$PBS_NUM_PPN
./your_openmp_exec < input_file > output_file
echo "Program finished at: `date`"

Resources: nodes=1; ppn=1 up to the maximum number of CPUs (hardware) per node; for example, nodes=1:ppn=4.

# On systems where $PBS_NUM_PPN is not available, one could use:
CORES=`/bin/awk 'END {print NR}' $PBS_NODEFILE`
export OMP_NUM_THREADS=$CORES
Useful links and more readings
OpenMP: http://www.openmp.org/
Compute Canada Wiki: https://docs.computecanada.ca/wiki/OpenMP
WestGrid: https://www.westgrid.ca/support/programming
Reference cards: http://www.openmp.org/specifications/
OpenMP Wiki: https://en.wikipedia.org/wiki/OpenMP
Examples: http://www.openmp.org/updates/openmp-examples-4-5-published/
Contact: [email protected]
WestGrid events: https://www.westgrid.ca/events