
Practical Issues in OpenMP

M. D. Jones, Ph.D.

Center for Computational Research
University at Buffalo, State University of New York

High Performance Computing I, 2012


Scheduling: Load Balance

Loop Scheduling

The way in which iterations of a parallel loop get assigned to threads is determined by the loop's schedule.

Default scheduling typically assumes an equal load balance, but it is frequently the case that different iterations have entirely different computational loads.

Load imbalance can cause significant synchronization delays.


Static vs. Dynamic Scheduling

Basic distinction of loop scheduling:

Static: iteration assignment to threads is determined as a function of iteration/thread number

Dynamic: assignment can vary at run-time; iterations are handed out to threads as they complete previously assigned iterations

Iterations in both schemes can be assigned in chunks


SCHEDULE Clause

The general form of the SCHEDULE clause:


schedule(type [,chunk])

where type can be one of:

static: without chunk, threads are given an equally sized subdivision of the iterations (exact placement is implementation-dependent); with chunk, the iterations are divided into chunk-sized pieces, and allocation of any remainder is implementation-dependent

dynamic: iterations are divided into chunks (the default chunk size is one if chunk is not present) and assigned to threads dynamically at run-time


guided: the first chunk size is determined by the implementation, then subsequently decreased exponentially (the rate is implementation-dependent) down to the minimum size specified by chunk (default 1)

runtime: chunk must not appear; the schedule is determined by the value of the environment variable OMP_SCHEDULE

auto: (OpenMP 3.0) gives the implementation the freedom to choose the best mapping of iterations to threads
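As a minimal sketch of the syntax (my example, not from the original slides; the array size and chunk values are illustrative), the clause attaches to the work-shared loop, and schedule(runtime) defers the choice to OMP_SCHEDULE:

#include <omp.h>

#define N 100000

void scale(double *a, double *b)
{
    int i;

    /* hand iterations out on demand in chunks of 16 */
    #pragma omp parallel for schedule(dynamic, 16)
    for (i = 0; i < N; i++) {
        a[i] *= 2.0;
    }

    /* defer the choice to run-time, e.g. OMP_SCHEDULE="guided,4" */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < N; i++) {
        b[i] += a[i];
    }
}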


Scheduling Considerations

Things to consider when choosing between scheduling options:

Dynamic schedules can better balance the load between threads, but typically have higher overhead costs (synchronization costs per chunk)

Guided schedules have the advantage of typically requiring fewer chunks (which translates to fewer synchronizations); typically the initial chunk size is roughly the number of iterations divided by the number of threads

Simple static has the lowest overhead, but is the most susceptible to load imbalance


OpenMP Issues & Gotchas: Nesting

Easy to Use?

OpenMP does not force the programmer to explicitly manage communication or how the program data is mapped onto individual processors - sounds great ...

An OpenMP program can easily run into common SMP programming errors, usually from resource contention issues.


Directive Nesting

DO/for, SECTIONS, SINGLE, and WORKSHARE directives that bind to the same parallel region are not allowed to be nested.

DO/for, SECTIONS, SINGLE, and WORKSHARE directives are not allowed in the dynamic extent of CRITICAL, ORDERED, and MASTER directives.

BARRIER and MASTER are not permitted in the dynamic extent of DO/for, SECTIONS, SINGLE, WORKSHARE, MASTER, CRITICAL, and ORDERED directives.

ORDERED must appear in the dynamic extent of a DO or PARALLEL DO with an ORDERED clause; ORDERED is not allowed in the dynamic extent of SECTIONS, SINGLE, WORKSHARE, CRITICAL, and MASTER (see the sketch below).
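For instance, the last rule means an ORDERED construct must bind to a loop that carries the ORDERED clause; a minimal C sketch (my example, not from the original slides):

#include <stdio.h>

void print_in_order(int n, const double *a)
{
    int i;
    /* ORDERED is only legal inside a loop with the ordered clause */
    #pragma omp parallel for ordered
    for (i = 0; i < n; i++) {
        double v = a[i] * a[i];    /* may execute out of order */
        #pragma omp ordered
        printf("%d %f\n", i, v);   /* emitted in loop order */
    }
}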


OpenMP Issues & Gotchas: Shared vs. Private

Data Storage Defaults

Most variables are SHARED by default:

Fortran: common blocks, SAVE variables, MODULE variables.
C: file-scope variables, static variables.

with some exceptions ...

stack variables in sub-programs called from a PARALLEL region
automatic variables within a statement block
loop indices (in C, just on "work-shared" loops)

A small C sketch of these defaults appears below.
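In this sketch (my example, not from the original slides), the file-scope variable is shared, while the stack variable in the called routine and the work-shared loop index are private:

#include <stdio.h>
#include <omp.h>

int counter = 0;                 /* file scope: SHARED by default */

void work(void)
{
    int tmp;                     /* stack variable in a routine called from
                                    a parallel region: PRIVATE per thread */
    tmp = 1 + omp_get_thread_num();
    #pragma omp atomic
    counter += tmp;              /* shared data still needs protection */
}

int main(void)
{
    int i;                       /* index of a work-shared loop: private */
    #pragma omp parallel for
    for (i = 0; i < 8; i++)
        work();
    printf("counter = %d\n", counter);
    return 0;
}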


Data Storage Gotchas

Assumed-size and assumed-shape arrays cannot be privatized.

Fortran allocatable arrays (and pointers) can be PRIVATE or SHARED, but not FIRSTPRIVATE or LASTPRIVATE.

Constituent elements of a PRIVATE (FIRSTPRIVATE/LASTPRIVATE) named common block cannot be declared in another data scope clause.

Privatized elements of shared common blocks are no longer storage-equivalent with the common block.


OpenMP Issues & Gotchas: Synchronization & Barriers

Synchronization Awareness

Implied Barriers:

1 END PARALLEL
2 END DO (unless NOWAIT)
3 END SECTIONS (unless NOWAIT)
4 END CRITICAL
5 END SINGLE (unless NOWAIT)


Implied Flushes:

1 BARRIER
2 CRITICAL/END CRITICAL
3 END DO
4 END PARALLEL
5 END SECTIONS
6 END SINGLE
7 ORDERED/END ORDERED


Synchronization Costs

Overhead for synchronization on an SGI Origin 2000 (MIPS 250MHz R10000 processors):

Nthreads   PARALLEL [µs]   DO [µs]   ATOMIC [µs]   REDUCTION [µs]
    1           2.0          2.3        0.1             2.1
    2           8.4          7.8        0.4            11.0
    4          11.6          6.8        1.5            20.7
    8          28.0         14.1        3.1            31.0

10µs? Isn't that pretty small?

10µs × 250MHz = 2500 clock cycles of lost computation.


Synchronization Costs (cont’d)

Overhead for synchronization on an SGI Altix 3700 (Intel 1300MHz Itanium2 processors):

Nthreads   PARALLEL [µs]   DO [µs]   ATOMIC [µs]   REDUCTION [µs]
    1           0.3          0.3        0.1             0.5
    2           2.3          2.1        0.4             2.6
    4           5.9          4.7        0.4             9.6
    8           6.6          6.8        0.5            24.1
   16          10.3         10.7        0.6            60.7
   32          19.2         19.3        0.7           132
   64          41.8         40.9        0.7           316

10µs? Isn't that pretty small?

10µs × 1300MHz = 13000 clock cycles of lost computation.


Synchronization Costs (cont’d)

Overhead for synchronization on an Intel "Clovertown" (dual quad-core 1.866GHz Xeon processors):

Nthreads   PARALLEL [µs]   DO [µs]   ATOMIC [µs]   REDUCTION [µs]
    1           0.2          0.2        0.02            0.2
    2           1.6          1.7        0.08            2.0
    4           2.3          2.4        0.14            3.1
    8           3.8          3.9        0.52            5.8

5.8µs × 1866MHz = 10823 clock cycles of lost computation.

Overhead for synchronization on an Intel "Nehalem" (dual quad-core 2.8GHz Xeon processors):

Nthreads   PARALLEL [µs]   DO [µs]   ATOMIC [µs]   REDUCTION [µs]
    1           0.1          0.1        0.01            0.1
    2           1.1          1.1        0.04            1.2
    4           1.2          1.2        0.05            1.5
    8           1.7          1.8        0.05            2.5

2.5µs × 2800MHz = 7000 clock cycles of lost computation.


Synchronization Costs (cont’d)

Overhead for synchronization on a 32-core Intel "Westmere" 2130MHz system (4 sockets, 8 cores/socket, Xeon E7-4530):

Nthreads   PARALLEL [µs]   DO [µs]   ATOMIC [µs]   REDUCTION [µs]
    1           0.1          0.2        0.02            0.2
    2           2.2          2.3        0.04            2.7
    4           2.9          3.1        0.07            4.2
    8           3.9          3.9        0.07            6.8
   16           4.8          5.2        0.07           12.2
   32          15.4          6.9        0.07           24.9

25µs × 2130MHz = 53250 clock cycles of lost computation.

Not exactly great progress ...


Common Errors

Race conditions: the outcome of the program depends on the detailed scheduling of the thread team ("the answer is different every time I run the code!").

Deadlock: threads wait forever for a locked resource to become free.


Race Conditions

What is wrong with this code fragment?

real tmp, x
!$OMP PARALLEL
!$OMP DO REDUCTION(+:x)
do i = 1, 10000
   tmp = dosomework(i)
   x = x + tmp
end do
!$OMP END DO
y(iam) = work(x, iam)
!$OMP END PARALLEL


The programmer did not make tmp PRIVATE; hence the results are unpredictable.


What about now?

real tmp, x
!$OMP PARALLEL
!$OMP DO REDUCTION(+:x), PRIVATE(tmp)
do i = 1, 10000
   tmp = dosomework(i)
   x = x + tmp
end do
!$OMP END DO NOWAIT
y(iam) = work(x, iam)
!$OMP END PARALLEL


The value of x is not dependable without the barrier at the end of the DO construct - be careful with NOWAIT!


Deadlock

A somewhat artificial example of deadlock - watch that resources are freed if you are using locks!

call OMP_INIT_LOCK(lock0)
!$OMP PARALLEL SECTIONS
!$OMP SECTION
call OMP_SET_LOCK(lock0)
iret = dolotsofwork()
if (iret .le. tol) then
   call OMP_UNSET_LOCK(lock0)
else
   call error(iret)
endif
!$OMP SECTION
call OMP_SET_LOCK(lock0)
call compute(A, B, iret)
call OMP_UNSET_LOCK(lock0)
!$OMP END PARALLEL SECTIONS


OpenMP Issues & Gotchas: Load Balancing


Load Balancing

Consider the following code fragment - can you see why it is not efficient to parallelize on the outer loop?

do i = 1, N
   do j = 1, i
      a(j,i) = a(j,i) + b(j)*c(i)
   end do
end do


One strategy - break the loop up into interleaved chunks:

!$OMP PARALLEL SHARED(num_threads)
!$OMP SINGLE
num_threads = OMP_GET_NUM_THREADS()
!$OMP END SINGLE NOWAIT
!$OMP END PARALLEL

!$OMP PARALLEL DO PRIVATE(i,j,k)
do k = 1, num_threads
   do i = k, n, num_threads
      do j = 1, i
         a(j,i) = a(j,i) + b(j)*c(i)
      end do
   end do
end do


Another equivalent (and somewhat cleaner!) way,

!$OMP PARALLEL DO PRIVATE(i,j) SCHEDULE(static,4)
do i = 1, n
   do j = 1, i
      a(j,i) = a(j,i) + b(j)*c(i)
   end do
end do


Toward Coarser Grains

What is wrong with fine grain (loop) parallelism?

Overhead kills performance

Not scalable to a large number of threads - remember Amdahl's law:

S(N_p) = (τ_s + τ_p) / (τ_s + τ_p / N_p) = 1 / (s + (1 − s)/N_p),

where τ_s and τ_p are the serial and parallelizable execution times and s = τ_s/(τ_s + τ_p) is the serial fraction.
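For example (my numbers, purely illustrative): a serial fraction of s = 0.05 caps the speedup at 1/s = 20 however many threads are used, and at N_p = 8 already yields only S(8) = 1/(0.05 + 0.95/8) ≈ 5.9.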


OpenMP Issues & Gotchas: Coarsening


Coarsening

Strategies for increasing OpenMP performance:

do more work per parallel region, and decrease the fraction of time spent in sequential code

reduce synchronization across threads

combine multiple PARALLEL DO directives into a larger parallel region (with work-sharing constructs therein), as sketched below
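A minimal sketch of the last point (my example, not from the original slides): two back-to-back parallel loops share one parallel region, so the thread team is forked and joined once instead of twice:

#include <omp.h>

void update(int n, double *a, double *b)
{
    /* one parallel region, two work-shared loops inside it */
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * i;

        #pragma omp for           /* implied barrier keeps a[] consistent */
        for (int i = 0; i < n; i++)
            b[i] = a[i] + 1.0;
    }
}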


Domain Decomposition 

Break the data domain into sub-domains,

Compute the loop bounds once, depending on the number of threads (a priori loop decomposition),

Reduces loop overhead, but shifts the burden from the compiler back to the programmer,

Implements the Single Program Multiple Data (SPMD) model.


Coarse Grain SPMD Example

program spmd
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(N,global)
   num_threads = OMP_GET_NUM_THREADS()
   iam = OMP_GET_THREAD_NUM()
   ichunk = N / num_threads
   ibegin = iam*ichunk
   iend = ibegin + ichunk - 1
   call lotsofwork(ibegin, iend, local)
!$OMP ATOMIC
   global = global + local
!$OMP END PARALLEL
   print *, global
end program spmd


[Figure: schematic of the spmd program - one !$OMP PARALLEL region with DEFAULT(PRIVATE) and SHARED(N,global); each thread holds its own private copies of ibegin, iend, and local.]


SPMD Implementation

Manual decomposition - valid for any number of threads (make sure that the cost/benefit ratio is high enough!)

Same program on each thread, but a different (PRIVATE) sub-domain of the program data.

Synchronization is necessary to handle global variable updates (ATOMIC is usually more efficient than CRITICAL).


OpenMP Issues & Gotchas: Thread-safety


Thread Safety Issues

Certainly one must be careful about hidden state issues when calling functions/routines from multiple threads:

MPI - check your level of thread safety with MPI_Init_thread and program accordingly (see the sketch below)

Other functions - it is up to you to check and ensure that functions are thread-safe (there is danger in treating any function as a black box)
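A minimal C sketch of the MPI check (my example, not from the original slides):

#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;

    /* request full thread support; the library reports what it grants */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        /* e.g. only MPI_THREAD_FUNNELED: confine MPI calls to the
           master thread, outside of (or guarded within) parallel regions */
    }

    MPI_Finalize();
    return 0;
}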


Thread-safe Example


From the rand man page (section 3, RHEL 5):

#include <stdlib.h>

int rand(void);
int rand_r(unsigned int *seedp);
void srand(unsigned int seed);

DESCRIPTION
The rand() function returns a pseudo-random integer between 0 and RAND_MAX.

The srand() function sets its argument as the seed for a new sequence of pseudo-random integers to be returned by rand(). These sequences are repeatable by calling srand() with the same seed value.

If no seed value is provided, the rand() function is automatically seeded with a value of 1.

The function rand() is not reentrant or thread-safe, since it uses hidden state that is modified on each call. This might just be the seed value to be used by the next call, or it might be something more elaborate. In order to get reproducible behaviour in a threaded application, this state must be made explicit. The function rand_r() is supplied with a pointer to an unsigned int, to be used as state. This is a very small amount of state, so this function will be a weak pseudo-random generator. Try drand48_r(3) instead.
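To make that state explicit in an OpenMP code, each thread can carry its own seed; a minimal sketch (my example, not from the original slides; rand_r is a POSIX function):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* per-thread state: thread-safe and reproducible, unlike rand(),
           whose hidden state is shared by all threads */
        unsigned int seed = 1234u + (unsigned int) omp_get_thread_num();
        printf("thread %d drew %d\n", omp_get_thread_num(), rand_r(&seed));
    }
    return 0;
}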


OpenMP Issues & Gotchas: C/C++ Max/Min


Lack of Max/Min in C/C++

Due to the lack of an intrinsic max/min function in C/C++, there is no built-in max/min reduction operator in OpenMP for these languages. One way around this is to have each thread track its own max/min value, and then update the global max/min under a protective directive:

#pragma omp parallel private(my_amax)
{
   amax = 0;
   my_amax = 0;
   /* use a private variable for the max per thread */
   #pragma omp for
   for (i = 0; i <= N; i++) {
      if (a[i] > my_amax) {
         my_amax = a[i];
      }
   }
   /* global update: requires only num_threads critical evaluations */
   #pragma omp critical
   if (my_amax > amax) {
      amax = my_amax;
   }
}


Max/Min with Locks


Another way to do max/min, this time with OpenMP locks:

omp_lock_t MAXLOCK;

omp_init_lock(&MAXLOCK);

#pragma omp parallel for
for (i = 0; i < Numberofelements; i++) {
   if (array[i] > cur_max) {      /* cheap unlocked test first ... */
      omp_set_lock(&MAXLOCK);
      if (array[i] > cur_max) {   /* ... then re-test under the lock */
         cur_max = array[i];
      }
      omp_unset_lock(&MAXLOCK);
   }
}

/* destroy the lock */
omp_destroy_lock(&MAXLOCK);


Example - Compare Max/Min with Critical vs. Lock


Compare the two methods - find the max in a randomly seeded array of varying size, serially and with both the OpenMP critical and lock methods (the outcome should be pretty obvious from the two coding examples, but you can tinker with them to make the distinction less clear).


OpenMP Issues & Gotchas: Thread Affinity

Thread Affinity


There are many times in which you may wish or need to specify how your compute threads get mapped to the physical (do not confuse physical with logical here) CPU cores:

Contention for cache memory

Contention for network interfaces (especially when combined with message-passing)


GNU Options


The GNU compilers currently support an option for CPU affinity, as well as an option for adjusting the available stack space per thread:

GOMP_CPU_AFFINITY: a space-separated or comma-separated list of CPUs; entries are single CPU numbers in any order, a range of CPUs (M-N), or a range with a stride (M-N:S). Note that cores are counted starting from 0, and that this view of CPU cores reflects that of the operating system, which in many cases is not the full picture of the underlying hardware topology.

GOMP_STACKSIZE: sets the default thread stack size in kilobytes.
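For example, on a hypothetical 8-core node (bash syntax; the values are illustrative only):

export GOMP_CPU_AFFINITY="0 2 4 6"   # pin threads to the even-numbered cores
export GOMP_CPU_AFFINITY="0-6:2"     # the same set, written as a strided range
export GOMP_STACKSIZE=8192           # 8 MB of stack per thread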


Intel Options


The Intel compilers support a much richer set of utilities for controlling the placement of threads:

KMP_AFFINITY is the environment variable used, although the Intel run-time will also respect the GOMP_CPU_AFFINITY variable at a lower level of precedence.

Details can be found (and are frequently changed as the architecture evolves) in the compiler documentation; the latest as of this writing is at:
http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/fortran/win/compiler_f/optaps/common/optaps_openmp_thread_affinity.htm

Your best bet is to review the documentation for the compiler that you are trying to use.

Extremely helpful when simultaneous multi-threading (also known as hyper-threading) is turned on.
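A few illustrative settings (the exact keywords are version-dependent, so treat these as sketches and check the documentation for your compiler):

export KMP_AFFINITY="granularity=fine,compact"             # pack threads onto adjacent cores
export KMP_AFFINITY="granularity=fine,scatter"             # spread threads across sockets
export KMP_AFFINITY="verbose,explicit,proclist=[0,2,4,6]"  # explicit list, with a placement report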


OpenMP Issues & Gotchas: OpenMP vs. MPI

Advantages over Message Passing


Domain decomposition methodology is the same, but implementing it in OpenMP can be easier, as global data can be read without any need for synchronization or message passing.

Parallelize only the parts of the code that require it (profiling is key!).

Pre- and post-processing can be left sequential.


Best of Both Worlds?


How about combining OpenMP with Message Passing?

Message Passing between machines, OpenMP within.

Allow application-dependent mixing within a shared memory environment.

Coarse grain with Message Passing, fine grain with OpenMP.
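A minimal hybrid sketch (my example, not from the original slides): one MPI rank per machine, OpenMP threads within it, with MPI traffic funneled through the master thread:

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* fine-grain OpenMP work goes here ... */
        #pragma omp master
        printf("rank %d running %d threads\n", rank, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}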


Practical OpenMP: Compiler Support

Platforms & Compilers


This table lists the various compiler suites available on the production computing platforms along with their OpenMP compliance:

Platform       Compiler                 OMP                                  Invocation
Linux IA64     Gnu (g77/gcc/g++)        No                                   -
               Intel (ifort/icc/icpc)   2.5                                  -openmp -openmp_report
Linux x86_64   Gnu* (g77/gcc/g++)       2.5 (>4.1), 3.0 (>4.4), 3.1 (>4.7)   -fopenmp
               PGI (pgf90/pgcc/pgCC)    2.5, 3.0 (≥12.0)                     -mp
               Intel (ifort/icc/icpc)   2.5, 3.0 (≥11.0), 3.1 (≥12.1)        -openmp -openmp_report

* The Gnu compiler suite supports OpenMP for versions >4.2, although some Linux distributions (e.g. RedHat) have backported support to 4.1.


Practical OpenMP: Example - Simple

Simple OpenMP example


program simple
  USE omp_lib  ! comment out for pgf90 - if not openmp 2.0 compliant
  implicit none
  integer :: myid, nthreads, nprocs
  ! include this declaration for pgf90
  ! integer :: OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM, OMP_GET_NUM_PROCS

!$OMP PARALLEL default(none) private(myid) &
!$OMP shared(nthreads,nprocs)
  !
  ! Determine the number of threads and their id
  !
  myid = OMP_GET_THREAD_NUM()
  nthreads = OMP_GET_NUM_THREADS()
  nprocs = OMP_GET_NUM_PROCS()
!$OMP BARRIER
  if (myid == 0) print *, 'Number of available processors: ', nprocs
  print *, 'myid = ', myid, ' nthreads ', nthreads
!$OMP END PARALLEL
end program simple


Altix - simple example


[jonesm@lennon ~/d_omp]$ module load intel
[jonesm@lennon ~/d_omp]$ ifort -O3 -o simple_ifort -openmp -openmp_report2 simple.f90
simple.f90(19): (col. 6) remark: OpenMP multithreaded code generation BARRIER was successful.
simple.f90(9): (col. 6) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
[jonesm@lennon ~/d_omp]$ setenv OMP_NUM_THREADS 4
[jonesm@lennon ~/d_omp]$ ./simple_ifort
myid = 1 nthreads 4
myid = 3 nthreads 4
myid = 2 nthreads 4
Number of available processors: 4
myid = 0 nthreads 4


U2 - simple example


[jonesm@bono ~/d_omp]$ module load intel
[jonesm@bono ~/d_omp]$ ifort -O3 -o simple_ifort -openmp simple.f90
[jonesm@bono ~/d_omp]$ setenv OMP_NUM_THREADS 4
[jonesm@bono ~/d_omp]$ ./simple_ifort
Number of available processors: 4
myid = 1 nthreads 4
myid = 0 nthreads 4
myid = 2 nthreads 4
myid = 3 nthreads 4
[jonesm@bono ~/d_omp]$ module load pgi
[jonesm@bono ~/d_omp]$ pgf90 -O3 -mp -o simple_pgi simple.f90
[jonesm@bono ~/d_omp]$ ./simple_pgi
Number of available processors: 4
myid = 0 nthreads 4
myid = 3 nthreads 4
myid = 1 nthreads 4
myid = 2 nthreads 4


U2 - simple example


[k07n14:~/d_omp]$ gcc -fopenmp -o hello2 hello2.c
[k07n14:~/d_omp]$ export OMP_NUM_THREADS=1
[k07n14:~/d_omp]$ ./hello2
Hello World from thread 0
There are 1 threads
[k07n14:~/d_omp]$ export OMP_NUM_THREADS=4
[k07n14:~/d_omp]$ ./hello2
Hello World from thread 1
Hello World from thread 3
Hello World from thread 0
Hello World from thread 2
There are 4 threads
[k07n14:~/d_omp]$ ./simple_ifort
myid = 3 nthreads 4
myid = 2 nthreads 4
Number of available processors: 64
myid = 0 nthreads 4
myid = 1 nthreads 4


Practical OpenMP: Example - Molecular Dynamics

MD Sample Code


Let's take this as a trial of parallelizing a real code:

Take the sample MD code from www.openmp.org

Modify it slightly for our environment (uncomment the line for use omp_lib, add conditional compilation for the API function calls ...)

Then do a quick profile to see where the code is spending its time ...


Compile for quick profiling and run to generate run-time profile:

1 [ k 08n08a : ~ / d_omp ] $ i f o r t −O3 −o md1500−p g . x −g −p md1500. f902 [ k 08 n0 8a : ~ / d_omp ] $ / u s r / b i n / t i m e . / md1500−p g . x3 September 19 2011 1 : 39 : 35 . 15 6 PM

45 MD6 A m ol ec ul ar dynamics program .789 100 0.112109E+07 0.893929 0.158956E−10

10 200 0.112109E+07 3.63376 −0.189220E−1011 300 0.112108E+07 8.23009 −0.115307E−0912 400 0.112107E+07 14.7004 −0.304663E−09

13 500 0.112107E+07 23.0692 −0.619601E−0914 600 0.112106E+07 33.3684 −0.109145E−0815 700 0.112104E+07 45.6372 −0.175113E−0816 800 0.112103E+07 59.9221 −0.263625E−0817 900 0.112101E+07 76.2778 −0.377839E−0818 1000 0.112099E+07 94.7666 −0.521232E−081920 MD21 Normal end of e x ec ut io n .

2223 Sep tem ber 19 2011 1 : 4 2: 1 3 . 39 7 PM24 1 5 8.2 1 u se r 0 .0 0 syste m 2 :3 8 .2 5 e l a p sed 99%CPU (0 a vg t e xt +0 a vg da ta 8 03 2 ma xre si d e n t ) k25 0 i n p u t s +480 o u t p u t s ( 0 m a j o r +539 m i no r ) p a g e f a u l t s 0 swaps


Simple analysis based on profile:

[k08n08a:~/d_omp]$ gprof --line ./md1500-pg.x gmon.out > report-line-md1500.txt
[k08n08a:~/d_omp]$ less report-line-md1500.txt
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls  ns/call  ns/call  name
44.51     62.31     62.31                              __libm_sse2_sincos
18.27     87.89     25.58                              compute (md1500.f90:194 @ 4035b4)
 6.18     96.54      8.65                              compute (md1500.f90:194 @ 40359f)
 6.18    105.18      8.65                              compute (md1500.f90:168 @ 4035a9)
 3.63    110.26      5.08  2250748500  2.25   2.25     dist_ (md1500.f90:266 @ 403680)
 3.20    114.74      4.48                              compute (md1500.f90:167 @ 40355d)
 3.11    119.10      4.36                              compute (md1500.f90:167 @ 403544)
 2.55    122.66      3.57                              compute (md1500.f90:192 @ 403561)
 2.06    125.55      2.89                              compute (md1500.f90:192 @ 403551)
 1.66    127.87      2.32                              compute (md1500.f90:194 @ 403569)
 1.39    129.82      1.95                              compute (md1500.f90:188 @ 403521)
 1.10    131.37      1.55                              dist (md1500.f90:300 @ 4036c1)


... and now let us take a look at the critical code sections,

164 !  This potential is a harmonic well which smoothly saturates to a
165 !  maximum value at PI/2.
166 !
167   v(x) = (sin(min(x,PI2)))**2
168   dv(x) = 2.0D+00 * sin(min(x,PI2)) * cos(min(x,PI2))
169
170   pot = 0.0D+00
171   kin = 0.0D+00


and, not too surprisingly, it is the loop over particles that updates forces and momenta that is responsible for most of the consumed time:

178   do i = 1, np
179 !
180 !  Compute the potential energy and forces.
181 !
182     f(1:nd,i) = 0.0D+00
183
184     do j = 1, np
185
186       if ( i /= j ) then
187
188         call dist ( nd, pos(1,i), pos(1,j), rij, d )
189 !
190 !  Attribute half of the potential energy to particle J.
191 !
192         pot = pot + 0.5D+00 * v(d)
193
194         f(1:nd,i) = f(1:nd,i) - rij(1:nd) * dv(d) / d


Adding OpenMP directives to this loop:

173 !$OMP parallel do &
174 !$OMP default(shared) &
175 !$OMP shared(nd) &
176 !$OMP private(i,j,rij,d) &
177 !$OMP reduction(+ : pot, kin)
178   do i = 1, np
179 !
180 !  Compute the potential energy and forces.
181 !
182     f(1:nd,i) = 0.0D+00
183
184     do j = 1, np
185
186       if ( i /= j ) then
187
188         call dist ( nd, pos(1,i), pos(1,j), rij, d )
189 !
190 !  Attribute half of the potential energy to particle J.
191 !
192         pot = pot + 0.5D+00 * v(d)
193
194         f(1:nd,i) = f(1:nd,i) - rij(1:nd) * dv(d) / d


Using these OpenMP directives, what kind of speedup can we get?


[k14n08b:~/d_omp]$ module load intel/11.1
[k14n08b:~/d_omp]$ ifort -O3 -o md1500.no-omp.x md1500.f90
[k14n08b:~/d_omp]$ ifort -O3 -openmp -openmp_report2 -o md1500.omp.x md1500.f90
md1500.f90(65): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
md1500.f90(94): (col. 10) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
md1500.f90(173): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
md1500.f90(357): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
[k14n08b:~/d_omp]$ /usr/bin/time ./md1500.no-omp.x
September 19 2011 3:34:34.561 PM

MD
A molecular dynamics program.

 100  0.112109E+07   0.893929   0.158956E-10
 200  0.112109E+07   3.63376   -0.189220E-10
 300  0.112108E+07   8.23009   -0.115307E-09
 400  0.112107E+07  14.7004    -0.304663E-09
 500  0.112107E+07  23.0692    -0.619601E-09
 600  0.112106E+07  33.3684    -0.109145E-08
 700  0.112104E+07  45.6372    -0.175113E-08
 800  0.112103E+07  59.9221    -0.263625E-08
 900  0.112101E+07  76.2778    -0.377839E-08
1000  0.112099E+07  94.7666    -0.521232E-08

MD
Normal end of execution.

September 19 2011 3:37:05.843 PM
151.26user 0.00system 2:31.28elapsed 99%CPU (0avgtext+0avgdata 4688maxresident)k
0inputs+0outputs (0major+329minor)pagefaults 0swaps


[k14n08b:~/d_omp]$ export OMP_NUM_THREADS=1
[k14n08b:~/d_omp]$ /usr/bin/time ./md1500.omp.x
September 19 2011 4:38:48.788 PM

MD
A molecular dynamics program.
This is thread 0 of 1

 100  0.112109E+07   0.893929   0.158956E-10
 200  0.112109E+07   3.63376   -0.189222E-10
 300  0.112108E+07   8.23009   -0.115307E-09
 400  0.112107E+07  14.7004    -0.304663E-09
 500  0.112107E+07  23.0692    -0.619601E-09
 600  0.112106E+07  33.3684    -0.109145E-08
 700  0.112104E+07  45.6372    -0.175113E-08
 800  0.112103E+07  59.9221    -0.263625E-08
 900  0.112101E+07  76.2778    -0.377839E-08
1000  0.112099E+07  94.7666    -0.521232E-08

MD
Normal end of execution.

September 19 2011 4:41:41.716 PM
172.90user 0.00system 2:53.12elapsed 99%CPU (0avgtext+0avgdata 7632maxresident)k
0inputs+0outputs (0major+519minor)pagefaults 0swaps

So the OpenMP overhead is reflected in S(1) ≈ 0.87.


[k14n08b:~/d_omp]$ export OMP_NUM_THREADS=2
[k14n08b:~/d_omp]$ /usr/bin/time ./md1500.omp.x
September 19 2011 4:22:46.538 PM

MD
A molecular dynamics program.
This is thread 0 of 2
This is thread 1 of 2

 100  0.112109E+07   0.893929   0.158883E-10
 200  0.112109E+07   3.63376   -0.189195E-10
 300  0.112108E+07   8.23009   -0.115309E-09
 400  0.112107E+07  14.7004    -0.304673E-09
 500  0.112107E+07  23.0692    -0.619611E-09
 600  0.112106E+07  33.3684    -0.109145E-08
 700  0.112104E+07  45.6372    -0.175112E-08
 800  0.112103E+07  59.9221    -0.263624E-08
 900  0.112101E+07  76.2778    -0.377839E-08
1000  0.112099E+07  94.7666    -0.521232E-08

MD
Normal end of execution.

September 19 2011 4:24:14.577 PM
175.85user 0.03system 1:28.06elapsed 199%CPU (0avgtext+0avgdata 7920maxresident)k
0inputs+0outputs (0major+539minor)pagefaults 0swaps

For 2 threads, we are up to S(2) ≈ 1.7.


[k14n08b:~/d_omp]$ export OMP_NUM_THREADS=4
[k14n08b:~/d_omp]$ /usr/bin/time ./md1500.omp.x
September 19 2011 4:31:19.409 PM

MD
A molecular dynamics program.
This is thread 0 of 4
This is thread 3 of 4
This is thread 2 of 4
This is thread 1 of 4

 100  0.112109E+07   0.893929   0.158875E-10
 200  0.112109E+07   3.63376   -0.189224E-10
 300  0.112108E+07   8.23009   -0.115307E-09
 400  0.112107E+07  14.7004    -0.304674E-09
 500  0.112107E+07  23.0692    -0.619610E-09
 600  0.112106E+07  33.3684    -0.109145E-08
 700  0.112104E+07  45.6372    -0.175112E-08
 800  0.112103E+07  59.9221    -0.263624E-08
 900  0.112101E+07  76.2778    -0.377839E-08
1000  0.112099E+07  94.7666    -0.521232E-08

MD
Normal end of execution.

September 19 2011 4:32:03.336 PM
175.37user 0.06system 0:44.03elapsed 398%CPU (0avgtext+0avgdata 8176maxresident)k
0inputs+0outputs (0major+560minor)pagefaults 0swaps

For 4 threads, we are up to S(4) ≈ 3.4.


[k14n08b:~/d_omp]$ export OMP_NUM_THREADS=8
[k14n08b:~/d_omp]$ /usr/bin/time ./md1500.omp.x
September 19 2011 4:33:16.000 PM

MD
A molecular dynamics program.
This is thread 0 of 8
This is thread 4 of 8
This is thread 6 of 8
This is thread 5 of 8
This is thread 7 of 8
This is thread 3 of 8
This is thread 2 of 8
This is thread 1 of 8

 100  0.112109E+07   0.893929   0.158856E-10
 200  0.112109E+07   3.63376   -0.189249E-10
 300  0.112108E+07   8.23009   -0.115308E-09
 400  0.112107E+07  14.7004    -0.304673E-09
 500  0.112107E+07  23.0692    -0.619612E-09
 600  0.112106E+07  33.3684    -0.109145E-08
 700  0.112104E+07  45.6372    -0.175113E-08
 800  0.112103E+07  59.9221    -0.263624E-08
 900  0.112101E+07  76.2778    -0.377840E-08
1000  0.112099E+07  94.7666    -0.521232E-08

MD
Normal end of execution.

September 19 2011 4:33:37.909 PM
174.90user 0.04system 0:22.01elapsed 794%CPU (0avgtext+0avgdata 8768maxresident)k
0inputs+0outputs (0major+606minor)pagefaults 0swaps

For 8 threads, we are up to S(8) ≈ 6.9.


[k14n08b:~/d_omp]$ export OMP_NUM_THREADS=12
[k14n08b:~/d_omp]$ /usr/bin/time ./md1500.omp.x
September 19 2011 4:34:02.638 PM

MD
A molecular dynamics program.
This is thread 0 of 12
This is thread 4 of 12
...
This is thread 11 of 12
This is thread 3 of 12
This is thread 2 of 12
This is thread 1 of 12

 100  0.112109E+07   0.893929   0.158859E-10
 200  0.112109E+07   3.63376   -0.189249E-10
 300  0.112108E+07   8.23009   -0.115308E-09
 400  0.112107E+07  14.7004    -0.304674E-09
 500  0.112107E+07  23.0692    -0.619612E-09
 600  0.112106E+07  33.3684    -0.109145E-08
 700  0.112104E+07  45.6372    -0.175112E-08
 800  0.112103E+07  59.9221    -0.263624E-08
 900  0.112101E+07  76.2778    -0.377840E-08
1000  0.112099E+07  94.7666    -0.521232E-08

MD
Normal end of execution.

September 19 2011 4:34:17.169 PM
173.98user 0.00system 0:14.60elapsed 1191%CPU (0avgtext+0avgdata 17712maxresident)k
0inputs+0outputs (0major+680minor)pagefaults 0swaps

For 12 threads (this is a 12-core node), we are up to S(12) ≈ 10.4.
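(These speedups are simply ratios of elapsed times against the serial build: S(12) ≈ 151.3s / 14.60s ≈ 10.4, S(8) ≈ 151.3s / 22.01s ≈ 6.9, and the one-thread "speedup" of 0.87 is 151.3s / 173.1s, i.e. the OpenMP overhead.)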
