
Practical Issues in OpenMP

M. D. Jones, Ph.D.

Center for Computational Research
University at Buffalo, State University of New York

High Performance Computing I, 2012


Scheduling: Load Balance

Loop Scheduling

The way in which iterations of a parallel loop get assigned to threads is determined by the loop's schedule.

Default scheduling typically assumes an equal load balance, but it is frequently the case that different iterations have entirely different computational loads.

Load imbalance can cause significant synchronization delays.


Static vs. Dynamic Scheduling

Basic distinction of loop scheduling:

Static: iteration assignment to threads is determined as a function of iteration/thread number

Dynamic: assignment can vary at run-time; iterations are handed out to threads as they complete previously assigned iterations

Iterations in both schemes can be assigned in chunks


SCHEDULE Clause

The general form of the SCHEDULE clause:


schedule(type [,chunk])

where type can be one of:

static: without chunk, threads are given an equally sized subdivision of the iterations (exact placement is implementation-dependent); with chunk, the iterations are divided into chunk-sized pieces, and allocation of any remainder is implementation-dependent

dynamic: iterations are divided into chunks (the default chunk size is one if chunk is not present) and assigned to threads dynamically at run-time


guided: the first chunk size is determined by the implementation, then subsequently decreased exponentially (the rate is implementation-dependent) down to the minimum size specified by chunk (default 1)

runtime: chunk must not appear; the schedule is determined by the value of the environment variable OMP_SCHEDULE

auto: (OpenMP 3.0) gives the implementation the freedom to choose the best mapping of iterations to threads
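As a minimal sketch of the syntax (my example, not from the original slides; the array size and chunk values are illustrative), the clause attaches to the work-shared loop, and schedule(runtime) defers the choice to OMP_SCHEDULE:

#include <omp.h>

#define N 100000

void scale(double *a, double *b)
{
    int i;

    /* hand iterations out on demand in chunks of 16 */
    #pragma omp parallel for schedule(dynamic, 16)
    for (i = 0; i < N; i++) {
        a[i] *= 2.0;
    }

    /* defer the choice to run-time, e.g. OMP_SCHEDULE="guided,4" */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < N; i++) {
        b[i] += a[i];
    }
}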


Scheduling Considerations

Things to consider when choosing between scheduling options:

Dynamic schedules can better balance the load between threads, but typically have higher overhead costs (synchronization costs per chunk)

Guided schedules have the advantage of typically requiring fewer chunks (which translates to fewer synchronizations); typically the initial chunk size is roughly the number of iterations divided by the number of threads

Simple static has the lowest overhead, but is the most susceptible to load imbalance


OpenMP Issues & Gotchas: Nesting

Easy to Use?

OpenMP does not force the programmer to explicitly manage communication or how the program data is mapped onto individual processors - sounds great ...

An OpenMP program can easily run into common SMP programming errors, usually from resource contention issues.


Directive Nesting

DO/for, SECTIONS, SINGLE, and WORKSHARE directives that bind to the same parallel region are not allowed to be nested.

DO/for, SECTIONS, SINGLE, and WORKSHARE directives are not allowed in the dynamic extent of CRITICAL, ORDERED, and MASTER directives.

BARRIER and MASTER are not permitted in the dynamic extent of DO/for, SECTIONS, SINGLE, WORKSHARE, MASTER, CRITICAL, and ORDERED directives.

ORDERED must appear in the dynamic extent of a DO or PARALLEL DO with an ORDERED clause; ORDERED is not allowed in the dynamic extent of SECTIONS, SINGLE, WORKSHARE, CRITICAL, and MASTER (see the sketch below).
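For instance, the last rule means an ORDERED construct must bind to a loop that carries the ORDERED clause; a minimal C sketch (my example, not from the original slides):

#include <stdio.h>

void print_in_order(int n, const double *a)
{
    int i;
    /* ORDERED is only legal inside a loop with the ordered clause */
    #pragma omp parallel for ordered
    for (i = 0; i < n; i++) {
        double v = a[i] * a[i];    /* may execute out of order */
        #pragma omp ordered
        printf("%d %f\n", i, v);   /* emitted in loop order */
    }
}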


OpenMP Issues & Gotchas: Shared vs. Private

Data Storage Defaults

Most variables are SHARED by default:

Fortran: common blocks, SAVE variables, MODULE variables.
C: file-scope variables, static variables.

with some exceptions ...

stack variables in sub-programs called from a PARALLEL region
automatic variables within a statement block
loop indices (in C, just on "work-shared" loops)

A small C sketch of these defaults appears below.
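In this sketch (my example, not from the original slides), the file-scope variable is shared, while the stack variable in the called routine and the work-shared loop index are private:

#include <stdio.h>
#include <omp.h>

int counter = 0;                 /* file scope: SHARED by default */

void work(void)
{
    int tmp;                     /* stack variable in a routine called from
                                    a parallel region: PRIVATE per thread */
    tmp = 1 + omp_get_thread_num();
    #pragma omp atomic
    counter += tmp;              /* shared data still needs protection */
}

int main(void)
{
    int i;                       /* index of a work-shared loop: private */
    #pragma omp parallel for
    for (i = 0; i < 8; i++)
        work();
    printf("counter = %d\n", counter);
    return 0;
}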


Data Storage Gotchas

Assumed-size and assumed-shape arrays cannot be privatized.

Fortran allocatable arrays (and pointers) can be PRIVATE or SHARED, but not FIRSTPRIVATE or LASTPRIVATE.

Constituent elements of a PRIVATE (FIRSTPRIVATE/LASTPRIVATE) named common block cannot be declared in another data scope clause.

Privatized elements of shared common blocks are no longer storage-equivalent with the common block.


OpenMP Issues & Gotchas: Synchronization & Barriers

Synchronization Awareness

Implied Barriers:

1 END PARALLEL
2 END DO (unless NOWAIT)
3 END SECTIONS (unless NOWAIT)
4 END CRITICAL
5 END SINGLE (unless NOWAIT)


Implied Flushes:

1 BARRIER
2 CRITICAL/END CRITICAL
3 END DO
4 END PARALLEL
5 END SECTIONS
6 END SINGLE
7 ORDERED/END ORDERED


Synchronization Costs

Overhead for synchronization on an SGI Origin 2000 (MIPS 250MHz R10000 processors):

Nthreads   PARALLEL [µs]   DO [µs]   ATOMIC [µs]   REDUCTION [µs]
    1           2.0          2.3        0.1             2.1
    2           8.4          7.8        0.4            11.0
    4          11.6          6.8        1.5            20.7
    8          28.0         14.1        3.1            31.0

10µs? Isn't that pretty small?

10µs × 250MHz = 2500 clock cycles of lost computation.


Synchronization Costs (cont’d)

Overhead for synchronization on an SGI Altix 3700 (Intel 1300MHz Itanium2 processors):

Nthreads   PARALLEL [µs]   DO [µs]   ATOMIC [µs]   REDUCTION [µs]
    1           0.3          0.3        0.1             0.5
    2           2.3          2.1        0.4             2.6
    4           5.9          4.7        0.4             9.6
    8           6.6          6.8        0.5            24.1
   16          10.3         10.7        0.6            60.7
   32          19.2         19.3        0.7           132
   64          41.8         40.9        0.7           316

10µs? Isn't that pretty small?

10µs × 1300MHz = 13000 clock cycles of lost computation.


Synchronization Costs (cont’d)

Overhead for synchronization on an Intel "Clovertown" (dual quad-core 1.866GHz Xeon processors):

Nthreads   PARALLEL [µs]   DO [µs]   ATOMIC [µs]   REDUCTION [µs]
    1           0.2          0.2        0.02            0.2
    2           1.6          1.7        0.08            2.0
    4           2.3          2.4        0.14            3.1
    8           3.8          3.9        0.52            5.8

5.8µs × 1866MHz = 10823 clock cycles of lost computation.

Overhead for synchronization on an Intel "Nehalem" (dual quad-core 2.8GHz Xeon processors):

Nthreads   PARALLEL [µs]   DO [µs]   ATOMIC [µs]   REDUCTION [µs]
    1           0.1          0.1        0.01            0.1
    2           1.1          1.1        0.04            1.2
    4           1.2          1.2        0.05            1.5
    8           1.7          1.8        0.05            2.5

2.5µs × 2800MHz = 7000 clock cycles of lost computation.


Synchronization Costs (cont’d)

Overhead for synchronization on a 32-core Intel "Westmere" 2130MHz system (4 sockets, 8 cores/socket, Xeon E7-4530):

Nthreads   PARALLEL [µs]   DO [µs]   ATOMIC [µs]   REDUCTION [µs]
    1           0.1          0.2        0.02            0.2
    2           2.2          2.3        0.04            2.7
    4           2.9          3.1        0.07            4.2
    8           3.9          3.9        0.07            6.8
   16           4.8          5.2        0.07           12.2
   32          15.4          6.9        0.07           24.9

25µs × 2130MHz = 53250 clock cycles of lost computation.

Not exactly great progress ...


Common Errors

Race conditions: the outcome of the program depends on the detailed scheduling of the thread team ("the answer is different every time I run the code!").

Deadlock: threads wait forever for a locked resource to become free.


Race Conditions

What is wrong with this code fragment?

real tmp, x
!$OMP PARALLEL
!$OMP DO REDUCTION(+:x)
do i = 1, 10000
   tmp = dosomework(i)
   x = x + tmp
end do
!$OMP END DO
y(iam) = work(x, iam)
!$OMP END PARALLEL


The programmer did not make tmp PRIVATE; hence the results are unpredictable.


What about now?

real tmp, x
!$OMP PARALLEL
!$OMP DO REDUCTION(+:x), PRIVATE(tmp)
do i = 1, 10000
   tmp = dosomework(i)
   x = x + tmp
end do
!$OMP END DO NOWAIT
y(iam) = work(x, iam)
!$OMP END PARALLEL


The value of x is not dependable without the barrier at the end of the DO construct - be careful with NOWAIT!


Deadlock

A somewhat artificial example of deadlock - watch that resources are freed if you are using locks!

call OMP_INIT_LOCK(lock0)
!$OMP PARALLEL SECTIONS
!$OMP SECTION
call OMP_SET_LOCK(lock0)
iret = dolotsofwork()
if (iret .le. tol) then
   call OMP_UNSET_LOCK(lock0)
else
   call error(iret)
endif
!$OMP SECTION
call OMP_SET_LOCK(lock0)
call compute(A, B, iret)
call OMP_UNSET_LOCK(lock0)
!$OMP END PARALLEL SECTIONS


OpenMP Issues & Gotchas: Load Balancing


Load Balancing

Consider the following code fragment - can you see why it is not efficient to parallelize on the outer loop?

do i = 1, N
   do j = 1, i
      a(j,i) = a(j,i) + b(j)*c(i)
   end do
end do


One strategy - break the loop up into interleaved chunks:

!$OMP PARALLEL SHARED(num_threads)
!$OMP SINGLE
num_threads = OMP_GET_NUM_THREADS()
!$OMP END SINGLE NOWAIT
!$OMP END PARALLEL

!$OMP PARALLEL DO PRIVATE(i,j,k)
do k = 1, num_threads
   do i = k, n, num_threads
      do j = 1, i
         a(j,i) = a(j,i) + b(j)*c(i)
      end do
   end do
end do


Another equivalent (and somewhat cleaner!) way,

!$OMP PARALLEL DO PRIVATE(i,j) SCHEDULE(static,4)
do i = 1, n
   do j = 1, i
      a(j,i) = a(j,i) + b(j)*c(i)
   end do
end do


Toward Coarser Grains

What is wrong with fine grain (loop) parallelism?

Overhead kills performance

Not scalable to a large number of threads - remember Amdahl's law:

S(N_p) = (τ_s + τ_p) / (τ_s + τ_p / N_p) = 1 / (s + (1 − s)/N_p),

where τ_s and τ_p are the serial and parallelizable execution times and s = τ_s/(τ_s + τ_p) is the serial fraction.
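For example (my numbers, purely illustrative): a serial fraction of s = 0.05 caps the speedup at 1/s = 20 however many threads are used, and at N_p = 8 already yields only S(8) = 1/(0.05 + 0.95/8) ≈ 5.9.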


OpenMP Issues & Gotchas: Coarsening


Coarsening

Strategies for increasing OpenMP performance:

do more work per parallel region, and decrease the fraction of time spent in sequential code

reduce synchronization across threads

combine multiple PARALLEL DO directives into a larger parallel region (with work-sharing constructs therein), as sketched below
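A minimal sketch of the last point (my example, not from the original slides): two back-to-back parallel loops share one parallel region, so the thread team is forked and joined once instead of twice:

#include <omp.h>

void update(int n, double *a, double *b)
{
    /* one parallel region, two work-shared loops inside it */
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * i;

        #pragma omp for           /* implied barrier keeps a[] consistent */
        for (int i = 0; i < n; i++)
            b[i] = a[i] + 1.0;
    }
}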


Domain Decomposition 

Break the data domain into sub-domains,

Compute the loop bounds once, depending on the number of threads (a priori loop decomposition),

Reduces loop overhead, but shifts the burden from the compiler back to the programmer,

Implements the Single Program Multiple Data (SPMD) model.


Coarse Grain SPMD Example

program spmd
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(N,global)
   num_threads = OMP_GET_NUM_THREADS()
   iam = OMP_GET_THREAD_NUM()
   ichunk = N / num_threads
   ibegin = iam*ichunk
   iend = ibegin + ichunk - 1
   call lotsofwork(ibegin, iend, local)
!$OMP ATOMIC
   global = global + local
!$OMP END PARALLEL
   print *, global
end program spmd


[Figure: schematic of the spmd program - one !$OMP PARALLEL region with DEFAULT(PRIVATE) and SHARED(N,global); each thread holds its own private copies of ibegin, iend, and local.]


SPMD Implementation

Manual decomposition - valid for any number of threads (make sure that the cost/benefit ratio is high enough!)

Same program on each thread, but a different (PRIVATE) sub-domain of the program data.

Synchronization is necessary to handle global variable updates (ATOMIC is usually more efficient than CRITICAL).


OpenMP Issues & Gotchas: Thread-safety


Thread Safety Issues

Certainly one must be careful about hidden state issues when calling functions/routines from multiple threads:

MPI - check your level of thread safety with MPI_Init_thread and program accordingly (see the sketch below)

Other functions - it is up to you to check and ensure that functions are thread-safe (there is danger in treating any function as a black box)
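A minimal C sketch of the MPI check (my example, not from the original slides):

#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;

    /* request full thread support; the library reports what it grants */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        /* e.g. only MPI_THREAD_FUNNELED: confine MPI calls to the
           master thread, outside of (or guarded within) parallel regions */
    }

    MPI_Finalize();
    return 0;
}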


Thread-safe Example


From the rand man page (section 3, RHEL 5):

#include <stdlib.h>

int rand(void);
int rand_r(unsigned int *seedp);
void srand(unsigned int seed);

DESCRIPTION
The rand() function returns a pseudo-random integer between 0 and RAND_MAX.

The srand() function sets its argument as the seed for a new sequence of pseudo-random integers to be returned by rand(). These sequences are repeatable by calling srand() with the same seed value.

If no seed value is provided, the rand() function is automatically seeded with a value of 1.

The function rand() is not reentrant or thread-safe, since it uses hidden state that is modified on each call. This might just be the seed value to be used by the next call, or it might be something more elaborate. In order to get reproducible behaviour in a threaded application, this state must be made explicit. The function rand_r() is supplied with a pointer to an unsigned int, to be used as state. This is a very small amount of state, so this function will be a weak pseudo-random generator. Try drand48_r(3) instead.
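To make that state explicit in an OpenMP code, each thread can carry its own seed; a minimal sketch (my example, not from the original slides; rand_r is a POSIX function):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* per-thread state: thread-safe and reproducible, unlike rand(),
           whose hidden state is shared by all threads */
        unsigned int seed = 1234u + (unsigned int) omp_get_thread_num();
        printf("thread %d drew %d\n", omp_get_thread_num(), rand_r(&seed));
    }
    return 0;
}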


OpenMP Issues & Gotchas: C/C++ Max/Min


Lack of Max/Min in C/C++

Due to the lack of an intrinsic max/min function in C/C++, there is no built-in max/min reduction operator in OpenMP for these languages. One way around this is to have each thread track its own max/min value, and then update the global max/min under a protective directive:

#pragma omp parallel private(my_amax)
{
   amax = 0;
   my_amax = 0;
   /* use a private variable for the max per thread */
   #pragma omp for
   for (i = 0; i <= N; i++) {
      if (a[i] > my_amax) {
         my_amax = a[i];
      }
   }
   /* global update: requires only num_threads critical evaluations */
   #pragma omp critical
   if (my_amax > amax) {
      amax = my_amax;
   }
}


Max/Min with Locks


Another way to do max/min, this time with OpenMP locks:

omp_lock_t MAXLOCK;

omp_init_lock(&MAXLOCK);

#pragma omp parallel for
for (i = 0; i < Numberofelements; i++) {
   if (array[i] > cur_max) {      /* cheap unlocked test first ... */
      omp_set_lock(&MAXLOCK);
      if (array[i] > cur_max) {   /* ... then re-test under the lock */
         cur_max = array[i];
      }
      omp_unset_lock(&MAXLOCK);
   }
}

/* destroy the lock */
omp_destroy_lock(&MAXLOCK);


Example - Compare Max/Min with Critical vs. Lock


Compare the two methods - find the max in a randomly seeded array of varying size, serially and with both the OpenMP critical and lock methods (the outcome should be pretty obvious from the two coding examples, but you can tinker with them to make the distinction less clear).


OpenMP Issues & Gotchas: Thread Affinity

Thread Affinity


There are many times in which you may wish or need to specify how your compute threads get mapped to the physical (do not confuse physical with logical here) CPU cores:

Contention for cache memory

Contention for network interfaces (especially when combined with message-passing)


GNU Options


The GNU compilers currently support an option for CPU affinity, as well as an option for adjusting the available stack space per thread:

GOMP_CPU_AFFINITY: a space-separated or comma-separated list of CPUs; entries are single CPU numbers in any order, a range of CPUs (M-N), or a range with a stride (M-N:S). Note that cores are counted starting from 0, and that this view of CPU cores reflects that of the operating system, which in many cases is not the full picture of the underlying hardware topology.

GOMP_STACKSIZE: sets the default thread stack size in kilobytes.
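For example, on a hypothetical 8-core node (bash syntax; the values are illustrative only):

export GOMP_CPU_AFFINITY="0 2 4 6"   # pin threads to the even-numbered cores
export GOMP_CPU_AFFINITY="0-6:2"     # the same set, written as a strided range
export GOMP_STACKSIZE=8192           # 8 MB of stack per thread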


Intel Options


The Intel compilers support a much richer set of utilities for controlling the placement of threads:

KMP_AFFINITY is the environment variable used, although the Intel run-time will also respect the GOMP_CPU_AFFINITY variable at a lower level of precedence.

Details can be found (and are frequently changed as the architecture evolves) in the compiler documentation; the latest as of this writing is at:
http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/fortran/win/compiler_f/optaps/common/optaps_openmp_thread_affinity.htm

Your best bet is to review the documentation for the compiler that you are trying to use.

Extremely helpful when simultaneous multi-threading (also known as hyper-threading) is turned on.
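A few illustrative settings (the exact keywords are version-dependent, so treat these as sketches and check the documentation for your compiler):

export KMP_AFFINITY="granularity=fine,compact"             # pack threads onto adjacent cores
export KMP_AFFINITY="granularity=fine,scatter"             # spread threads across sockets
export KMP_AFFINITY="verbose,explicit,proclist=[0,2,4,6]"  # explicit list, with a placement report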


OpenMP Issues & Gotchas: OpenMP vs. MPI

Advantages over Message Passing


Domain decomposition methodology is the same, but implementing it in OpenMP can be easier, as global data can be read without any need for synchronization or message passing.

Parallelize only the parts of the code that require it (profiling is key!).

Pre- and post-processing can be left sequential.


Best of Both Worlds?


How about combining OpenMP with Message Passing?

Message Passing between machines, OpenMP within.

Allow application-dependent mixing within a shared memory environment.

Coarse grain with Message Passing, fine grain with OpenMP.
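A minimal hybrid sketch (my example, not from the original slides): one MPI rank per machine, OpenMP threads within it, with MPI traffic funneled through the master thread:

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* fine-grain OpenMP work goes here ... */
        #pragma omp master
        printf("rank %d running %d threads\n", rank, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}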


Practical OpenMP: Compiler Support

Platforms & Compilers


This table lists the various compiler suites available on the production computing platforms along with their OpenMP compliance:

Platform       Compiler                 OMP                                  Invocation
Linux IA64     Gnu (g77/gcc/g++)        No                                   -
               Intel (ifort/icc/icpc)   2.5                                  -openmp -openmp_report
Linux x86_64   Gnu* (g77/gcc/g++)       2.5 (>4.1), 3.0 (>4.4), 3.1 (>4.7)   -fopenmp
               PGI (pgf90/pgcc/pgCC)    2.5, 3.0 (≥12.0)                     -mp
               Intel (ifort/icc/icpc)   2.5, 3.0 (≥11.0), 3.1 (≥12.1)        -openmp -openmp_report

* The Gnu compiler suite supports OpenMP for versions >4.2, although some Linux distributions (e.g. RedHat) have backported support to 4.1.


Practical OpenMP: Example - Simple

Simple OpenMP example


program simple
  USE omp_lib  ! comment out for pgf90 - if not openmp 2.0 compliant
  implicit none
  integer :: myid, nthreads, nprocs
  ! include this declaration for pgf90
  ! integer :: OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM, OMP_GET_NUM_PROCS

!$OMP PARALLEL default(none) private(myid) &
!$OMP shared(nthreads,nprocs)
  !
  ! Determine the number of threads and their id
  !
  myid = OMP_GET_THREAD_NUM()
  nthreads = OMP_GET_NUM_THREADS()
  nprocs = OMP_GET_NUM_PROCS()
!$OMP BARRIER
  if (myid == 0) print *, 'Number of available processors: ', nprocs
  print *, 'myid = ', myid, ' nthreads ', nthreads
!$OMP END PARALLEL
end program simple


Altix - simple example


[jonesm@lennon ~/d_omp]$ module load intel
[jonesm@lennon ~/d_omp]$ ifort -O3 -o simple_ifort -openmp -openmp_report2 simple.f90
simple.f90(19): (col. 6) remark: OpenMP multithreaded code generation BARRIER was successful.
simple.f90(9): (col. 6) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
[jonesm@lennon ~/d_omp]$ setenv OMP_NUM_THREADS 4
[jonesm@lennon ~/d_omp]$ ./simple_ifort
myid = 1 nthreads 4
myid = 3 nthreads 4
myid = 2 nthreads 4
Number of available processors: 4
myid = 0 nthreads 4


U2 - simple example


[jonesm@bono ~/d_omp]$ module load intel
[jonesm@bono ~/d_omp]$ ifort -O3 -o simple_ifort -openmp simple.f90
[jonesm@bono ~/d_omp]$ setenv OMP_NUM_THREADS 4
[jonesm@bono ~/d_omp]$ ./simple_ifort
Number of available processors: 4
myid = 1 nthreads 4
myid = 0 nthreads 4
myid = 2 nthreads 4
myid = 3 nthreads 4
[jonesm@bono ~/d_omp]$ module load pgi
[jonesm@bono ~/d_omp]$ pgf90 -O3 -mp -o simple_pgi simple.f90
[jonesm@bono ~/d_omp]$ ./simple_pgi
Number of available processors: 4
myid = 0 nthreads 4
myid = 3 nthreads 4
myid = 1 nthreads 4
myid = 2 nthreads 4


U2 - simple example


[k07n14:~/d_omp]$ gcc -fopenmp -o hello2 hello2.c
[k07n14:~/d_omp]$ export OMP_NUM_THREADS=1
[k07n14:~/d_omp]$ ./hello2
Hello World from thread 0
There are 1 threads
[k07n14:~/d_omp]$ export OMP_NUM_THREADS=4
[k07n14:~/d_omp]$ ./hello2
Hello World from thread 1
Hello World from thread 3
Hello World from thread 0
Hello World from thread 2
There are 4 threads
[k07n14:~/d_omp]$ ./simple_ifort
myid = 3 nthreads 4
myid = 2 nthreads 4
Number of available processors: 64
myid = 0 nthreads 4
myid = 1 nthreads 4


Practical OpenMP: Example - Molecular Dynamics

MD Sample Code


Let's take this as a trial of parallelizing a real code:

Take the sample MD code from www.openmp.org

Modify it slightly for our environment (uncomment the line for use omp_lib, add conditional compilation for the API function calls ...)

Then do a quick profile to see where the code is spending its time ...


Compile for quick profiling and run to generate run-time profile:

1 [ k 08n08a : ~ / d_omp ] $ i f o r t −O3 −o md1500−p g . x −g −p md1500. f902 [ k 08 n0 8a : ~ / d_omp ] $ / u s r / b i n / t i m e . / md1500−p g . x3 September 19 2011 1 : 39 : 35 . 15 6 PM

45 MD6 A m ol ec ul ar dynamics program .789 100 0.112109E+07 0.893929 0.158956E−10

10 200 0.112109E+07 3.63376 −0.189220E−1011 300 0.112108E+07 8.23009 −0.115307E−0912 400 0.112107E+07 14.7004 −0.304663E−09

13 500 0.112107E+07 23.0692 −0.619601E−0914 600 0.112106E+07 33.3684 −0.109145E−0815 700 0.112104E+07 45.6372 −0.175113E−0816 800 0.112103E+07 59.9221 −0.263625E−0817 900 0.112101E+07 76.2778 −0.377839E−0818 1000 0.112099E+07 94.7666 −0.521232E−081920 MD21 Normal end of e x ec ut io n .

2223 Sep tem ber 19 2011 1 : 4 2: 1 3 . 39 7 PM24 1 5 8.2 1 u se r 0 .0 0 syste m 2 :3 8 .2 5 e l a p sed 99%CPU (0 a vg t e xt +0 a vg da ta 8 03 2 ma xre si d e n t ) k25 0 i n p u t s +480 o u t p u t s ( 0 m a j o r +539 m i no r ) p a g e f a u l t s 0 swaps


Simple analysis based on profile:

[k08n08a:~/d_omp]$ gprof --line ./md1500-pg.x gmon.out > report-line-md1500.txt
[k08n08a:~/d_omp]$ less report-line-md1500.txt
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls  ns/call  ns/call  name
44.51     62.31     62.31                              __libm_sse2_sincos
18.27     87.89     25.58                              compute (md1500.f90:194 @ 4035b4)
 6.18     96.54      8.65                              compute (md1500.f90:194 @ 40359f)
 6.18    105.18      8.65                              compute (md1500.f90:168 @ 4035a9)
 3.63    110.26      5.08  2250748500  2.25   2.25     dist_ (md1500.f90:266 @ 403680)
 3.20    114.74      4.48                              compute (md1500.f90:167 @ 40355d)
 3.11    119.10      4.36                              compute (md1500.f90:167 @ 403544)
 2.55    122.66      3.57                              compute (md1500.f90:192 @ 403561)
 2.06    125.55      2.89                              compute (md1500.f90:192 @ 403551)
 1.66    127.87      2.32                              compute (md1500.f90:194 @ 403569)
 1.39    129.82      1.95                              compute (md1500.f90:188 @ 403521)
 1.10    131.37      1.55                              dist (md1500.f90:300 @ 4036c1)


... and now let us take a look at the critical code sections,

164 !  This potential is a harmonic well which smoothly saturates to a
165 !  maximum value at PI/2.
166 !
167   v(x) = (sin(min(x,PI2)))**2
168   dv(x) = 2.0D+00 * sin(min(x,PI2)) * cos(min(x,PI2))
169
170   pot = 0.0D+00
171   kin = 0.0D+00


and, not too surprisingly, it is the loop over particles that updates forces and momenta that is responsible for most of the consumed time:

178   do i = 1, np
179 !
180 !  Compute the potential energy and forces.
181 !
182     f(1:nd,i) = 0.0D+00
183
184     do j = 1, np
185
186       if ( i /= j ) then
187
188         call dist ( nd, pos(1,i), pos(1,j), rij, d )
189 !
190 !  Attribute half of the potential energy to particle J.
191 !
192         pot = pot + 0.5D+00 * v(d)
193
194         f(1:nd,i) = f(1:nd,i) - rij(1:nd) * dv(d) / d


Adding OpenMP directives to this loop:

173 !$OMP parallel do &
174 !$OMP default(shared) &
175 !$OMP shared(nd) &
176 !$OMP private(i,j,rij,d) &
177 !$OMP reduction(+ : pot, kin)
178   do i = 1, np
179 !
180 !  Compute the potential energy and forces.
181 !
182     f(1:nd,i) = 0.0D+00
183
184     do j = 1, np
185
186       if ( i /= j ) then
187
188         call dist ( nd, pos(1,i), pos(1,j), rij, d )
189 !
190 !  Attribute half of the potential energy to particle J.
191 !
192         pot = pot + 0.5D+00 * v(d)
193
194         f(1:nd,i) = f(1:nd,i) - rij(1:nd) * dv(d) / d


Using these OpenMP directives, what kind of speedup can we get?


[k14n08b:~/d_omp]$ module load intel/11.1
[k14n08b:~/d_omp]$ ifort -O3 -o md1500.no-omp.x md1500.f90
[k14n08b:~/d_omp]$ ifort -O3 -openmp -openmp_report2 -o md1500.omp.x md1500.f90
md1500.f90(65): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
md1500.f90(94): (col. 10) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
md1500.f90(173): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
md1500.f90(357): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
[k14n08b:~/d_omp]$ /usr/bin/time ./md1500.no-omp.x
September 19 2011 3:34:34.561 PM

MD
A molecular dynamics program.

 100  0.112109E+07   0.893929   0.158956E-10
 200  0.112109E+07   3.63376   -0.189220E-10
 300  0.112108E+07   8.23009   -0.115307E-09
 400  0.112107E+07  14.7004    -0.304663E-09
 500  0.112107E+07  23.0692    -0.619601E-09
 600  0.112106E+07  33.3684    -0.109145E-08
 700  0.112104E+07  45.6372    -0.175113E-08
 800  0.112103E+07  59.9221    -0.263625E-08
 900  0.112101E+07  76.2778    -0.377839E-08
1000  0.112099E+07  94.7666    -0.521232E-08

MD
Normal end of execution.

September 19 2011 3:37:05.843 PM
151.26user 0.00system 2:31.28elapsed 99%CPU (0avgtext+0avgdata 4688maxresident)k
0inputs+0outputs (0major+329minor)pagefaults 0swaps


[k14n08b:~/d_omp]$ export OMP_NUM_THREADS=1
[k14n08b:~/d_omp]$ /usr/bin/time ./md1500.omp.x
September 19 2011 4:38:48.788 PM

MD
A molecular dynamics program.
This is thread 0 of 1

 100  0.112109E+07   0.893929   0.158956E-10
 200  0.112109E+07   3.63376   -0.189222E-10
 300  0.112108E+07   8.23009   -0.115307E-09
 400  0.112107E+07  14.7004    -0.304663E-09
 500  0.112107E+07  23.0692    -0.619601E-09
 600  0.112106E+07  33.3684    -0.109145E-08
 700  0.112104E+07  45.6372    -0.175113E-08
 800  0.112103E+07  59.9221    -0.263625E-08
 900  0.112101E+07  76.2778    -0.377839E-08
1000  0.112099E+07  94.7666    -0.521232E-08

MD
Normal end of execution.

September 19 2011 4:41:41.716 PM
172.90user 0.00system 2:53.12elapsed 99%CPU (0avgtext+0avgdata 7632maxresident)k
0inputs+0outputs (0major+519minor)pagefaults 0swaps

So the OpenMP overhead is reflected in S(1) ≈ 0.87.


[k14n08b:~/d_omp]$ export OMP_NUM_THREADS=2
[k14n08b:~/d_omp]$ /usr/bin/time ./md1500.omp.x
September 19 2011 4:22:46.538 PM

MD
A molecular dynamics program.
This is thread 0 of 2
This is thread 1 of 2

 100  0.112109E+07   0.893929   0.158883E-10
 200  0.112109E+07   3.63376   -0.189195E-10
 300  0.112108E+07   8.23009   -0.115309E-09
 400  0.112107E+07  14.7004    -0.304673E-09
 500  0.112107E+07  23.0692    -0.619611E-09
 600  0.112106E+07  33.3684    -0.109145E-08
 700  0.112104E+07  45.6372    -0.175112E-08
 800  0.112103E+07  59.9221    -0.263624E-08
 900  0.112101E+07  76.2778    -0.377839E-08
1000  0.112099E+07  94.7666    -0.521232E-08

MD
Normal end of execution.

September 19 2011 4:24:14.577 PM
175.85user 0.03system 1:28.06elapsed 199%CPU (0avgtext+0avgdata 7920maxresident)k
0inputs+0outputs (0major+539minor)pagefaults 0swaps

For 2 threads, we are up to S(2) ≈ 1.7.


[k14n08b:~/d_omp]$ export OMP_NUM_THREADS=4
[k14n08b:~/d_omp]$ /usr/bin/time ./md1500.omp.x
September 19 2011 4:31:19.409 PM

MD
A molecular dynamics program.
This is thread 0 of 4
This is thread 3 of 4
This is thread 2 of 4
This is thread 1 of 4

 100  0.112109E+07   0.893929   0.158875E-10
 200  0.112109E+07   3.63376   -0.189224E-10
 300  0.112108E+07   8.23009   -0.115307E-09
 400  0.112107E+07  14.7004    -0.304674E-09
 500  0.112107E+07  23.0692    -0.619610E-09
 600  0.112106E+07  33.3684    -0.109145E-08
 700  0.112104E+07  45.6372    -0.175112E-08
 800  0.112103E+07  59.9221    -0.263624E-08
 900  0.112101E+07  76.2778    -0.377839E-08
1000  0.112099E+07  94.7666    -0.521232E-08

MD
Normal end of execution.

September 19 2011 4:32:03.336 PM
175.37user 0.06system 0:44.03elapsed 398%CPU (0avgtext+0avgdata 8176maxresident)k
0inputs+0outputs (0major+560minor)pagefaults 0swaps

For 4 threads, we are up to S(4) ≈ 3.4.


[k14n08b:~/d_omp]$ export OMP_NUM_THREADS=8
[k14n08b:~/d_omp]$ /usr/bin/time ./md1500.omp.x
September 19 2011 4:33:16.000 PM

MD
A molecular dynamics program.
This is thread 0 of 8
This is thread 4 of 8
This is thread 6 of 8
This is thread 5 of 8
This is thread 7 of 8
This is thread 3 of 8
This is thread 2 of 8
This is thread 1 of 8

 100  0.112109E+07   0.893929   0.158856E-10
 200  0.112109E+07   3.63376   -0.189249E-10
 300  0.112108E+07   8.23009   -0.115308E-09
 400  0.112107E+07  14.7004    -0.304673E-09
 500  0.112107E+07  23.0692    -0.619612E-09
 600  0.112106E+07  33.3684    -0.109145E-08
 700  0.112104E+07  45.6372    -0.175113E-08
 800  0.112103E+07  59.9221    -0.263624E-08
 900  0.112101E+07  76.2778    -0.377840E-08
1000  0.112099E+07  94.7666    -0.521232E-08

MD
Normal end of execution.

September 19 2011 4:33:37.909 PM
174.90user 0.04system 0:22.01elapsed 794%CPU (0avgtext+0avgdata 8768maxresident)k
0inputs+0outputs (0major+606minor)pagefaults 0swaps

For 8 threads, we are up to S(8) ≈ 6.9.


[k14n08b:~/d_omp]$ export OMP_NUM_THREADS=12
[k14n08b:~/d_omp]$ /usr/bin/time ./md1500.omp.x
September 19 2011 4:34:02.638 PM

MD
A molecular dynamics program.
This is thread 0 of 12
This is thread 4 of 12
...
This is thread 11 of 12
This is thread 3 of 12
This is thread 2 of 12
This is thread 1 of 12

 100  0.112109E+07   0.893929   0.158859E-10
 200  0.112109E+07   3.63376   -0.189249E-10
 300  0.112108E+07   8.23009   -0.115308E-09
 400  0.112107E+07  14.7004    -0.304674E-09
 500  0.112107E+07  23.0692    -0.619612E-09
 600  0.112106E+07  33.3684    -0.109145E-08
 700  0.112104E+07  45.6372    -0.175112E-08
 800  0.112103E+07  59.9221    -0.263624E-08
 900  0.112101E+07  76.2778    -0.377840E-08
1000  0.112099E+07  94.7666    -0.521232E-08

MD
Normal end of execution.

September 19 2011 4:34:17.169 PM
173.98user 0.00system 0:14.60elapsed 1191%CPU (0avgtext+0avgdata 17712maxresident)k
0inputs+0outputs (0major+680minor)pagefaults 0swaps

For 12 threads (this is a 12-core node), we are up to S(12) ≈ 10.4.
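(These speedups are simply ratios of elapsed times against the serial build: S(12) ≈ 151.3s / 14.60s ≈ 10.4, S(8) ≈ 151.3s / 22.01s ≈ 6.9, and the one-thread "speedup" of 0.87 is 151.3s / 173.1s, i.e. the OpenMP overhead.)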
