237// program design10.3 11 parallel programming models · brid multicore parallel pro-gramming)...

237 // Program Design 10.3 Assessing parallel programs

11 Parallel programming models

Many different models for expressing parallelism in programming languages

• Actor model

– Erlang

– Scala

• Coordination languages

– Linda

• CSP-based (Communicating Se-quential Processes)

– FortranM

– Occam

• Dataflow

– SISAL (Streams and Itera-tion in a Single AssignmentLanguage)

• Distributed

– Sequoia

– Bloom

• Event-driven and hardware de-scription

238 // Program Design 10.3 Assessing parallel programs

– Verilog hardware descriptionlanguage (HDL)

• Functional

– Concurrent Haskell

• GPU languages

– CUDA

– OpenMP

– OpenACC

– OpenHMPP (HMPP for Hy-brid Multicore Parallel Pro-gramming)

• Logic programming

– Parlog

• Multi-threaded

– Clojure

• Object-oriented

– Charm++

– Smalltalk

• Message-passing

– MPI

– PVM

• Partitioned global address space(PGAS)

– High Performance Fortran(HPF)

239 // programming models 11.1 HPF

11.1 HPF

Partitioned global address space parallel programming modelFortran90 extension

• SPMD (Single Program Multiple Data) model

• each process operates with its own part of data

• HPF commands specify which processor gets which part of the data

• Concurrency is defined by HPF commands based on Fortran90

• HPF directives as comments:!HPF$ <directive>


HPF example: matrix multiplication� �PROGRAM ABmult

IMPLICIT NONEINTEGER , PARAMETER : : N = 100INTEGER , DIMENSION (N,N) : : A, B , CINTEGER : : i , j

! HPF$ PROCESSORS s q u a r e ( 2 , 2 )! HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO s q u a r e : : C! HPF$ ALIGN A( i , ∗ ) WITH C( i , ∗ )! r e p l i c a t e c o p i e s o f row A ( i , ∗ ) on to proc . s which compute C( i , j )! HPF$ ALIGN B(∗ , j ) WITH C(∗ , j )! r e p l i c a t e c o p i e s o f c o l . B (∗ , j ) ) on to proc . s which compute C( i , j )

A = 1B = 2C = 0DO i = 1 , N

DO j = 1 , N! A l l t h e work i s l o c a l due t o ALIGNsC( i , j ) = DOT_PRODUCT(A( i , : ) , B ( : , j ) )

END DOEND DOWRITE(∗ ,∗ ) CEND�


HPF programming methodology

• Need to find balance between concurrency and communication

• the more processes the more communication

• aiming to

– find balanced load based from the owner calculates rule

– data locality

Easy to write a program in HPF but difficult to gain good efficiencyProgramming in HPF technique is more or less like this:

1. Write a correctly working serial program, test and debug it

2. add distribution directives introducing as less as possible communication

242 // programming models 11.2 OpenMP

11.2 OpenMP

OpenMP tutorial (http://www.llnl.gov/computing/tutorials/openMP/)

• Programming model based on thread parallelism

• Helping tool for a programmer

• Based on compiler directivesC/C++, Fortran*

• Nested Parallelism is supported(though, not all implementations support it)

• Dynamic threads

• OpenMP has become a standard

http://www.llnl.gov/computing/tutorials/openMP/

http://www.llnl.gov/computing/tutorials/openMP/


• Fork-Join model


OpenMP Example: Matrix Multiplication:� �! h t t p : / / www. l l n l . gov / comput ing / t u t o r i a l s / openMP / e x e r c i s e . h tm lC∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗C OpenMp Example − Ma t r ix M u l t i p l y − F o r t r a n V e r s i o nC FILE : omp_mm . fC DESCRIPTION :C D e m o n s t r a t e s a m a t r i x m u l t i p l y u s i n g OpenMP . Threads s h a r e row i t e r a t i o n sC a c c o r d i n g to a p r e d e f i n e d chunk s i z e .C LAST REVISED : 1 / 5 / 0 4 B l a i s e BarneyC∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗

PROGRAM MATMULT

INTEGER NRA, NCA, NCB, TID , NTHREADS, I , J , K, CHUNK,+ OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM

C number of rows in m a t r i x APARAMETER (NRA=62)

C number of columns in m a t r i x APARAMETER (NCA=15)

C number of columns in m a t r i x BPARAMETER (NCB=7)

REAL∗8 A(NRA,NCA) , B(NCA,NCB) , C(NRA,NCB)


C S e t loop i t e r a t i o n chunk s i z eCHUNK = 10

C Spawn a p a r a l l e l r e g i o n e x p l i c i t l y s c o p i n g a l l v a r i a b l e s!$OMP PARALLEL SHARED(A, B , C ,NTHREADS,CHUNK) PRIVATE ( TID , I , J ,K)

TID = OMP_GET_THREAD_NUM( )IF ( TID .EQ . 0 ) THEN

NTHREADS = OMP_GET_NUM_THREADS( )PRINT ∗ , ’ S t a r t i n g m a t r i x m u l t i p l e example wi th ’ , NTHREADS,

+ ’ t h r e a d s ’PRINT ∗ , ’ I n i t i a l i z i n g m a t r i c e s ’

END IF

C I n i t i a l i z e m a t r i c e s!$OMP DO SCHEDULE( STATIC , CHUNK)

DO 30 I =1 , NRADO 30 J =1 , NCA

A( I , J ) = ( I−1) +( J−1)30 CONTINUE

!$OMP DO SCHEDULE( STATIC , CHUNK)DO 40 I =1 , NCA

DO 40 J =1 , NCBB( I , J ) = ( I−1)∗ ( J−1)

40 CONTINUE


!$OMP DO SCHEDULE( STATIC , CHUNK)DO 50 I =1 , NRA

DO 50 J =1 , NCBC( I , J ) = 0

50 CONTINUE

C Do m a t r i x m u l t i p l y s h a r i n g i t e r a t i o n s on o u t e r l oopC D i s p l a y who does which i t e r a t i o n s f o r d e m o n s t r a t i o n p u r p o s e s

PRINT ∗ , ’ Thread ’ , TID , ’ s t a r t i n g m a t r i x m u l t i p l y . . . ’!$OMP DO SCHEDULE( STATIC , CHUNK)

DO 60 I =1 , NRAPRINT ∗ , ’ Thread ’ , TID , ’ d i d row ’ , I

DO 60 J =1 , NCBDO 60 K=1 , NCA

C( I , J ) = C( I , J ) + A( I ,K) ∗ B(K, J )60 CONTINUE

C End of p a r a l l e l r e g i o n!$OMP END PARALLEL

C P r i n t r e s u l t sPRINT ∗ , ’∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗ ’PRINT ∗ , ’ R e s u l t Ma t r i x : ’DO 90 I =1 , NRA


DO 80 J =1 , NCBWRITE(∗ , 7 0 ) C( I , J )

70 FORMAT(2 x , f8 . 2 , $ )80 CONTINUE

PRINT ∗ , ’ ’90 CONTINUE

PRINT ∗ , ’∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗ ’PRINT ∗ , ’ Done . ’

END�

248 // programming models 11.3 GPGPU

11.3 GPGPU

General-purpose computing on graphics processing units

11.3.1 CUDA

• Parallel programming platform andmodel created by NVIDIA

• CUDA gives access to

– virtual instruction set

– memory on parallel computa-tional elements on GPUs

• based on executing a large numberof threads simultaneously

• CUDA-accelerated libraries

• CUDA-extended programming lan-guages (like nvcc)

– C, C++, Fortran


CUDA Fortran Matrix Multiplication example (for more information, see http://www.pgroup.com/lit/articles/insider/v1n3a2.htm)

Host (CPU) code:

� �s u b r o u t i n e mmul ( A, B , C )

use c u d a f o rrea l , dimension ( : , : ) : : A, B , Ci n t e g e r : : N, M, Lrea l , d ev i ce , a l l o c a t a b l e , dimension ( : , : ) : : Adev , Bdev , Cdevtype ( dim3 ) : : dimGrid , dimBlockN = s i z e (A, 1 ) ; M = s i z e (A, 2 ) ; L = s i z e (B , 2 )a l l o c a t e ( Adev (N,M) , Bdev (M, L ) , Cdev (N, L ) )Adev = A( 1 : N, 1 :M)Bdev = B ( 1 :M, 1 : L )dimGrid = dim3 ( N/ 1 6 , L / 1 6 , 1 )dimBlock = dim3 ( 16 , 16 , 1 )c a l l mmul_kernel <<<dimGrid , dimBlock >>>( Adev , Bdev , Cdev , N,M, L )C ( 1 : N, 1 :M) = Cdevd e a l l o c a t e ( Adev , Bdev , Cdev )

end s u b r o u t i n e�

http://www.pgroup.com/lit/articles/insider/v1n3a2.htm

http://www.pgroup.com/lit/articles/insider/v1n3a2.htm


GPU code:� �a t t r i b u t e s ( g l o b a l ) s u b r o u t i n e MMUL_KERNEL( A, B , C , N,M, L )

rea l , d e v i c e : : A(N,M) ,B(M, L ) ,C(N, L )i n t e g e r , v a l u e : : N,M, Li n t e g e r : : i , j , kb , k , tx , t yrea l , s h a r e d : : Ab ( 1 6 , 1 6 ) , Bb ( 1 6 , 1 6 )r e a l : : C i jt x = t h r e a d i d x%x ; t y = t h r e a d i d x%yi = ( b l o c k i d x%x−1) ∗ 16 + t xj = ( b l o c k i d x%y−1) ∗ 16 + t yC i j = 0 . 0do kb = 1 , M, 16

! Fe tch one e l e m e n t each i n t o Ab and Bb ; n o t e t h a t 16 x16 = 256! t h r e a d s i n t h i s t h read−b l o c k are f e t c h i n g s e p a r a t e e l e m e n t s! o f Ab and BbAb ( tx , t y ) = A( i , kb+ ty −1)Bb ( tx , t y ) = B( kb+ tx −1, j )! Wait u n t i l a l l e l e m e n t s o f Ab and Bb are f i l l e dc a l l s y n c t h r e a d s ( )do k = 1 , 16

C i j = C i j + Ab ( tx , k ) ∗ Bb ( k , t y )enddo! Wait u n t i l a l l t h r e a d s i n t h e th read−b l o c k f i n i s h w i t h


! t h i s i t e r a t i o n ’ s Ab and Bbc a l l s y n c t h r e a d s ( )

enddoC( i , j ) = C i jend s u b r o u t i n e�

11.3.2 OpenCL

Open Computing Language (OpenCL):

• Heterogeneous systems of

– CPUs (central processing units)

– GPUs (graphics processing units)

– DSPs (digital signal processors)

– FPGAs (field-programmable gate arrays)

– and other processors


• language (based on C99)

– kernels (functions that execute on OpenCL devices)

– plus application programming interfaces (APIs)

– ( fortrancl )

• parallel computing using

– task-based and data-based parallelism

– GPGPU

• open standard by Khronos Group (Apple, AMD, IBM, Intel and Nvidia)

– + adopted by Altera, Samsung, Vivante and ARM Holdings

Example: Matrix Multiplication


� �/∗ k e r n e l . c l∗ Ma tr i x m u l t i p l i c a t i o n : C = A ∗ B . ( ( Dev ice code ) )( h t t p : / / gpgpu−comput ing4 . b l o g s p o t . co . uk / 2 0 0 9 / 0 9 / ma t r i x−m u l t i p l i c a t i o n −2−o p e n c l . h tm l )

∗ // / OpenCL Ke rn e l_ _ k e r n e l voidmatr ixMul ( _ _ g l o b a l f l o a t ∗ C ,

_ _ g l o b a l f l o a t ∗ A,_ _ g l o b a l f l o a t ∗ B ,i n t wA, i n t wB)

{ / / 2D Thread ID/ / Old CUDA code/ / i n t t x = b l o c k I d x . x ∗ TILE_SIZE + t h r e a d I d x . x ;/ / i n t t y = b l o c k I d x . y ∗ TILE_SIZE + t h r e a d I d x . y ;i n t t x = g e t _ g l o b a l _ i d ( 0 ) ;i n t t y = g e t _ g l o b a l _ i d ( 1 ) ;/ / v a l u e s t o r e s t h e e l e m e n t t h a t i s/ / computed by t h e t h r e a df l o a t v a l u e = 0 ;f o r ( i n t k = 0 ; k < wA; ++k ){

f l o a t elementA = A[ t y ∗ wA + k ] ;f l o a t elementB = B[ k ∗ wB + t x ] ;


v a l u e += elementA ∗ elementB ;}/ / W r i t e t h e m a t r i x t o d e v i c e memory each/ / t h r e a d w r i t e s one e l e m e n tC[ t y ∗ wA + t x ] = v a l u e ;

}�

11.3.3 OpenACC

• standard (Cray, CAPS, Nvidia and PGI) to simplify parallel programming ofheterogeneous CPU/GPU systems

• using annotations (like OpenMP), C, C++, Fortran source code

• code started on both CPU and GPU automatically

• OpenACC to merge into OpenMP in a future release of OpenMP


� �! A s i m p l e OpenACC k e r n e l f o r M at r i x M u l t i p l i c a t i o n! $acc k e r n e l s

do k = 1 , n1do i = 1 , n3

c ( i , k ) = 0 . 0do j = 1 , n2

c ( i , k ) = c ( i , k ) + a ( i , j ) ∗ b ( j , k )enddo

enddoenddo

! $acc end k e r n e l s��

! h t t p : / / www. bu . edu / t e c h / r e s e a r c h / c o m p u t a t i o n / l i n u x−c l u s t e r / ka tana−c l u s t e r / gpu−comput ing / openacc−f o r t r a n / ma t r i x−m u l t i p l y−f o r t r a n /

program m a t r i x _ m u l t i p l yuse omp_l ibuse openacci m p l i c i t nonei n t e g e r : : i , j , k , myid , m, n , c o m p i l e d _ f o r , o p t i o ni n t e g e r , parameter : : f d = 11i n t e g e r : : t1 , t2 , d t , c o u n t _ r a t e , count_maxrea l , a l l o c a t a b l e , dimension ( : , : ) : : a , b , c


r e a l : : tmp , s e c sopen ( fd , f i l e = ’ w a l l c l o c k t i m e ’ , form= ’ f o r m a t t e d ’ )o p t i o n = c o m p i l e d _ f o r ( fd ) ! 1− s e r i a l , 2−OpenMP , 3−OpenACC , 4−bo th

! $omp p a r a l l e l! $ myid = OMP_GET_THREAD_NUM( )! $ i f ( myid . eq . 0 ) t h e n! $ w r i t e ( fd , " ( ’ Number o f p r o c s i s ’ , i 4 ) " ) OMP_GET_NUM_THREADS( )! $ e n d i f! $omp end p a r a l l e l

c a l l sys t em_c lock ( count_max=count_max , c o u n t _ r a t e = c o u n t _ r a t e )do m=1 ,4 ! compute f o r d i f f e r e n t s i z e m a t r i x m u l t i p l i e s

c a l l sys t em_c lock ( t 1 )n = 1000∗2∗∗(m−1) ! 1000 , 2000 , 4000 , 8000a l l o c a t e ( a ( n , n ) , b ( n , n ) , c ( n , n ) )

! I n i t i a l i z e m a t r i c e sdo j =1 , n

do i =1 , na ( i , j ) = r e a l ( i + j )b ( i , j ) = r e a l ( i − j )

enddoenddo

! $omp p a r a l l e l do s h a r e d ( a , b , c , n , tmp ) r e d u c t i o n ( + : tmp )! $ acc d a t a co py in ( a , b ) copy ( c )! $ acc k e r n e l s


! Compute m a t r i x m u l t i p l i c a t i o n .do j =1 , n

do i =1 , ntmp = 0 . 0 ! e n a b l e s ACC p a r a l l e l i s m f o r k−l oopdo k =1 , n

tmp = tmp + a ( i , k ) ∗ b ( k , j )enddoc ( i , j ) = tmp

enddoenddo

! $ acc end k e r n e l s! $ acc end d a t a! $omp end p a r a l l e l do

c a l l sys t em_c lock ( t 2 )d t = t2−t 1s e c s = r e a l ( d t ) / r e a l ( c o u n t _ r a t e )w r i t e ( fd , " ( ’ For n = ’ , i4 , ’ , w a l l c l o c k t ime i s ’ , f12 . 2 , ’ seconds ’ ) " ) &

n , s e c sd e a l l o c a t e ( a , b , c )

enddoc l o s e ( fd )

end program m a t r i x _ m u l t i p l y

i n t e g e r f u n c t i o n c o m p i l e d _ f o r ( fd )


i m p l i c i t nonei n t e g e r : : f d# i f d e f i n e d _OPENMP && d e f i n e d _OPENACC

c o m p i l e d _ f o r = 4w r i t e ( fd , " ( ’ Th i s code i s compi l ed wi th OpenMP & OpenACC ’ ) " )

# e l i f d e f i n e d _OPENACCc o m p i l e d _ f o r = 3w r i t e ( fd , " ( ’ Th i s code i s compi l ed wi th OpenACC ’ ) " )

# e l i f d e f i n e d _OPENMPc o m p i l e d _ f o r = 2w r i t e ( fd , " ( ’ Th i s code i s compi l ed wi th OpenMP ’ ) " )

# e l s ec o m p i l e d _ f o r = 1w r i t e ( fd , " ( ’ Th i s code i s compi l ed f o r s e r i a l o p e r a t i o n s ’ ) " )

# e n d i fend f u n c t i o n c o m p i l e d _ f o r�

237// program design10.3 11 parallel programming models · brid multicore parallel pro-gramming)...

Documents